publication . Other literature type . Conference object . 2010

Boilerplate detection using shallow text features

Christian Kohlschütter; Peter Fankhauser; Wolfgang Nejdl
Open Access
  • Published: 01 Jan 2010
  • Publisher: Association for Computing Machinery (ACM)
Abstract
In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text is typically unrelated to the main content, may degrade search precision, and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impa...
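
The idea of classifying text elements by shallow features can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not list the features or the classifier, so the two features used here (word count and link density), the function names, and the threshold values are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class BlockFeatures:
    """Shallow features of one text block (illustrative choice, not from the paper)."""
    num_words: int
    link_density: float  # fraction of the block's words that sit inside anchor text


def shallow_features(text: str, anchor_text: str = "") -> BlockFeatures:
    """Compute word count and link density for a block.

    `text` is the block's full visible text; `anchor_text` is the part of it
    that appeared inside links. Both parameters are hypothetical inputs that a
    real HTML parser would have to supply.
    """
    words = re.findall(r"\w+", text)
    linked = re.findall(r"\w+", anchor_text)
    n = len(words)
    density = min(len(linked) / n, 1.0) if n else 0.0
    return BlockFeatures(num_words=n, link_density=density)


def looks_like_boilerplate(
    f: BlockFeatures,
    min_words: int = 10,             # assumed threshold, not taken from the paper
    max_link_density: float = 0.33,  # assumed threshold, not taken from the paper
) -> bool:
    """Flag short or link-heavy blocks as likely boilerplate."""
    return f.num_words < min_words or f.link_density > max_link_density


nav = shallow_features("Home About Contact", anchor_text="Home About Contact")
body = shallow_features(
    "In this paper we analyze a small set of shallow text features "
    "for classifying text elements"
)
print(looks_like_boilerplate(nav), looks_like_boilerplate(body))
```

In this sketch a navigation bar is flagged because it is short and consists entirely of link text, while a longer block of running prose without links is kept. A trained classifier over such features would replace the hand-set thresholds.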
Subjects
Free-text keywords: Information retrieval, Template, Boilerplate text, Heuristics, Web page, Stochastic modelling, Small set, Computer science