Improving the Boilerpipe Algorithm for Boilerplate Removal in News Articles Using HTML Tree Structure

Publication: Article, 01 Jul 2018. Publisher: Instituto Politecnico Nacional/Centro de Investigacion en Computacion. Journal: Computación y Sistemas, volume 22 (issn: 1405-5546, eissn: 2007-9737)

Authors: Francisco Viveros Jiménez; Miguel A. Sánchez-Pérez; Helena Gómez-Adorno; Juan Pablo Posadas-Durán; Grigori Sidorov; Alexander F. Gelbukh

doi: 10.13053/cys-22-2-2959

Abstract

It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded at random every time a page is requested, so the HTML document differs between visits and hashing its content is not possible. Therefore, we need to extract the relevant text of documents. The automatic extraction of relevant text from online text (news articles, etc.) is not a trivial task. Many algorithms for this purpose are described in the literature. Boilerpipe is one of the most popular ones, and its performance is among the best. In this paper, we improve the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. We conduct our experiments on news articles. We evaluated our approach by extracting news from English and Spanish websites and compared the results with other approaches. Our approach achieved better results than state-of-the-art approaches. We also present an analysis of our dataset confirming that the amount of relevant text is less than 40%.
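The hashing problem mentioned in the abstract can be illustrated with a minimal sketch. It is not the Boilerpipe algorithm and not the method proposed in the paper: it is a toy extractor that assumes the relevant text sits inside an `<article>` element, and the page template, the ad strings, and all function names are hypothetical. It only shows why hashing the raw HTML fails when ads change between requests, while hashing the extracted text stays stable.

```python
import hashlib
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Toy extractor (assumption): keep only text nested inside <article>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <article> elements we are currently inside
        self.chunks = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside <article> (ads, related links) is ignored.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article_text(html):
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_page(ad_text):
    # Hypothetical page: same story every time, different ad per request.
    return (
        "<html><body>"
        f"<div class='ad'>{ad_text}</div>"
        "<article><h1>Headline</h1><p>Body of the news story.</p></article>"
        "<ul class='related'><li>Related link</li></ul>"
        "</body></html>"
    )


page_a = make_page("Buy shoes")
page_b = make_page("Cheap flights")

# Hashes of the raw HTML differ because the ad differs.
raw_equal = sha256_hex(page_a) == sha256_hex(page_b)

# Hashes of the extracted text match because the story is the same.
text_equal = sha256_hex(extract_article_text(page_a)) == sha256_hex(
    extract_article_text(page_b)
)
```

Real news pages rarely mark their content this cleanly, which is why content extraction algorithms such as Boilerpipe are needed in the first place.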

Related Organizations

Instituto Politécnico Nacional
Mexico

Impact by BIP!

selected citations: 4
popularity: Average
influence: Average
impulse: Average

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
