
It is well-known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. More over, some of these ads are loaded in a randomized way every time you hit a page, so the HTML document will be different and hashing of the content will be not possible. Therefore, we need to extract the relevant text of documents. The automatic extraction of relevant text in on-line text (news articles, etc.) is not a trivial task. There are many algorithms for this purpose described in the literature. Boilerpipe is one of the most popular one sand its performance is one of the best. In this paper, we improve the precision of the Boilerpipe algorithm using the HTML tree for selection of the relevant content. We make the experiments for the news articles. We evaluated our approach by extracting news from English and Spanish websites and compared the results with other approaches. Our approach achieved better results than approaches from the state-of-the-art. We also present an analysis of our dataset confirming that the amount of relevant text is less than 40%.
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 4 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
