Publication . Conference object . 2017

news-please

Hamborg, Felix; Meuschke, Norman; Breitinger, Corinna; Gipp, Bela
Open Access . English
Published: 24 Mar 2017
Publisher: Humboldt-Universität zu Berlin
Country: Germany
Abstract
The amount of news published and read online has increased tremendously in recent years, making news data an interesting resource for many research disciplines, such as the social sciences and linguistics. However, large-scale collection of news data is cumbersome because generic tools for crawling and extracting such data are lacking. We present news-please, a generic, multi-language, open-source crawler and extractor for news that works out of the box for a large variety of news websites. Our system can crawl arbitrary news websites and extract the major elements of the news articles on them, i.e., title, lead paragraph, main content, publication date, author, and main image. Compared to existing tools, news-please features full website extraction requiring only the root URL.
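The six article elements named in the abstract can be pictured as a simple record. The sketch below is not news-please's implementation or API. It is a standard-library-only illustration that assumes a page exposes Open Graph-style meta tags and plain paragraph tags, and the names Article, META_FIELDS, MetaExtractor, extract_article, and SAMPLE_HTML are hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


@dataclass
class Article:
    """Hypothetical container for the six elements listed in the abstract."""
    title: Optional[str] = None
    lead_paragraph: Optional[str] = None
    main_content: str = ""
    publication_date: Optional[str] = None
    author: Optional[str] = None
    main_image: Optional[str] = None


# Assumed, simplified mapping from common <meta> keys to Article fields.
META_FIELDS = {
    "og:title": "title",
    "og:description": "lead_paragraph",
    "article:published_time": "publication_date",
    "author": "author",
    "og:image": "main_image",
}


class MetaExtractor(HTMLParser):
    """Collects mapped <meta> values and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.article = Article()
        self.paragraphs = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            key = attributes.get("property") or attributes.get("name")
            field_name = META_FIELDS.get(key)
            if field_name and attributes.get("content"):
                setattr(self.article, field_name, attributes["content"])
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_article(html: str) -> Article:
    """Parse one HTML document and return the extracted Article record."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    # Join non-empty paragraphs as a crude stand-in for main-content extraction.
    parser.article.main_content = "\n".join(
        p.strip() for p in parser.paragraphs if p.strip()
    )
    return parser.article


# Made-up sample page, used only to exercise the sketch.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example headline">
<meta property="og:description" content="Example lead paragraph.">
<meta property="article:published_time" content="2017-03-24">
<meta name="author" content="Jane Doe">
<meta property="og:image" content="https://example.org/image.jpg">
</head><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_article(SAMPLE_HTML))
```

A real extractor has to cope with pages that lack such metadata and with boilerplate paragraphs (navigation, comments, advertisements), which this sketch does not attempt to handle.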
Subjects by Vocabulary

Dewey Decimal Classification: ddc:020

ACM Computing Classification System: Information Systems, Information Storage and Retrieval; Information Systems, Miscellaneous

Subjects

news crawler, news extractor, scraper, information extraction, 020 Bibliotheks- und Informationswissenschaft (library and information science)
