Publication . Article . 2019

An intrinsic evaluation of the Waterloo spam rankings of the ClueWeb09 and ClueWeb12 datasets

İbrahim Barış Yılmazel; Ahmet Arslan
Closed Access
Published: 08 Aug 2019 Journal: Journal of Information Science, volume 47, pages 41-57 (issn: 0165-5515, eissn: 1741-6485)
Publisher: SAGE Publications

The ClueWeb09 dataset and its successor, the ClueWeb12 dataset, are two of the largest collections of Web pages released by the Text REtrieval Conference (TREC). The ClueWeb datasets were used in various TREC tracks that ran from 2009 to 2017. Each year, approximately 50 new queries are released and a pool of Web pages is judged against these queries by human assessors as relevant, non-relevant or spam. In this article, a ground truth for binary classification (spam vs non-spam) is constructed from Web pages that are judged as spam or relevant, under the assumption that a Web page judged as relevant for any query cannot be spam. Based on this ground truth, we evaluate the classification performance of the Waterloo spam rankings (Fusion, Britney, GroupX and UK2006), which have traditionally been used to identify and filter spam pages in retrieval systems. Experimental results on standard binary classification evaluation measures suggest that Fusion (with threshold = 11%) performs best on the ClueWeb09 dataset. Analysis of the frequency distributions of relevant and spam documents over spam scores reveals that GroupX is the most powerful at identifying relevant documents, whereas Fusion is the most powerful at identifying spam documents. The results also confirm that the Fusion spam ranking is less effective on the ClueWeb12 dataset than on the ClueWeb09 dataset.
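
The evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that each judged page carries a spam percentile score from 0 to 99, that lower scores mean "more likely spam", and that a page is classified as spam when its score falls below the threshold. The `Judged` record, the toy `PAGES` data and the choice of measures (precision, recall, F1, Matthews correlation coefficient) are all hypothetical, since the abstract does not list the measures used.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Judged:
    doc_id: str        # hypothetical document identifier
    percentile: int    # spam percentile score, 0..99 (assumed: lower = spammier)
    is_spam: bool      # ground truth: True if judged spam, False if judged relevant


# Made-up toy data, only to exercise the functions below.
PAGES = [
    Judged("a", 3, True),
    Judged("b", 10, True),
    Judged("c", 40, True),
    Judged("d", 8, False),
    Judged("e", 55, False),
    Judged("f", 90, False),
]


def confusion(pages, threshold):
    """Return (tp, fp, tn, fn), treating spam as the positive class."""
    tp = fp = tn = fn = 0
    for page in pages:
        predicted_spam = page.percentile < threshold  # assumed decision rule
        if predicted_spam and page.is_spam:
            tp += 1
        elif predicted_spam and not page.is_spam:
            fp += 1
        elif not predicted_spam and page.is_spam:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Compute a few common binary classification measures."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Threshold 11 mirrors the "threshold = 11%" mentioned in the abstract.
    print(metrics(*confusion(PAGES, threshold=11)))
```

Sweeping `threshold` over 0 to 100 and comparing the resulting measures is one way a best-performing cut-off could be located under these assumptions.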

Subjects by Vocabulary

Microsoft Academic Graph classification: Web page; Spamdexing; Information retrieval; Text Retrieval Conference; Web retrieval; Successor cardinal; Computer science

ACM Computing Classification System: InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; ComputingMethodologies_PATTERNRECOGNITION


Library and Information Sciences, Information Systems