Publication . Conference object . 2010

Using anchor text, spam filtering and Wikipedia for web search and entity ranking

Kamps, J.; Kaptein, R.; Koolen, M.; Voorhees, E.M.; Buckland, L.P.
Open Access
English
Published: 01 Jan 2010
Publisher: National Institute of Standards and Technology
Country: Netherlands
Abstract

In this paper, we document our participation in the TREC 2010 Entity Ranking and Web Tracks. We had multiple aims. For the Web Track, we wanted to compare the effectiveness of anchor text of the category A and B collections, and to measure the impact of global document quality measures such as PageRank and spam scores. We find that documents in ClueWeb09 category B have a higher probability of being retrieved than other documents in category A. In ClueWeb09 category B, spam is mainly an issue for full-text retrieval; anchor text suffers little from spam. Spam scores can be used to filter spam but also to find key resources: documents that are least likely to be spam tend to be high-quality results. For the Entity Ranking Track, we use Wikipedia as a pivot to find relevant entities on the Web. Using category information to retrieve entities within Wikipedia leads to large improvements over our baseline run that does not use category information, although our best scores are still weak. Following the external links on Wikipedia pages to find the homepages of the entities in the ClueWeb collection works better than either searching an anchor text index or combining the external links with an anchor text search.
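The two uses of spam scores mentioned in the abstract (filtering spam and surfacing key resources) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each retrieved document carries a percentile-style spam score where a higher value means the page is less likely to be spam; the thresholds 70 and 90, the `Result` type, and the document identifiers are all hypothetical choices made for illustration.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    doc_id: str           # hypothetical document identifier
    score: float          # retrieval score from any underlying ranker
    spam_percentile: int  # assumed scale: 0 = most spam-like, 99 = least spam-like


def filter_spam(results: List[Result], min_percentile: int = 70) -> List[Result]:
    """Drop results below the spam threshold, preserving the original rank order."""
    return [r for r in results if r.spam_percentile >= min_percentile]


def key_resources(results: List[Result], min_percentile: int = 90) -> List[Result]:
    """Keep only the documents least likely to be spam, as candidate key resources."""
    return [r for r in results if r.spam_percentile >= min_percentile]


# A toy ranked list, ordered by retrieval score (illustrative values only).
ranked = [
    Result("doc-a", 12.3, 15),
    Result("doc-b", 11.8, 92),
    Result("doc-c", 10.1, 70),
    Result("doc-d", 9.7, 44),
]

clean = filter_spam(ranked)
best = key_resources(ranked)
print([r.doc_id for r in clean])
print([r.doc_id for r in best])
```

Both functions are plain threshold filters over the same score. The only difference is how strict the cutoff is, which mirrors the abstract's observation that the same signal serves both purposes.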

Subjects by Vocabulary

ACM Computing Classification System: Information Systems, Information Storage and Retrieval

Funded by
NWO | EfFoRT - Effective Focused Retrieval Techniques
  • Funder: Netherlands Organisation for Scientific Research (NWO)
  • Project Code: 2300132503
NWO | README - Retrieving Encoded Archival Descriptions More Effectively
  • Funder: Netherlands Organisation for Scientific Research (NWO)
  • Project Code: 2300134704
NWO | MuSeUM - Multiple-collection Searching Using Metadata
  • Funder: Netherlands Organisation for Scientific Research (NWO)
  • Project Code: 2300129448
Providers: NARCIS