Term frequency with average term occurrences for textual information retrieval

Article English OPEN
Ibrahim, O. ; Landa-Silva, Dario (2016)

In the context of Information Retrieval (IR) from text documents, the term-weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model (VSM). In this paper we propose a new TWS that is based on computing the average occurrences of terms in documents and that also uses a discriminative approach, based on the document centroid vector, to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, as achieving this is expensive and may be infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposal is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring prior knowledge of relevance judgements. We compare the performance of the proposed TF-ATO to the well-known TF-IDF approach and show that using TF-ATO results in better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-word removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-word removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, it is shown that the proposed discriminative approach improves IR effectiveness and performance without using any relevance judgement information for the collection.
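The abstract names the two ingredients of TF-ATO but does not give formulas, so the following Python sketch is only one plausible reading, not the authors' definition. It assumes that a term's weight is its raw count divided by the document's average count per distinct term, that the centroid is the per-term mean weight over all documents, and that the discriminative step drops any weight below the centroid's weight for that term. The function names `tf_ato`, `centroid`, and `discriminate` are hypothetical.

```python
from collections import Counter, defaultdict


def tf_ato(doc_tokens):
    """Assumed weighting: term count / average count per distinct term."""
    counts = Counter(doc_tokens)
    if not counts:
        return {}
    # Average term occurrence: total tokens over number of distinct terms.
    ato = sum(counts.values()) / len(counts)
    return {term: tf / ato for term, tf in counts.items()}


def centroid(weighted_docs):
    """Per-term mean weight over all documents (absent terms count as 0)."""
    totals = defaultdict(float)
    for weights in weighted_docs:
        for term, w in weights.items():
            totals[term] += w
    n = len(weighted_docs)
    return {term: total / n for term, total in totals.items()}


def discriminate(weighted_docs):
    """Assumed rule: keep a weight only if it is >= the centroid weight."""
    c = centroid(weighted_docs)
    return [
        {term: w for term, w in weights.items() if w >= c[term]}
        for weights in weighted_docs
    ]


# Toy collection of two tokenised documents.
docs = [["a", "a", "b"], ["b", "c"]]
weighted = [tf_ato(d) for d in docs]
filtered = discriminate(weighted)
```

Note that neither step in this sketch consults relevance judgements, which matches the advantage the abstract claims for the discriminative approach.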