publication . Conference object . Preprint . 2016

Let's Replace Term-Based Retrieval with k-NN Search

Boytsov, Leonid; Novak, David; Malkov, Yury; Nyberg, Eric;
Open Access
  • Published: 31 Oct 2016
  • Publisher: ACM
Abstract
Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective a...
Subjects
free text keywords: Software, business.industry, business, Computer science, Theoretical computer science, Stack overflow, Pipeline transport, Vocabulary mismatch, Retrieval algorithm, Data mining, computer.software_genre, computer, Information retrieval, Computer Science - Information Retrieval
Related Organizations
77 references, page 1 of 6

[1] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal lsh for angular distance. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1225-1233. Curran Associates, Inc., 2015.

[2] S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, pages 271-280. Society for Industrial and Applied Mathematics, 1993.

[3] V. Athitsos, M. Potamias, P. Papapetrou, and G. Kollios. Nearest neighbor retrieval using distance-based hashing. In 2008 IEEE 24th International Conference on Data Engineering, pages 327-336. IEEE, 2008. [OpenAIRE]

[4] A. Berger, R. Caruana, D. Cohn, D. Freitag, and V. Mittal. Bridging the lexical chasm: Statistical approaches to answer-finding. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '00, pages 192-199, New York, NY, USA, 2000. ACM.

[5] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest neighbor” meaningful? In Database theory-ICDT'99, pages 217-235. Springer, 1999.

[6] M. W. Bilotti, P. Ogilvie, J. Callan, and E. Nyberg. Structured retrieval for question answering. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 351-358. ACM, 2007.

[7] L. Boytsov and B. Naidan. Engineering efficient and effective non-metric space library. In Similarity Search and Applications, pages 280-293. Springer, 2013. [OpenAIRE]

[8] A. Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21-29. IEEE, 1997.

[9] G.-I. Brokos, P. Malakasiotis, and I. Androutsopoulos. Using centroids of word embeddings and word moverâA˘ Z´ s distance for biomedical document retrieval in question answering. In Proceedings of the BIONLP 2016 Workshop, 2016. [OpenAIRE]

[10] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics, 19(2):263-311, 1993.

[11] D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 43-50, New York, NY, USA, 2001. ACM. [OpenAIRE]

[12] C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1-1:50, Jan. 2012.

[13] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Comput. Surv., 33(3):273-321, 2001.

[14] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493-2537, Nov. 2011. [OpenAIRE]

[15] S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pages 993-1002, New York, NY, USA, 2011. ACM. [OpenAIRE]

77 references, page 1 of 6
Abstract
Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective a...
Subjects
free text keywords: Software, business.industry, business, Computer science, Theoretical computer science, Stack overflow, Pipeline transport, Vocabulary mismatch, Retrieval algorithm, Data mining, computer.software_genre, computer, Information retrieval, Computer Science - Information Retrieval
Related Organizations
77 references, page 1 of 6

[1] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal lsh for angular distance. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1225-1233. Curran Associates, Inc., 2015.

[2] S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, pages 271-280. Society for Industrial and Applied Mathematics, 1993.

[3] V. Athitsos, M. Potamias, P. Papapetrou, and G. Kollios. Nearest neighbor retrieval using distance-based hashing. In 2008 IEEE 24th International Conference on Data Engineering, pages 327-336. IEEE, 2008. [OpenAIRE]

[4] A. Berger, R. Caruana, D. Cohn, D. Freitag, and V. Mittal. Bridging the lexical chasm: Statistical approaches to answer-finding. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '00, pages 192-199, New York, NY, USA, 2000. ACM.

[5] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest neighbor” meaningful? In Database theory-ICDT'99, pages 217-235. Springer, 1999.

[6] M. W. Bilotti, P. Ogilvie, J. Callan, and E. Nyberg. Structured retrieval for question answering. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 351-358. ACM, 2007.

[7] L. Boytsov and B. Naidan. Engineering efficient and effective non-metric space library. In Similarity Search and Applications, pages 280-293. Springer, 2013. [OpenAIRE]

[8] A. Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21-29. IEEE, 1997.

[9] G.-I. Brokos, P. Malakasiotis, and I. Androutsopoulos. Using centroids of word embeddings and word moverâA˘ Z´ s distance for biomedical document retrieval in question answering. In Proceedings of the BIONLP 2016 Workshop, 2016. [OpenAIRE]

[10] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics, 19(2):263-311, 1993.

[11] D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. S. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 43-50, New York, NY, USA, 2001. ACM. [OpenAIRE]

[12] C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1-1:50, Jan. 2012.

[13] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Comput. Surv., 33(3):273-321, 2001.

[14] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493-2537, Nov. 2011. [OpenAIRE]

[15] S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pages 993-1002, New York, NY, USA, 2011. ACM. [OpenAIRE]

77 references, page 1 of 6
Powered by OpenAIRE Open Research Graph
Any information missing or wrong?Report an Issue
publication . Conference object . Preprint . 2016

Let's Replace Term-Based Retrieval with k-NN Search

Boytsov, Leonid; Novak, David; Malkov, Yury; Nyberg, Eric;