
arXiv: 0906.0391
We offer a theoretical validation of the curse of dimensionality in the pivot-based indexing of datasets for similarity search, by proving, in the framework of statistical learning, that in high dimensions no pivot-based indexing scheme can essentially outperform the linear scan. A study of the asymptotic performance of pivot-based indexing schemes is performed on a sequence of datasets modeled as samples $X_d$ picked in i.i.d. fashion from metric spaces $��_d$. We allow the size of the dataset $n=n_d$ to be such that $d$, the ``dimension'', is superlogarithmic but subpolynomial in $n$. The number of pivots is allowed to grow as $o(n/d)$. We pick the least restrictive cost model of similarity search where we count each distance calculation as a single computation and disregard the rest. We demonstrate that if the intrinsic dimension of the spaces $��_d$ in the sense of concentration of measure phenomenon is $O(d)$, then the performance of similarity search pivot-based indexes is asymptotically linear in $n$.
9 pp., 4 figures, latex 2e, a revised submission to the 2nd International Workshop on Similarity Search and Applications, 2009
FOS: Computer and information sciences, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS)
FOS: Computer and information sciences, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS)
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 19 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
