CountSketches, Feature Hashing and the Median of Three

Publication type: Article, Preprint, Conference object, Contribution for newspaper or weekly magazine. Date: 01 Jan 2021. Embargo end date: 01 Jan 2021. Country: Denmark. Publisher: arXiv. Journal: CoRR, volume abs/2102.02193.

Authors: Larsen, Kasper Green; Pagh, Rasmus; Tetek, Jakub

doi: 10.48550/arxiv.2102.02193

arXiv: 2102.02193

Abstract

In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1) s$, where $t, s > 0$ are integer parameters. It is known that even for $t=1$, a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2 \|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2,\|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$. We also study the variance in the setting where an inner product is to be estimated from two CountSketches. This finding suggests that the Feature Hashing method, which is essentially identical to CountSketch but does not make use of the median estimator, can be made more reliable at a small cost in settings where using a median estimator is possible. We confirm our theoretical findings in experiments and thereby help justify why a small constant number of estimates often suffices in practice. Our improved variance bounds are based on new general theorems about the variance and higher moments of the median of i.i.d. random variables that may be of independent interest.
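
To make the estimator in the abstract concrete, the following is a minimal Python sketch of a CountSketch with $2t-1$ rows of $s$ buckets each, where a coordinate is estimated as the median of the $2t-1$ per-row estimates. It is an illustration under stated assumptions, not the authors' implementation: the names `CountSketch`, `update`, `estimate`, and `sketch_vector` are hypothetical, and fully random lookup tables over a known dimension `d` stand in for the hash functions a real implementation would use.

```python
import random
import statistics


class CountSketch:
    """Toy CountSketch with 2t-1 rows of s buckets (illustrative only)."""

    def __init__(self, d, s, t, seed=0):
        rng = random.Random(seed)
        self.s = s
        self.rows = 2 * t - 1
        # Assumption: random tables replace hash functions, so d must be known.
        self.bucket = [[rng.randrange(s) for _ in range(d)]
                       for _ in range(self.rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(d)]
                     for _ in range(self.rows)]
        self.table = [[0.0] * s for _ in range(self.rows)]

    def update(self, i, delta):
        """Add delta to coordinate i of the sketched vector."""
        for r in range(self.rows):
            self.table[r][self.bucket[r][i]] += self.sign[r][i] * delta

    def estimate(self, i):
        """Median of the 2t-1 signed bucket values for coordinate i."""
        return statistics.median(
            self.sign[r][i] * self.table[r][self.bucket[r][i]]
            for r in range(self.rows)
        )


def sketch_vector(v, s, t, seed=0):
    """Sketch a dense vector v into a table of total size (2t-1) * s."""
    cs = CountSketch(len(v), s, t, seed)
    for i, x in enumerate(v):
        if x:
            cs.update(i, x)
    return cs


if __name__ == "__main__":
    rng = random.Random(1)
    v = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    cs = sketch_vector(v, s=64, t=2)  # median of three estimates
    print("true:", v[0], "estimate:", cs.estimate(0))
```

With `t=1` the sketch has a single row and `estimate` returns that row's signed bucket value, which matches the Feature Hashing setting described above; `t=2` gives the median of three.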

Country

Denmark

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), Machine Learning (stat.ML), Machine Learning (cs.LG)

Related research (8 research products)

- ℓp Subspace Embedding in Input Sparsity Time (2021)
- Sublinear update time randomized algorithms for dynamic graph regression (2021)
- pylspack: Parallel Algorithms and Data Structures for Sketching, Column Subset Selection, Regression, and Leverage Scores (2022)
- Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation (2023)
- Fast randomized matrix and tensor interpolative decomposition using CountSketch (2020)
- Universal Online Sketch for Tracking Heavy Hitters and Estimating Moments of Data Streams (2020)
- Beating CountSketch for heavy hitters in insertion streams (2016)
- How to Reduce Dimension With PCA and Random Projections? (2021)

Impact by BIP!

- Selected citations: 0
- Popularity: Average
- Influence: Average
- Impulse: Average

Fields of Science

natural sciences

computer and information sciences

Related to Research communities

UArctic