Sparse Computation for Large-Scale Data Mining

descriptionPublicationkeyboard_double_arrow_right Article , Conference object 01 Oct 2014Publisher:Institute of Electrical and Electronics Engineers (IEEE)Journal:IEEE Transactions on Big Data, volume 2, pages 151-174 (eissn: 2332-7790,

Copyright policy )

Authors: Dorit S. Hochbaum; Philipp Baumann;

doi: 10.1109/tbdata.2016.2576470 , 10.1109/bigdata.2014.7004252 , 10.7892/boris.88124

Sparse Computation for Large-Scale Data Mining

- Summary
- Metrics

Abstract

Leading machine learning techniques rely on inputs in the form of pairwise similarities between objects in the data set. The number of pairwise similarities grows quadratically in the size of the data set which poses a challenge in terms of scalability. One way to achieve practical efficiency for similarity-based techniques is to sparsify the similarity matrix. However, existing sparsification approaches consider the complete similarity matrix and remove some of the non-zero entries. This requires quadratic time and storage and is thus intractable for large-scale data sets. We introduce here a method called sparse computation that generates a sparse similarity matrix which contains only relevant similarities without computing first all pairwise similarities. The relevant similarities are identified by projecting the data onto a low-dimensional space in which groups of objects that share the same grid neighborhood are deemed of potential high similarity whereas pairs of objects that do not share a neighborhood are considered to be dissimilar and thus their similarities are not computed. The projection is performed efficiently even for massively large data sets. We apply sparse computation for the $K$ -nearest neighbors algorithm (KNN), for graph-based machine learning techniques of supervised normalized cut and K-supervised normalized cut (SNC and KSNC) and for support vector machines with radial basis function kernels (SVM), on real-world classification problems. Our empirical results show that the approach achieves a significant reduction in the density of the similarity matrix, resulting in a substantial reduction in tuning and testing times, while having a minimal effect (and often none) on accuracy. The low-dimensional projection is of further use in massively large data sets where the grid structure allows to easily identify groups of “almost identical” objects. Such groups of objects are then replaced by representatives, thus reducing the size of the matrix. This approach is effective, as illustrated here for data sets comprising up to 8.5 million objects.

Related Organizations

University of California, Berkeley
United States
University of Bern
Switzerland

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	13
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Top 10%
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Top 10%
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average