Powered by OpenAIRE graph
Found an issue? Give us feedback
image/svg+xml Jakob Voss, based on art designer at PLoS, modified by Wikipedia users Nina and Beao Closed Access logo, derived from PLoS Open Access logo. This version with transparent background. http://commons.wikimedia.org/wiki/File:Closed_Access_logo_transparent.svg Jakob Voss, based on art designer at PLoS, modified by Wikipedia users Nina and Beao https://doi.org/10.1...arrow_drop_down
image/svg+xml Jakob Voss, based on art designer at PLoS, modified by Wikipedia users Nina and Beao Closed Access logo, derived from PLoS Open Access logo. This version with transparent background. http://commons.wikimedia.org/wiki/File:Closed_Access_logo_transparent.svg Jakob Voss, based on art designer at PLoS, modified by Wikipedia users Nina and Beao
DBLP
Conference object . 2022
Data sources: DBLP
versions View all 2 versions
addClaim

Unsupervised clustering of multidimensional distributions using earth mover distance

Authors: David L. Applegate; Tamraparni Dasu; Shankar Krishnan; Simon Urbanek;

Unsupervised clustering of multidimensional distributions using earth mover distance

Abstract

Multidimensional distributions are often used in data mining to describe and summarize different features of large datasets. It is natural to look for distinct classes in such datasets by clustering the data. A common approach entails the use of methods like k-means clustering. However, the k-means method inherently relies on the Euclidean metric in the embedded space and does not account for additional topology underlying the distribution.In this paper, we propose using Earth Mover Distance (EMD) to compare multidimensional distributions. For a n-bin histogram, the EMD is based on a solution to the transportation problem with time complexity O(n3 log n). To mitigate the high computational cost of EMD, we propose an approximation that reduces the cost to linear time. Given the large size of our dataset a fast approximation is crucial for this application.Other notions of distances such as the information theoretic Kullback-Leibler divergence and statistical χ2 distance, account only for the correspondence between bins with the same index, and do not use information across bins, and are sensitive to bin size. A cross-bin distance measure like EMD is not affected by binning differences and meaningfully matches the perceptual notion of "nearness".Our technique is simple, efficient and practical for clustering distributions. We demonstrate the use of EMD on a real-world application of analyzing 411,550 anonymous mobility usage patterns which are defined as distributions over a manifold. EMD allows us to represent inherent relationships in this space, and enables us to successfully cluster even sparse signatures.

Related Organizations
  • BIP!
    Impact byBIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    32
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Top 10%
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Top 10%
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Top 10%
Powered by OpenAIRE graph
Found an issue? Give us feedback
selected citations
These citations are derived from selected sources.
This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Citations provided by BIP!
popularity
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
BIP!Popularity provided by BIP!
influence
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Influence provided by BIP!
impulse
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
BIP!Impulse provided by BIP!
32
Top 10%
Top 10%
Top 10%
Upload OA version
Are you the author of this publication? Upload your Open Access version to Zenodo!
It’s fast and easy, just two clicks!