Large-Scale Mining of Co-occurrences: Challenges and Solutions

Authors: Ian Sandler; Alex Thomo

Abstract

The ability to extract frequent pairs from a set of baskets (or frequent word co-occurrences from a set of documents) is one of the fundamental building blocks of data mining. When the number of items in a given basket is relatively small, the problem is trivial. Even with millions of baskets it remains trivial, provided that the number of unique items in the basket set is small. The problem becomes much more challenging when we deal with millions of baskets, each containing hundreds of items drawn from a set of millions of potential items, especially when we are looking for highly correlated results at extremely low support levels. A particularly difficult case is when "items" are words and "baskets" are long documents in a very large text corpus. For 17 years, the Direct Hashing and Pruning algorithm of Park, Chen, and Yu (PCY) has been the principal technique used when there are billions of potential pairs that need to be counted. In this paper we show new approaches that allow us to take full advantage of both multi-core and multi-CPU setups for cases where PCY fails and Map-Reduce struggles, offering excellent performance scaling when the number of processors, unique items, and items per transaction are at their highest. We believe that our approaches have much broader applicability in the field of co-occurrence counting and can be used to generate much more interesting results when mining very large data sets.
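
For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal single-machine sketch of the PCY idea as it is commonly described: a first pass counts single items and hashes every pair into a small table of bucket counters, and a second pass counts only those pairs whose items are both frequent and whose bucket reached the support threshold. It is not the new approach proposed in the paper. The function name `pcy_frequent_pairs`, the parameter `num_buckets`, and the use of Python's built-in `hash` are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def pcy_frequent_pairs(baskets, min_support, num_buckets=1024):
    """Return {pair: count} for pairs found in at least min_support baskets.

    `baskets` must be a re-iterable collection (e.g. a list of lists),
    because the algorithm makes two passes over it.
    """
    # Pass 1: count single items and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate; fixed order within a pair
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % num_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    # Bitmap: True for buckets whose total count reaches the threshold.
    frequent_bucket = [c >= min_support for c in bucket_counts]

    # Pass 2: count only candidate pairs (both items frequent, bucket frequent).
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if frequent_bucket[hash(pair) % num_buckets]:
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= min_support}
```

The bucket filter can only produce false candidates, never miss a frequent pair, because a bucket's count is at least the count of any pair hashed into it. In the hard case the abstract describes (millions of distinct items, hundreds of items per basket, very low support), most buckets become frequent and the filter prunes very little, which is one plausible reading of why PCY is said to fail there.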

BIP! impact indicators
  • Selected citations: 1
  • Popularity: Average
  • Influence: Average
  • Impulse: Average