Powered by OpenAIRE graph
Big Data Research
Article
Data sources: UnpayWall
Big Data Research
Article . 2016 . Peer-reviewed
License: Elsevier TDM
Data sources: Crossref
DBLP
Article . 2020
Data sources: DBLP

Approximate Parallel High Utility Itemset Mining

Authors: Yan Chen 0021; Aijun An

Abstract

High utility itemset mining discovers itemsets whose utility is above a given threshold, where utility measures the importance of an itemset. It overcomes a limitation of frequent pattern mining, which uses frequency as its only quality measure. Many algorithms have been proposed to speed up high utility itemset mining, usually by optimizing the candidate generation process. However, memory and running-time limitations still cause scalability issues, especially when the dataset is very large. This paper addresses the problem with a distributed parallel algorithm, PHUI-Miner, and a sampling strategy, which can be used either separately or together. PHUI-Miner parallelizes the state-of-the-art high utility itemset mining algorithm HUI-Miner: the search space of the mining problem is divided and assigned to the nodes of a cluster, which splits the workload. The sampling strategy determines the sample size required to achieve a given accuracy. The sample size is selected based on a new theorem, which provides a theoretical guarantee on the accuracy of the results. We also propose an approach that combines sampling with PHUI-Miner; it mines an approximate set of results but can provide better time performance. Our experiments show that PHUI-Miner performs well on different datasets and outperforms the state-of-the-art non-parallel algorithm HUI-Miner. In practice, the sampling strategy achieves accuracies much higher than the theoretical guarantee. Extensive experiments also compare the time performance of PHUI-Miner with and without sampling.
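
To make the problem statement in the abstract concrete, the following is a minimal Python sketch of what "utility above a threshold" means. It is not PHUI-Miner or HUI-Miner. It assumes the common formulation in which an item's utility in a transaction is its quantity times a unit profit, and it treats the threshold as inclusive. The toy data and the names `transactions`, `unit_profit`, `itemset_utility`, and `high_utility_itemsets` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
]
# Hypothetical unit profit per item (an assumption, not from the paper).
unit_profit = {"a": 5, "b": 2, "c": 1}


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over every transaction containing the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_utility):
    """Brute-force enumeration of all itemsets; exponential, for illustration only."""
    items = sorted(unit_profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = itemset_utility(itemset, transactions, unit_profit)
            if u >= min_utility:  # inclusive threshold (assumption)
                result[itemset] = u
    return result


print(high_utility_itemsets(transactions, unit_profit, min_utility=10))
```

In this toy data the itemset `("c",)` has utility 4 while its superset `("a", "c")` has utility 11, so utility is not monotone under set inclusion. This is one reason frequency-based pruning does not carry over directly and why exhaustive enumeration like the above does not scale.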

  • Impact by BIP!
    selected citations: 46
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    popularity: Top 10%
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    influence: Top 10%
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    impulse: Top 10%
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.