
Frequent itemset discovery is an important step in Association Rule Mining. The Frequent Pattern (FP) growth algorithm, often used for discovering frequent itemsets, cannot scale directly to today’s Big Data, especially for large sparse datasets. Hence there is a need to distribute and parallelize the FP-growth algorithm. Parallel FP-growth (PFP) is a parallel implementation of the FP-growth algorithm on Hadoop’s MapReduce execution framework. Though PFP scales to large datasets, it suffers from imbalanced load across processing units. In this paper we propose a heuristic based, lower order of complexity, load balancing strategy for the PFP algorithm, called Heuristic Based PFP (HBPFP). Our results show that HBPFP distributes the load more evenly across the Hadoop cluster nodes, runs faster than the PFP algorithm, and uses cluster resources more efficiently, especially for large sparse datasets.
Association rule mining, TK7885-7895, Computer engineering. Computer hardware, Frequent pattern growth algorithm, Hadoop, Electronic computers. Computer science, MapReduce, QA75.5-76.95, Load balancing
Association rule mining, TK7885-7895, Computer engineering. Computer hardware, Frequent pattern growth algorithm, Hadoop, Electronic computers. Computer science, MapReduce, QA75.5-76.95, Load balancing
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 17 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
