Publication: Article, Conference object
Date: 21 Mar 2011
Publisher: ACM
Journal: Proceedings of the 14th International Conference on Extending Database Technology

Authors: Schlegel, Benjamin; Gemulla, Rainer; Lehner, Wolfgang

doi: 10.1145/1951365.1951420

Memory-efficient frequent-itemset mining

Abstract

Efficient discovery of frequent itemsets in large datasets is a key component of many data mining tasks. In-core algorithms, which operate entirely in main memory and avoid expensive disk accesses, and in particular the prefix tree-based algorithm FP-growth, are generally among the most efficient of the available algorithms. Unfortunately, their excessive memory requirements render them inapplicable for large datasets with many distinct items and/or itemsets of high cardinality. To overcome this limitation, we propose two novel data structures, the CFP-tree and the CFP-array, which reduce memory consumption by about an order of magnitude. This allows us to process significantly larger datasets in main memory than previously possible. Our data structures are based on structural modifications of the prefix tree that increase compressibility, an optimized physical representation, lightweight compression techniques, and intelligent node ordering and indexing. Experiments with both real-world and synthetic datasets show the effectiveness of our approach.
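
For readers unfamiliar with the prefix tree that FP-growth builds, the following is a minimal sketch of a plain, uncompressed prefix tree over a handful of transactions. It is not the CFP-tree or the CFP-array, whose layout the abstract does not specify; the names `Node`, `build_prefix_tree`, and `count_nodes`, as well as the sample transactions and the `min_support` value, are assumptions made for illustration only.

```python
from collections import Counter


class Node:
    """One prefix-tree node: an item, a count, and links to children."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_prefix_tree(transactions, min_support):
    """Build a plain FP-tree-style prefix tree (illustrative sketch only)."""
    # Count in how many transactions each item occurs.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in support.items() if c >= min_support}

    root = Node()
    for t in transactions:
        # Keep only frequent items, ordered by descending support
        # (ties broken by name), so transactions share common prefixes.
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, support


def count_nodes(node):
    """Number of nodes below (and excluding) the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical sample data; item "d" is infrequent and gets dropped.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c", "d"],
]
root, support = build_prefix_tree(transactions, min_support=2)
print(count_nodes(root))
```

Each node here is a full Python object holding an item, a counter, and a dictionary of children. Per-node overhead of this kind is the sort of memory cost that, according to the abstract, the proposed structures reduce through a compact physical representation and lightweight compression.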

Related Organizations

TU Dresden
Germany
Max Planck Society
Germany

Keywords

ddc:004, Algorithms, Performance, Database Management, Database Applications, Data mining, Data structures

Impact by BIP!

selected citations: 25 (citations derived from selected sources; an alternative to the "influence" indicator)
popularity: Top 10% (the "current" impact or attention of the article in the research community at large, based on the underlying citation network)
influence: Top 10% (the overall impact of the article in the research community at large, based on the underlying citation network, diachronically)
impulse: Top 10% (the initial momentum of the article directly after its publication, based on the underlying citation network)

Green

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Related to Research communities

EUTOPIA Open Research Portal