Cleaning Data with Forbidden Itemsets

descriptionPublicationkeyboard_double_arrow_right Article 01 Aug 2020 Belgium Publisher:Institute of Electrical and Electronics Engineers (IEEE)Journal:IEEE Transactions on Knowledge and Data Engineering, volume 32, pages 1,489-1,501 (issn: 1041-4347, eissn: 2326-3865,

Copyright policy )

Authors: Joeri Rammelaere; Floris Geerts;

doi: 10.1109/tkde.2019.2905548

handle: 10067/1706200151162165141

Cleaning Data with Forbidden Itemsets

- Summary
- Subjects
- Metrics

Abstract

Methods for cleaning dirty data typically employ additional information about the data such as user-provided constraints specifying when data is dirty, e.g., domain restrictions, illegal value combinations, or logical rules. However, real-world scenarios usually only have dirty data available, without known constraints. In such settings, constraints are automatically discovered on dirty data and discovered constraints are used to detect and repair errors. Typical repairing processes stop there. Yet, when constraint discovery algorithms are re-run on the repaired data (assumed to be clean), new constraints and thus errors are often found. The repairing process then introduces new constraint violations. We present a different type of repairing method, which prevents introducing new constraint violations, according to a discovery algorithm. Summarily, our repairs guarantee that all errors identified by constraints discovered on the dirty data are fixed; and the constraint discovery process cannot identify new constraint violations. We do this for a new kind of constraints, called forbidden itemsets (FBIs), capturing unlikely value co-occurrences. We show that FBIs detect errors with high precision. Evaluation on real-world data shows that our repair method obtains high-quality repairs without introducing new FBIs. Optional user interaction is readily integrated, with users deciding how much effort to invest.

Country

Belgium

Related Organizations

University of Antwerp
Belgium

Keywords

Computer. Automation

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	13
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Top 10%
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Top 10%
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Top 10%

Found an issue? Give us feedback

13

Top 10%

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Upload OA version

Are you the author of this publication? Upload your Open Access version to Zenodo!

It’s fast and easy, just two clicks!

uploadUpload now