
XPlode is an algorithm for explaining a set of modifications that a user made in order to clean data. Each explanation takes the form of a Conditional Functional Dependency (CFD). This source code accompanies the paper "Explaining Repaired Data with CFDs".

*Abstract* Many popular data cleaning approaches are rule-based: constraints are formulated in a logical framework, and data is considered dirty if constraints are violated. These constraints are often discovered from data, but to ascertain their validity, user verification is necessary. Since the full set of discovered constraints is typically too large for manual inspection, recent research integrates user feedback into the discovery process. We propose a different approach that employs user interaction only at the start of the algorithm: a user manually cleans a small set of dirty tuples, and we infer the constraint underlying those repairs, called an explanation. We make use of conditional functional dependencies (CFDs) as the constraint formalism. We introduce XPlode, an on-demand algorithm which discovers the best explanation for a given repair. Guided by this explanation, data can then be cleaned using state-of-the-art CFD-based cleaning algorithms. Experiments on synthetic and real-world datasets show that the best explanation can typically be inferred using a limited number of modifications. Moreover, XPlode is substantially faster than discovering all CFDs that hold on a dataset, and is robust to noise in the modifications.
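To make the notion of "data is considered dirty if constraints are violated" concrete, the following is a minimal sketch of checking a single CFD against a small table. It is not the XPlode algorithm and is not taken from this repository; the names `WILDCARD`, `matches`, `cfd_violations`, and the example `rows` are all hypothetical and chosen only for illustration. A CFD is modelled here as a pattern over left-hand-side attributes (constants or a wildcard) plus one right-hand-side attribute with an optional constant.

```python
from collections import defaultdict

WILDCARD = "_"  # hypothetical marker for "any value" in a pattern


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def cfd_violations(rows, lhs, rhs_attr, rhs_value=WILDCARD):
    """Return sorted indices of rows violating the CFD (lhs -> rhs_attr).

    lhs maps attribute names to a constant or WILDCARD.
    If rhs_value is a constant, each matching row must carry that value.
    Otherwise, matching rows that agree on all lhs attributes must also
    agree on rhs_attr; every row of a disagreeing group is reported.
    """
    violating = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if not matches(row, lhs):
            continue  # the CFD only constrains rows that match its pattern
        if rhs_value != WILDCARD:
            if row[rhs_attr] != rhs_value:
                violating.add(i)
        else:
            key = tuple(row[a] for a in sorted(lhs))
            groups[key].append(i)
    for idxs in groups.values():
        if len({rows[i][rhs_attr] for i in idxs}) > 1:
            violating.update(idxs)
    return sorted(violating)


# Illustrative data only; not from the paper's datasets.
rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "Edinburg"},
    {"country": "NL", "zip": "EH4", "city": "Utrecht"},
    {"country": "UK", "zip": "W1", "city": "London"},
]

# Variable CFD: within the UK, zip determines city.
variable_cfd = cfd_violations(rows, {"country": "UK", "zip": WILDCARD}, "city")

# Constant CFD: UK rows with zip EH4 must have city Edinburgh.
constant_cfd = cfd_violations(
    rows, {"country": "UK", "zip": "EH4"}, "city", "Edinburgh"
)
```

Under these assumptions, the variable CFD flags both conflicting UK rows because it cannot tell which one is wrong, while the constant CFD pins down the single row whose city differs from the required value. The NL row is ignored in both cases because it does not match the pattern's condition.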
data-cleaning, Computer Science, functional-dependency, data-quality, Capsule
