
arXiv: 2403.19378
The repair problem for functional dependencies is the problem where an input database needs to be modified such that all functional dependencies are satisfied and the difference with the original database is minimal. The output database is then called a minimal-cost repair . If the allowed modifications are value updates, then finding a minimal-cost repair is NP-hard. A well-known approach to find approximations of minimal-cost repairs builds a Chase tree in which each internal node resolves violations of one functional dependency and leaf nodes represent repairs. A key property of this approach is that controlling the branching factor of the Chase tree allows to control the tradeoff between repair quality and computational efficiency. In this article, we explore an extreme variant of this idea in which the Chase tree has only one path. To construct this path, we first create an ordered partition of attributes (i.e., a partition of which the classes are totally ordered) such that classes can be repaired sequentially. We repair each class only once and do so by fixing the order in which dependencies are repaired. This principle is called priority repairing , and we provide a simple heuristic to determine priority. The techniques for attribute partitioning and priority repair are combined in an algorithm called Swipe. An empirical study on four real-life datasets shows that Swipe is one to three orders of magnitude faster than Llunatic and HoloClean, whereas the quality of repairs is comparable or better. A scalability analysis shows that Swipe scales linearly for an increasing number of tuples and quadratically for an increasing number of FDs.
FOS: Computer and information sciences, Technology and Engineering, Computer Science - Databases, FUNCTIONAL-DEPENDENCIES, Data quality, REPAIRS, VIOLATIONS, Databases (cs.DB), functional dependencies, data cleaning, error detection and repair
FOS: Computer and information sciences, Technology and Engineering, Computer Science - Databases, FUNCTIONAL-DEPENDENCIES, Data quality, REPAIRS, VIOLATIONS, Databases (cs.DB), functional dependencies, data cleaning, error detection and repair
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
