A Benchmark for Data Imputation Methods

descriptionPublicationkeyboard_double_arrow_right Article , Other literature type 08 Jul 2021Publisher:Frontiers Media SAJournal:Frontiers in Big Data, volume 4 (eissn: 2624-909X,

Copyright policy )

Authors: Sebastian Jäger; Arndt Allhorn; Felix Bießmann;

doi: 10.3389/fdata.2021.693674

pmid: 34308343

pmc: PMC8297389

A Benchmark for Data Imputation Methods

- Summary
- Subjects
- Metrics

Abstract

With the increasing importance and complexity of data pipelines, data quality became one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). Also, for machine learning (ML) applications, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and can have a devastating impact on downstream ML applications when not detected. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only test or train and test data are affected by missing data. Each imputation method is evaluated regarding the imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers to guide their data preprocessing method selection for automated data quality improvement.

Related Organizations

Beuth University of Applied Sciences
Germany

Keywords

Big Data, missing data, benchmark, MCAR, data quality, imputation, Information technology, T58.5-58.64, data cleaning

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	191
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Top 1%
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Top 1%
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Top 1%

Found an issue? Give us feedback

191

Top 1%

Green

gold

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering