A Comparison of Imputation Methods for Categorical Data

Publication: Article, 01 Jan 2023
Publisher: Elsevier BV
Journal: Informatics in Medicine Unlocked, volume 42, page 101382 (ISSN: 2352-9148)

Authors: Shaheen MZ. Memon; Robert Wamala; Ignace H. Kabano

doi: 10.2139/ssrn.4574180, 10.1016/j.imu.2023.101382, 10.2139/ssrn.4510943


Abstract

Objectives: Missing data are commonplace in clinical databases, which are increasingly used for research. When missing data are ignored, analysis results may become biased and unrepresentative. Clinical databases contain mainly categorical variables. This study assesses methods for imputing categorical variables.

Materials and methods: We utilized data extracted from paper-based maternal health records from Kawempe National Referral Hospital, Uganda. We compared the following imputation methods for categorical data in an empirical analysis: Mode, K-Nearest Neighbors (KNN), Random Forest (RF), Sequential Hot-Deck (SHD), and Multiple Imputation by Chained Equations (MICE). The five imputation methods were first compared by their accuracy in predicting the missing values. Next, they were compared by the predictive accuracy of the outcome variable in four classifiers. The consistency of the methods' performance across different levels of missing data (5%–50%) was assessed with Kendall's W test.

Results: KNN imputation had the highest precision score across the tested levels (5%–50%) of MCAR (missing completely at random) data. At lower proportions of missing data (5%, 10%, 15%, 20%), RF imputation had the second-highest precision score. SHD imputation had the worst precision at all levels of missing data. In the prediction of the outcome, the methods performed differently at all proportions of missing data in the four classifiers. Even though KNN imputation was the best method for predicting the missing values, it did not consistently enhance the predictive accuracy of the classifiers at all levels of missing data. Our findings show that a high precision score of an imputation method does not translate into higher predictive accuracy in classifiers.

Conclusions: KNN imputation is the best method for predicting missing values in categorical variables. There is no universally best imputation method that yields the highest predictive accuracy at all proportions of missing data.
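
The first comparison described in the abstract (mask known values, impute them, and score how many are recovered) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data, the helper names (`mask_mcar`, `impute_mode`, `impute_knn`, `imputation_accuracy`), the Hamming distance used to find neighbors, and the scoring rule (share of masked cells recovered exactly), which may differ from the precision score the authors report. Only mode and a simple KNN variant are shown.

```python
import random
from collections import Counter

def mask_mcar(rows, rate, rng):
    """Copy rows, setting each cell to None independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def impute_mode(rows):
    """Fill each missing cell with the most frequent observed value in its column."""
    modes = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        modes.append(Counter(observed).most_common(1)[0][0] if observed else None)
    return [[modes[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def hamming(a, b):
    """Share of mismatching categories over columns observed in both rows (None if no overlap)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)

def impute_knn(rows, k=5):
    """Fill each missing cell by majority vote among the k nearest rows that observe that column."""
    fallback = impute_mode(rows)  # used only when no donor row is available
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = hamming(row, other)
                if d is not None:
                    donors.append((d, other[j]))
            donors.sort(key=lambda t: t[0])  # nearest first; ties keep original row order
            votes = Counter(val for _, val in donors[:k])
            out[i][j] = votes.most_common(1)[0][0] if votes else fallback[i][j]
    return out

def imputation_accuracy(truth, masked, imputed):
    """Share of masked cells whose imputed category equals the true category."""
    hits = total = 0
    for t_row, m_row, p_row in zip(truth, masked, imputed):
        for t, m, p in zip(t_row, m_row, p_row):
            if m is None:
                total += 1
                hits += (p == t)
    return hits / total if total else float("nan")

# Synthetic categorical data (hypothetical): three columns echo the first one 80% of the time.
rng = random.Random(0)
levels = ["A", "B", "C"]
truth = []
for _ in range(200):
    group = rng.choice(levels)
    truth.append([group] + [group if rng.random() < 0.8 else rng.choice(levels) for _ in range(3)])

masked = mask_mcar(truth, rate=0.2, rng=rng)
results = {
    "mode": imputation_accuracy(truth, masked, impute_mode(masked)),
    "knn": imputation_accuracy(truth, masked, impute_knn(masked, k=5)),
}
print(results)
```

Repeating the masking step at several rates (for example 5% through 50%) and ranking the methods at each rate would give the kind of table to which a concordance statistic such as Kendall's W could then be applied.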

Related Organizations

University of Rwanda
Rwanda
Makerere University
Uganda

Keywords

Precision score, Single imputation, Computer applications to medicine. Medical informatics, Multiple imputation, R858-859.7, Imputation, Categorical variables

Impact by BIP!

Selected citations: 46
Popularity: Top 1%
Influence: Top 10%
Impulse: Top 1%

