
Abstract
Missing values are common in high-throughput mass spectrometry data. Two strategies are available to address them: (i) eliminate or impute the missing values and then apply statistical methods that require complete data, or (ii) use statistical methods that account for missing values directly, without imputation (imputation-free methods). This study reviews the effect of sample size and percentage of missing values on statistical inference for multiple methods under these two strategies. As missingness increased, both imputation and imputation-free methods became less able to distinguish differentially from non-differentially regulated compounds in a two-group comparison study. Random forest and k-nearest neighbor imputation, each combined with a Wilcoxon test, performed well in statistical testing for up to 50% missingness, with little bias in the estimated effect size. Quantile regression imputation combined with a Wilcoxon test also gave good statistical testing outcomes but substantially distorted the difference in means between groups. None of the imputation-free methods performed consistently better in statistical testing than the imputation methods.
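As a rough illustration of the impute-then-test strategy described in the abstract, the following minimal sketch pairs k-nearest neighbor imputation with a two-group Wilcoxon test. It is not the authors' pipeline. The function name `knn_impute_then_wilcoxon`, the choice of scikit-learn's `KNNImputer` and SciPy's `mannwhitneyu` (the rank-sum form of the Wilcoxon test for two independent groups), the default `n_neighbors=5`, and the synthetic data with values removed completely at random are all assumptions made for this example.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.impute import KNNImputer


def knn_impute_then_wilcoxon(intensities, group_labels, n_neighbors=5):
    """Impute missing values with kNN, then test each compound between two groups.

    intensities: 2D array, rows = samples, columns = compounds, NaN = missing.
    group_labels: 1D array of 0/1 marking the two groups.
    Returns (imputed matrix, array of two-sided p-values, one per compound).
    """
    intensities = np.asarray(intensities, dtype=float)
    group_labels = np.asarray(group_labels)

    # Fill each missing entry from the n_neighbors most similar samples (rows).
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(intensities)

    group_a = imputed[group_labels == 0]
    group_b = imputed[group_labels == 1]

    # Wilcoxon rank-sum (Mann-Whitney U) test, one compound at a time.
    p_values = np.array([
        mannwhitneyu(group_a[:, j], group_b[:, j], alternative="two-sided").pvalue
        for j in range(imputed.shape[1])
    ])
    return imputed, p_values


# Synthetic demo (hypothetical data): 20 samples, 50 compounds, about 20% missing.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 10)
data = rng.normal(loc=10.0, scale=1.0, size=(20, 50))
data[labels == 1, 0] += 3.0              # compound 0 differs between groups
mask = rng.random(data.shape) < 0.2      # values removed completely at random
data[mask] = np.nan

imputed, p_values = knn_impute_then_wilcoxon(data, labels)
```

The p-values here are unadjusted; an analysis across many compounds would normally add a multiple-testing correction. Missingness in real mass spectrometry data is often related to abundance, so the random removal used in this demo is a simplification.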
Bioinformatics, Bioinformatics and Computational Biology, 610, Bioengineering, imputation, Mass Spectrometry, 510, missing data, Bias, 2.5 Research design and methodologies (aetiology), Genetics, Cluster Analysis, Other Information and Computing Sciences, Computation Theory and Mathematics, Biological Sciences, metabolomics, sample size, Research Design, Biochemistry and Cell Biology, Generic health relevance
