Confound-leakage: confound removal in machine learning leads to leakage

Publication: Article, Conference object, Other literature type, Preprint

28 Dec 2022

Embargo end date: 01 Jan 2022

Germany, United Kingdom

English

Publisher: Oxford University Press (OUP)

Journal: GigaScience, volume 12 (eissn: 2047-217X)

Funded by: DFG | unidentified, UKRI | SWANS

Authors: Sami Hamdan; Bradley C Love; Georg G von Polier; Susanne Weis; Holger Schwender; Simon B Eickhoff; Kaustubh R Patil

doi: 10.1093/gigascience/giad071 , 10.48550/arxiv.2210.09232 , 10.34734/fzj-2023-03119 , 10.34734/fzj-2023-03045

pmid: 37776368

pmc: PMC10541796

arXiv: 2210.09232

Abstract

Background: Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance in personalized medicine, because they can model complex relationships between features and the target. However, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, the possible pitfalls of using CR in ML pipelines are not fully understood.

Results: We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. Using a simple framework in which the target serves as the confound, we show that information leaked via CR can inflate null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage and not to the revealing of information. We then demonstrate the danger of confound-leakage in a real-world clinical application, where the accuracy of predicting attention-deficit/hyperactivity disorder from speech-derived features is overestimated when depression is used as a confound.

Conclusions: Mishandling or even amplifying confounding effects when building ML models because of confound-leakage can, as shown, lead to untrustworthy, biased, and unfair predictions. Our exposé of the confound-leakage pitfall and the guidelines we provide for dealing with it can help create more robust and trustworthy ML models.
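The mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or experimental setup. It is a toy example that assumes scikit-learn and NumPy are available, uses synthetic binary features drawn independently of a binary target (so they play a role similar to shuffled features), and uses a single train/test split with a decision tree as the nonlinear learner. All variable names and the helper `tree_accuracy` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 5

# Toy data (assumption): binary target and binary features drawn independently,
# so the raw features carry no information about the target.
y = rng.integers(0, 2, size=n_samples)
X = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def tree_accuracy(train_features, test_features):
    """Fit a decision tree on the training split and score it on the test split."""
    model = DecisionTreeClassifier(random_state=0)
    model.fit(train_features, y_train)
    return model.score(test_features, y_test)


# Baseline: nonlinear model on the raw features, no confound removal.
acc_raw = tree_accuracy(X_train, X_test)

# Featurewise linear CR with the target used as the confound.
# The regression is fitted on the training split only and then applied to both
# splits; each feature column gets its own intercept and slope.
conf_train = y_train.reshape(-1, 1).astype(float)
conf_test = y_test.reshape(-1, 1).astype(float)
cr = LinearRegression().fit(conf_train, X_train)
X_train_cr = X_train - cr.predict(conf_train)
X_test_cr = X_test - cr.predict(conf_test)

# Same nonlinear model, now on the residualized features.
acc_cr = tree_accuracy(X_train_cr, X_test_cr)

print(f"accuracy without CR: {acc_raw:.2f}")
print(f"accuracy with CR:    {acc_cr:.2f}")
```

In this sketch the baseline should stay near chance, while the pipeline with CR should score far higher. The reason is that subtracting a class-specific fitted mean from a feature with only two unique values produces residuals whose exact values differ by class, and a decision tree can split on those small offsets. A linear model would be much less able to exploit them, which is consistent with the abstract's emphasis on nonlinear learners.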

Countries

Germany, United Kingdom

Related Organizations

Helmholtz Association of German Research Centres (Germany)
RWTH Aachen University (Germany)
Heinrich Heine University Düsseldorf (Germany)
Forschungszentrum Jülich (Germany)
University College London (United Kingdom)

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Research, machine-learning, 610, Machine Learning (stat.ML), clinical applications, data-leakage, confounding, Machine Learning (cs.LG), Machine Learning, Artificial Intelligence (cs.AI), Statistics - Machine Learning, info:eu-repo/classification/ddc/610

Related research

ConfoundLeakage software on GitHub (relation: IsRelatedTo)

Impact by BIP!

selected citations: 12
These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).

popularity: Top 10%
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.

influence: Average
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).

impulse: Top 10%
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.