Understanding complex predictive models with ghost variables

Publication type: Article, Preprint

Published: 24 Aug 2022

Embargo end date: 01 Jan 2019

Language: English

Publisher: Springer Science and Business Media LLC

Journal: TEST, volume 32, pages 107-145 (issn: 1133-0686, eissn: 1863-8260)

Authors: Pedro Delicado; Daniel Peña;

doi: 10.1007/s11749-022-00826-x , 10.48550/arxiv.1912.06407

arXiv: 1912.06407

handle: 10016/38473 , 2117/383386

Abstract

Framed in the literature on Interpretable Machine Learning, we propose a new procedure to assign a measure of relevance to each explanatory variable in a complex predictive model. We assume that we have a training set to fit the model and a test set to check its out-of-sample performance. We propose to measure the individual relevance of each variable by comparing the predictions of the model in the test set with those obtained when the variable of interest is substituted (in the test set) by its ghost variable, defined as the prediction of this variable from the rest of the explanatory variables. In linear models it is shown that, on the one hand, the proposed measure gives results similar to leave-one-covariate-out (loco) at a lower computational cost and outperforms random permutations, and, on the other hand, it is strongly related to the usual F-statistic measuring the significance of a variable. In nonlinear predictive models (such as neural networks or random forests) the proposed measure shows the relevance of the variables in an efficient way, as shown by a simulation study comparing ghost variables with alternative methods (including loco and random permutations, and also knockoff variables and estimated conditional distributions). Finally, we study the joint relevance of the variables by defining the relevance matrix as the covariance matrix of the vectors of effects on predictions when using every ghost variable. Our proposal is illustrated with simulated examples and the analysis of a large real data set.
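
The procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (their GhostVariables software is listed among the related research products). The following choices are assumptions made for the sketch: the helper names `fit_ols`, `predict_ols` and `ghost_relevance` are hypothetical, each ghost variable is estimated by a linear regression fitted on the test set, relevance is taken as the unnormalized mean squared change in predictions, and the relevance matrix is computed as the sample covariance of the effect vectors.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    """Predictions of a linear model with intercept beta[0]."""
    return beta[0] + X @ beta[1:]

def ghost_relevance(predict, X_test):
    """Ghost-variable relevance for each column of X_test.

    `predict` is any fitted model's prediction function.
    Assumption: ghost variables come from a linear regression of x_j
    on the other covariates, fitted on the test set itself.
    """
    n, p = X_test.shape
    base = predict(X_test)
    effects = np.empty((n, p))
    for j in range(p):
        others = np.delete(X_test, j, axis=1)
        # Ghost variable: prediction of x_j from the remaining covariates.
        ghost = predict_ols(fit_ols(others, X_test[:, j]), others)
        X_ghost = X_test.copy()
        X_ghost[:, j] = ghost
        # Effect on predictions of replacing x_j by its ghost variable.
        effects[:, j] = base - predict(X_ghost)
    relevance = (effects ** 2).mean(axis=0)           # unnormalized (assumption)
    relevance_matrix = np.cov(effects, rowvar=False)  # covariance of effect vectors
    return relevance, relevance_matrix

# Simulated example: x1 and x2 are correlated, x3 plays no role in y.
rng = np.random.default_rng(0)
n = 400
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x2 + 0.5 * rng.normal(size=n)

X_train, y_train = X[:200], y[:200]   # training set: fit the model
X_test = X[200:]                      # test set: measure relevance

beta = fit_ols(X_train, y_train)
relevance, V = ghost_relevance(lambda Z: predict_ols(beta, Z), X_test)
print("relevance by ghost variables:", relevance)
print("relevance matrix:\n", V)
```

Because `ghost_relevance` only needs a prediction function, the fitted linear model could be swapped for a neural network or a random forest without changing the rest of the sketch. No refitting of the predictive model is needed for each variable, which is the source of the computational saving relative to loco mentioned in the abstract.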

Related Organizations

Carlos III University of Madrid (UC3M)
Spain
Universitat Politècnica de Catalunya
Spain
University of Barcelona
Spain

Keywords

Mathematical statistics, FOS: Computer and information sciences, Estimated conditional distributions, Computer Science - Machine Learning, Mathematics, Out-of-sample prediction, Machine Learning (stat.ML), Statistics, Computational aspects of data analysis and big data, Economics, Machine Learning (cs.LG), Random permutations, Methodology (stat.ME), Statistical aspects of big data and data science, AMS classification::68 Computer science::68T Artificial intelligence, Statistics - Machine Learning, Nonparametric regression and quantile regression, Partial correlation matrix, Statistics - Methodology, Interpretable machine learning, Explainable artificial intelligence, Leave-one-covariate-out, Knockoffs, UPC subject areas::Mathematics and statistics::Mathematical analysis, AMS classification::62 Statistics::62G Nonparametric inference

Related research products (2)

hrt software on GitHub
IsRelatedTo
GhostVariables software on GitHub
IsRelatedTo

Impact by BIP!

selected citations: 3. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.