Vtreat: A Data.Frame Processor For Predictive Modeling

Publication type: Preprint, Article, Other literature type

Date: 01 Jan 2016 (embargo end date: 01 Jan 2016)

Language: English

Publisher: Zenodo

Authors: Zumel, Nina; Mount, John

doi: 10.5281/zenodo.1173314, 10.5281/zenodo.3462537, 10.48550/arxiv.1611.09477, 10.5281/zenodo.3265462, 10.5281/zenodo.1173313

arXiv: 1611.09477

Abstract

We examine common problems in data used for predictive modeling tasks and describe how to address them with the vtreat R package. vtreat prepares real-world data for predictive modeling in a reproducible and statistically sound manner. We describe the theory of preparing variables so that the data has fewer exceptional cases, making it easier to use models safely in production. The problems addressed include infinite values, invalid values, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application but not during training). Of special interest are the techniques needed to avoid introducing undesirable nested modeling bias, a risk whenever a data preprocessor is used.

Keywords

FOS: Computer and information sciences, Applications (stat.AP), Statistics - Applications

Impact by BIP!

- selected citations: 2
- popularity: Average
- influence: Average
- impulse: Average