Regression Phalanxes

Preprint English OPEN
Zhang, Hongyang ; Welch, William J. ; Zamar, Ruben H. (2017)
  • Subject: Statistics - Machine Learning

Tomal et al. (2015) introduced the notion of "phalanxes" in the context of rare-class detection in two-class classification problems. A phalanx is a subset of features that work well for classification tasks. In this paper, we propose a different class of phalanxes for application in regression settings. We define a "Regression Phalanx" as a subset of features that work well together for prediction. We propose a novel algorithm that automatically chooses Regression Phalanxes from high-dimensional data sets using hierarchical clustering and builds a prediction model for each phalanx for further ensembling. Through extensive simulation studies and several real-life applications in various areas (including drug discovery, chemical analysis of spectra data, microarray analysis and climate projections), we show that an ensemble of Regression Phalanxes improves prediction accuracy when combined with effective prediction methods like Lasso or Random Forests.
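
The pipeline summarized in the abstract (group features by hierarchical clustering, fit one model per group, then ensemble the predictions) can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes features are grouped by average-linkage clustering on the distance 1 - |correlation|, it substitutes ordinary least squares for Lasso or Random Forests as the base learner, and it ensembles by a plain average. The function names `group_features`, `fit_group_models` and `predict_ensemble`, the `max_dist` threshold and the simulated data are all hypothetical.

```python
import numpy as np

def group_features(X, max_dist=0.5):
    """Average-linkage agglomerative clustering of the columns of X.

    Assumption: distance between two features is 1 - |Pearson correlation|.
    Merging stops once the closest pair of groups is farther than max_dist.
    """
    dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    groups = [[j] for j in range(X.shape[1])]
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = dist[np.ix_(groups[a], groups[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_dist:
            break
        _, a, b = best
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

def fit_group_models(X, y, groups):
    """Fit one least-squares model with intercept per feature group."""
    models = []
    for cols in groups:
        A = np.column_stack([np.ones(len(X)), X[:, cols]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        models.append((cols, coef))
    return models

def predict_ensemble(X, models):
    """Average the predictions of the per-group models."""
    preds = [
        np.column_stack([np.ones(len(X)), X[:, cols]]) @ coef
        for cols, coef in models
    ]
    return np.mean(preds, axis=0)

# Simulated data (illustrative only): two latent factors, three noisy
# copies of each as features, response driven by both factors.
rng = np.random.default_rng(0)
n = 200
z1, z2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack(
    [z1 + 0.3 * rng.normal(size=n) for _ in range(3)]
    + [z2 + 0.3 * rng.normal(size=n) for _ in range(3)]
)
y = z1 + z2 + 0.1 * rng.normal(size=n)

groups = group_features(X)
models = fit_group_models(X, y, groups)
pred = predict_ensemble(X, models)
print("feature groups:", [sorted(g) for g in groups])
print("in-sample correlation with y:", np.corrcoef(pred, y)[0, 1])
```

In this toy setting the two groups correspond to the two latent factors. A plain average shrinks the combined signal, so a practical ensemble would likely need a more careful grouping criterion and combination rule than this sketch uses.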
  • References (13)
    13 references, page 1 of 2

    Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185-193.

    Breiman, L. (2001). Random forests. Machine learning, 45(1):5-32.

    Burden, F. R. (1989). Molecular identification number for substructure searches. Journal of Chemical Information and Computer Sciences, 29(3):225-227.

    Carhart, R. E., Smith, D. H., and Venkataraghavan, R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2):64-73.

    Esbensen, K., Midtgaard, T., and Schonkopf, S. (1996). Multivariate Analysis in Practice: A Training Package. Camo As.

    Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1.

    Hughes-Oliver, J. M., Brooks, A. D., Welch, W. J., Khaledi, M. G., Hawkins, D., Young, S. S., Patil, K., Howell, G. W., Ng, R. T., and Chu, M. T. (2010). Chemmodlab: a web-based cheminformatics modeling laboratory. In silico biology, 11(1-2):61-81.

    Lemberge, P., De Raedt, I., and Janssens, K. H. (2000). Quantitative analysis of 16-17th century archaeological glass vessels using pls regression of epxma and mu-xrf data. Journal of chemometrics, 14(5):751-764.

    Liu, K., Feng, J., and Young, S. S. (2005). Powermv: a software environment for molecular viewing, descriptor generation, data analysis and hit evaluation. Journal of chemical information and modeling, 45(2):515-522.

    Sargsyan, K., Safta, C., Najm, H. N., Debusschere, B. J., Ricciuto, D., and Thornton, P. (2014). Dimensionality reduction for complex models via bayesian compressive sensing. International Journal for Uncertainty Quantification, 4(1).
