Powered by OpenAIRE graph
https://dx.doi.org/10.25560/45...
Other literature type · 2016
Data sources: Datacite

Variable selection in the curse of dimensionality

Authors: Mares, Mihaela Andreea

Abstract

High-throughput technologies nowadays are leading to massive availability of data to be explored. Therefore, we are keen to build mathematical and statistical methods for extracting as much value from the available data as possible. However, large dimensionality in terms of both sample size and number of features or variables poses new challenges. The large number of samples can be tackled relatively easily by increasing computational power and making use of distributed computation technologies. The large number of features or variables poses the risk of explaining variation in both noise and signal with the wrong explanatory variables. One approach to overcoming this problem is to select, from the initial set, a smaller set of features that are most relevant given an assumed prediction model. This approach is called variable or feature selection, and it implies using a bias, or statistical assumption, about which features should be considered more relevant. Different feature selection methods use different statistical assumptions about the mathematical relation between the predicted and explanatory variables and about which explanatory variables should be considered more relevant. Our first contribution in this thesis is to combine the strengths of different variable selection methods that rely on different statistical assumptions. We start by classifying existing feature selection methods based on their assumptions and assessing their capacity to scale to high-dimensional data, particularly when the number of samples is much smaller than the number of features. We propose a new algorithm that combines results from feature selection methods relying on disjoint assumptions about the function that generated the data, and we show that our method leads to better sensitivity than using each method individually.
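The abstract does not specify the combination algorithm itself; as a purely illustrative sketch, the idea of gaining sensitivity by uniting selectors with disjoint assumptions can be demonstrated with two toy selectors, one assuming a linear relation and one that only looks at binned conditional means (every name and threshold below is made up for illustration, not taken from the thesis):

```python
# Illustrative sketch only: combine two feature selectors that rely on
# disjoint assumptions and take the union of their selections. A true
# feature missed by one selector can still be caught by the other, so
# sensitivity can only improve. Not the thesis's actual algorithm.
import math, random

random.seed(0)
n, p = 200, 20
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# Feature 0 acts linearly on y; feature 1 acts through a non-monotone
# function that a purely linear selector cannot see (corr(x, cos(2x)) = 0).
y = [1.5 * row[0] + 3.0 * math.cos(2.0 * row[1]) + random.gauss(0, 0.1)
     for row in X]

def linear_selector(X, y, k=2):
    """Linear assumption: rank features by absolute Pearson correlation."""
    n, p = len(X), len(X[0])
    my = sum(y) / n
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        mj = sum(col) / n
        cov = sum((a - mj) * (b - my) for a, b in zip(col, y))
        sj = math.sqrt(sum((a - mj) ** 2 for a in col))
        scores.append(abs(cov) / (sj * sy))
    return set(sorted(range(p), key=scores.__getitem__)[-k:])

def binned_selector(X, y, k=2, bins=4):
    """Disjoint assumption: score each feature by the spread of the
    conditional means of y across bins of that feature, which also
    detects non-monotone effects that correlation misses."""
    n, p = len(X), len(X[0])
    size = n // bins
    scores = []
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        means = [sum(y[i] for i in order[b * size:(b + 1) * size]) / size
                 for b in range(bins)]
        grand = sum(means) / bins
        scores.append(sum((m - grand) ** 2 for m in means))
    return set(sorted(range(p), key=scores.__getitem__)[-k:])

combined = linear_selector(X, y) | binned_selector(X, y)
print(sorted(combined))
```

On this toy data the linear selector finds feature 0 but is blind to feature 1, while the binned selector picks up both; their union therefore recovers both true features.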
The assumption of a linear relationship between the predicted variable and the explanatory variables is one of the most widely used simplifying assumptions. Our second contribution is to prove that at least one feature selection algorithm based on the linearity assumption is consistent even when the underlying function that generated the data is not necessarily linear. Based on these theoretical findings, we propose a new algorithm which provides better results when the underlying function that generated the data is at most partially linear.

Neural networks, and in particular deep learning architectures, have been shown to be able to fit highly non-linear prediction models when given sufficient training examples. However, they do not embed feature selection mechanisms. We contribute by assessing the performance of these models when given a large number of features and fewer samples, proposing a method for feature selection, and showing in which circumstances combining this feature selection method with deep learning architectures will outperform not using feature selection.

Several feature selection methods, as well as the new methods we have proposed in this thesis, rely on re-sampling techniques or on using different algorithms for the same dataset. Their advantage is partially gained by using extra computational power. Therefore, our last contribution consists of an efficient data distribution and load-balanced parallel calculation scheme for re-sampling-based algorithms.
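The thesis's actual data-distribution and load-balancing scheme is not described in this abstract. As a hedged sketch of the generic pattern behind re-sampling-based selection (in the spirit of stability selection), each worker runs the same base selector on an independent random subsample, and features selected in a large fraction of runs are kept; all function names and the 0.8 threshold below are illustrative assumptions:

```python
# Hypothetical sketch of re-sampling-based feature selection run in
# parallel workers; not the thesis's actual implementation. Each resample
# is independent, so the work parallelizes trivially.
import math, random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def make_data(n=300, p=15, seed=1):
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.5) for row in X]
    return X, y

def correlation_select(X, y, k=2):
    """Base selector: top-k features by absolute Pearson correlation."""
    n, p = len(X), len(X[0])
    my = sum(y) / n
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        mj = sum(col) / n
        cov = sum((a - mj) * (b - my) for a, b in zip(col, y))
        sj = math.sqrt(sum((a - mj) ** 2 for a in col))
        scores.append(abs(cov) / (sj * sy))
    return sorted(range(p), key=scores.__getitem__)[-k:]

def one_resample(seed, X, y):
    """One unit of work: run the base selector on a random half-subsample."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(X)), len(X) // 2)
    return correlation_select([X[i] for i in idx], [y[i] for i in idx])

X, y = make_data()
# A real distributed implementation would also partition the data itself
# across machines and balance the per-resample workload between them.
with ThreadPoolExecutor(max_workers=4) as pool:
    runs = list(pool.map(lambda s: one_resample(s, X, y), range(40)))

freq = Counter(j for sel in runs for j in sel)
stable = sorted(j for j, c in freq.items() if c >= 0.8 * len(runs))
print(stable)
```

Threads are used here only to keep the sketch self-contained; CPU-bound resampling would normally use process- or cluster-level parallelism.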

Keywords: 004, 510

  • BIP! indicators: selected citations: 0 · popularity: Average · influence: Average · impulse: Average