Powered by OpenAIRE graph
https://dx.doi.org/10.25560/45...
Other literature type · 2016
Data sources: Datacite

Variable selection in the curse of dimensionality

Authors: Mares, Mihaela Andreea

Abstract

High-throughput technologies nowadays are leading to massive availability of data to be explored. Therefore, we are keen to build mathematical and statistical methods for extracting as much value from the available data as possible. However, large dimensionality in terms of both sample size and number of features or variables poses new challenges. The large number of samples can be tackled relatively easily by increasing computational power and making use of distributed computation technologies. The large number of features or variables poses the risk of explaining variation in both noise and signal with the wrong explanatory variables. One approach to overcoming this problem is to select, from the initial set, a smaller set of features that are most relevant given an assumed prediction model. This approach is called variable or feature selection, and it implies using a bias, or statistical assumption, about which features should be considered more relevant. Different feature selection methods use different statistical assumptions about the mathematical relation between the predicted and explanatory variables and about which explanatory variables should be considered more relevant. Our first contribution in this thesis is to combine the strengths of different variable selection methods that rely on different statistical assumptions. We start by classifying existing feature selection methods based on their assumptions and assessing their capacity to scale to high-dimensional data, particularly when the number of samples is much smaller than the number of features. We propose a new algorithm that combines results from feature selection methods relying on disjoint assumptions about the function that generated the data, and we show that our method leads to better sensitivity than using each method individually.
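The abstract does not specify the combination algorithm itself; as a purely illustrative sketch, the idea of gaining sensitivity by uniting selectors with disjoint assumptions can be demonstrated with two toy selectors, one assuming a linear relation and one that only looks at binned conditional means (every name and threshold below is made up for illustration, not taken from the thesis):

```python
# Illustrative sketch only: combine two feature selectors that rely on
# disjoint assumptions and take the union of their selections. A true
# feature missed by one selector can still be caught by the other, so
# sensitivity can only improve. Not the thesis's actual algorithm.
import math, random

random.seed(0)
n, p = 200, 20
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# Feature 0 acts linearly on y; feature 1 acts through a non-monotone
# function that a purely linear selector cannot see (corr(x, cos(2x)) = 0).
y = [1.5 * row[0] + 3.0 * math.cos(2.0 * row[1]) + random.gauss(0, 0.1)
     for row in X]

def linear_selector(X, y, k=2):
    """Linear assumption: rank features by absolute Pearson correlation."""
    n, p = len(X), len(X[0])
    my = sum(y) / n
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        mj = sum(col) / n
        cov = sum((a - mj) * (b - my) for a, b in zip(col, y))
        sj = math.sqrt(sum((a - mj) ** 2 for a in col))
        scores.append(abs(cov) / (sj * sy))
    return set(sorted(range(p), key=scores.__getitem__)[-k:])

def binned_selector(X, y, k=2, bins=4):
    """Disjoint assumption: score each feature by the spread of the
    conditional means of y across bins of that feature, which also
    detects non-monotone effects that correlation misses."""
    n, p = len(X), len(X[0])
    size = n // bins
    scores = []
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        means = [sum(y[i] for i in order[b * size:(b + 1) * size]) / size
                 for b in range(bins)]
        grand = sum(means) / bins
        scores.append(sum((m - grand) ** 2 for m in means))
    return set(sorted(range(p), key=scores.__getitem__)[-k:])

combined = linear_selector(X, y) | binned_selector(X, y)
print(sorted(combined))
```

On this toy data the linear selector finds feature 0 but is blind to feature 1, while the binned selector picks up both; their union therefore recovers both true features.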
The assumption of a linear relationship between the predicted variable and the explanatory variables is one of the most widely used simplifying assumptions. Our second contribution is to prove that at least one feature selection algorithm based on the linearity assumption is consistent even when the underlying function that generated the data is not necessarily linear. Based on these theoretical findings, we propose a new algorithm which provides better results when the underlying function that generated the data is at most partially linear.

Neural networks, and in particular deep learning architectures, have been shown to be able to fit highly non-linear prediction models when given sufficient training examples. However, they do not embed feature selection mechanisms. We contribute by assessing the performance of these models when given a large number of features and fewer samples, proposing a method for feature selection, and showing in which circumstances combining this feature selection method with deep learning architectures will outperform not using feature selection.

Several feature selection methods, as well as the new methods we have proposed in this thesis, rely on re-sampling techniques or on using different algorithms for the same dataset. Their advantage is partially gained by using extra computational power. Therefore, our last contribution consists of an efficient data distribution and load-balanced parallel calculation scheme for re-sampling-based algorithms.
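The thesis's actual data-distribution and load-balancing scheme is not described in this abstract. As a hedged sketch of the generic pattern behind re-sampling-based selection (in the spirit of stability selection), each worker runs the same base selector on an independent random subsample, and features selected in a large fraction of runs are kept; all function names and the 0.8 threshold below are illustrative assumptions:

```python
# Hypothetical sketch of re-sampling-based feature selection run in
# parallel workers; not the thesis's actual implementation. Each resample
# is independent, so the work parallelizes trivially.
import math, random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def make_data(n=300, p=15, seed=1):
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.5) for row in X]
    return X, y

def correlation_select(X, y, k=2):
    """Base selector: top-k features by absolute Pearson correlation."""
    n, p = len(X), len(X[0])
    my = sum(y) / n
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        mj = sum(col) / n
        cov = sum((a - mj) * (b - my) for a, b in zip(col, y))
        sj = math.sqrt(sum((a - mj) ** 2 for a in col))
        scores.append(abs(cov) / (sj * sy))
    return sorted(range(p), key=scores.__getitem__)[-k:]

def one_resample(seed, X, y):
    """One unit of work: run the base selector on a random half-subsample."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(X)), len(X) // 2)
    return correlation_select([X[i] for i in idx], [y[i] for i in idx])

X, y = make_data()
# A real distributed implementation would also partition the data itself
# across machines and balance the per-resample workload between them.
with ThreadPoolExecutor(max_workers=4) as pool:
    runs = list(pool.map(lambda s: one_resample(s, X, y), range(40)))

freq = Counter(j for sel in runs for j in sel)
stable = sorted(j for j, c in freq.items() if c >= 0.8 * len(runs))
print(stable)
```

Threads are used here only to keep the sketch self-contained; CPU-bound resampling would normally use process- or cluster-level parallelism.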

Keywords: 004, 510

  • BIP! indicators: selected citations: 0 · popularity: Average · influence: Average · impulse: Average