Similarity analysis of feature ranking techniques on imbalanced DNA microarray datasets

Publication: Article, 01 Oct 2012. Publisher: IEEE. Journal: 2012 IEEE International Conference on Bioinformatics and Biomedicine

Authors: Randall Wald; Taghi M. Khoshgoftaar; Amri Napolitano; David J. Dittman

doi: 10.1109/bibm.2012.6392708

Abstract

DNA microarrays are a modern advancement in the analysis of genetic data. This technology allows a researcher to test samples for thousands of genes simultaneously. However, once the samples in the DNA microarrays have been tested, the researcher must then search through the collected data and identify the genes important to their problem. A possible solution to this issue is the data mining pre-processing technique called feature selection. Feature (gene) selection takes the original set of features (in the case of DNA microarrays, gene probes) and chooses an optimal subset on which to perform analysis. Ideally, the reduced subset contains only the most important features as determined by the feature selection technique (or set of feature selection techniques), which allows for further research on the discovered genes. However, when multiple feature selection techniques are used, the set of techniques must be diverse in order to reduce redundancy among the chosen features. Another benefit of increasing diversity is that any features chosen across a diverse set of feature selection techniques will have more importance than those chosen by a single technique or a set of related ones. Therefore, it would be useful to know how similar the feature selection techniques are to each other. In this study, we analyze eighteen feature selection techniques across nine imbalanced DNA microarray datasets using four feature subset sizes. Our results show that, to minimize redundancy, one should not use Gini Index and Probability Ratio together, or the Kolmogorov-Smirnov statistic and Geometric Mean together, at any feature subset size, and that the members of the first of these pairs (along with the pair of ReliefF and ReliefF-W) are very dissimilar to all rankers outside their own cluster. We also found that Chi-Squared, Information Gain, and Symmetric Uncertainty form a cluster of similarity, as do Chi-Squared, Deviance, F-Measure, and Mutual Information.
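To make the idea of comparing rankers at a fixed feature subset size concrete, here is a minimal sketch. The abstract does not state which similarity measure the study uses, so this assumes a simple one: the fraction of features shared by the top-k subsets of two rankers. The ranker names, gene names, and scores are all hypothetical.

```python
from itertools import combinations


def top_k(scores, k):
    """Return the set of the k feature names with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def overlap_similarity(scores_a, scores_b, k):
    """Fraction of the top-k features shared by two rankers (0.0 to 1.0)."""
    return len(top_k(scores_a, k) & top_k(scores_b, k)) / k


def pairwise_similarity(rankers, k):
    """Map each unordered pair of ranker names to its top-k overlap."""
    return {
        (a, b): overlap_similarity(rankers[a], rankers[b], k)
        for a, b in combinations(sorted(rankers), 2)
    }


# Hypothetical scores for five gene probes from three hypothetical rankers.
rankers = {
    "ranker_x": {"g1": 0.9, "g2": 0.8, "g3": 0.1, "g4": 0.4, "g5": 0.3},
    "ranker_y": {"g1": 0.7, "g2": 0.9, "g3": 0.2, "g4": 0.5, "g5": 0.1},
    "ranker_z": {"g1": 0.1, "g2": 0.2, "g3": 0.9, "g4": 0.3, "g5": 0.8},
}

similarities = pairwise_similarity(rankers, k=2)
for pair, value in similarities.items():
    print(pair, value)
```

Under this assumed measure, a pair with overlap near 1.0 selects almost the same genes and is redundant to use together, while a pair near 0.0 is dissimilar. Repeating the computation for several values of k would correspond to the multiple feature subset sizes mentioned in the abstract.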

Related Organizations

Florida Atlantic University
United States

Impact by BIP!

citations: 11
popularity: Top 10%
influence: Average
impulse: Average

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
