New Multi-Label Correlation-Based Feature Selection Methods for Multi-Label Classification and Application in Bioinformatics

Doctoral thesis English OPEN
Jungjit, Suwimol (2016)
  • Subject: Q | T

The very large dimensionality of real-world datasets is a challenging problem for classification algorithms, since many features are often redundant or irrelevant for classification. In addition, a very large number of features leads to a high computational time for classification algorithms. Feature selection methods deal with the large dimensionality of data by selecting a relevant feature subset according to an evaluation criterion. The vast majority of research on feature selection involves conventional single-label classification problems, where each instance is assigned a single class label; but there has been growing research on more complex multi-label classification problems, where each instance can be assigned multiple class labels.

This thesis proposes three types of new Multi-Label Correlation-based Feature Selection (ML-CFS) methods, namely: (a) methods based on hill-climbing search, (b) methods that exploit biological knowledge (still using hill-climbing search), and (c) methods based on genetic algorithms as the search method.

Firstly, we proposed three versions of ML-CFS methods based on hill-climbing search. In essence, these ML-CFS versions extend the original CFS method by extending the merit function (which evaluates candidate feature subsets) to the multi-label classification scenario, as well as modifying the merit function in other ways. A conventional search strategy, hill-climbing, was used to explore the space of candidate solutions (candidate feature subsets) for those three versions of ML-CFS. These ML-CFS versions are described in detail in Chapter 4.

Secondly, in an attempt to improve the performance of ML-CFS on cancer-related microarray gene expression datasets, we proposed three versions of the ML-CFS method that exploit biological knowledge.
These ML-CFS versions are also based on hill-climbing search, but the merit function was modified in a way that favours the selection of genes (features) involved in pre-defined cancer-related pathways, as discussed in detail in Chapter 5.

Lastly, we proposed two more sophisticated versions of ML-CFS based on Genetic Algorithms (rather than hill-climbing) as the search method. The first version of GA-based ML-CFS is based on a conventional single-objective GA, where there is only one objective to be optimized; while the second version of GA-based ML-CFS performs lexicographic multi-objective optimization, where there are two objectives to be optimized, as discussed in detail in Chapter 6.

In this thesis, all proposed ML-CFS methods for multi-label classification problems were evaluated by measuring the predictive accuracies obtained by two well-known multi-label classification algorithms when using the selected features, namely: the Multi-Label k-Nearest Neighbours (ML-kNN) algorithm and the Back-Propagation Multi-Label Learning (BPMLL) neural network algorithm.

In general, the results obtained by the best version of the proposed ML-CFS methods, namely a GA-based ML-CFS method, were competitive with the results of other multi-label feature selection methods and baseline approaches. More precisely, one of our GA-based methods achieved the second best predictive accuracy out of all methods being compared (both with ML-kNN and BPMLL used as classifiers), but there was no statistically significant difference between that GA-based ML-CFS and the best method in terms of predictive accuracy.
In addition, in the experiments with ML-kNN, the most accurate method selects about twice as many features as our GA-based ML-CFS; whilst in the experiments with BPMLL, the most accurate method was a baseline method that does not perform any feature selection and runs the classifier once (with all original features) for each of the many class labels, which is a very computationally expensive baseline approach.

In summary, one of the proposed GA-based ML-CFS methods managed to achieve substantial data reduction (selecting a smaller subset of relevant features) without a significant decrease in predictive accuracy with respect to the most accurate method.
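
To make the ideas in the abstract concrete, the following is a minimal Python sketch, not the thesis's implementation. It uses the standard CFS merit formula, k * r_fl / sqrt(k + k * (k - 1) * r_ff), and assumes one plausible multi-label extension in which the feature-label correlation r_fl is the mean absolute Pearson correlation over all (feature, label) pairs; the exact merit functions of the ML-CFS versions are defined in Chapter 4 and may differ. The search shown is a simple forward hill-climbing, and `lexicographic_better` illustrates lexicographic comparison of two objectives with a hypothetical tolerance `tol`. All function names and the data layout (one list per feature column and per label column) are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def merit(subset, features, labels):
    """CFS-style merit of a feature subset (assumed multi-label variant).

    subset: list of feature indices; features/labels: lists of columns.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    # Assumption: average |correlation| over every (feature, label) pair.
    r_fl = mean(abs(pearson(features[f], lab)) for f in subset for lab in labels)
    # Average |correlation| between distinct features in the subset (redundancy).
    if k > 1:
        r_ff = mean(abs(pearson(features[a], features[b]))
                    for a, b in combinations(subset, 2))
    else:
        r_ff = 0.0
    return k * r_fl / sqrt(k + k * (k - 1) * r_ff)


def hill_climb(features, labels):
    """Forward hill-climbing: repeatedly add the feature that most improves merit."""
    selected, best = [], 0.0
    while True:
        candidates = [f for f in range(len(features)) if f not in selected]
        scored = [(merit(selected + [f], features, labels), f) for f in candidates]
        if not scored:
            return selected, best
        top_merit, top_f = max(scored)
        if top_merit <= best:  # no candidate improves the merit: local optimum
            return selected, best
        selected.append(top_f)
        best = top_merit


def lexicographic_better(a, b, tol=1e-3):
    """True if objective tuple a beats b, comparing objectives in priority order.

    Higher is better; a lower-priority objective is consulted only when all
    higher-priority ones differ by at most tol (a hypothetical tolerance).
    """
    for x, y in zip(a, b):
        if abs(x - y) > tol:
            return x > y
    return False
```

With this merit, adding an exact copy of an already selected feature leaves the score essentially unchanged, which is how the CFS formula penalises redundancy. A GA-based search would replace `hill_climb` with a population of candidate subsets scored by the same `merit`; in a lexicographic two-objective variant, candidates could be compared with `lexicographic_better` on hypothetical tuples such as `(merit, -subset_size)`.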