
As the volume of data and the demand for its detailed analysis keep growing, clustering, which is used to reveal hidden patterns in data, remains of great importance. However, traditional methods face many limitations when clustering genetic data, which are often high-dimensional. In the current work, a new semi-supervised ensemble spectral clustering (EPLSC) algorithm based on the graph p-Laplacian is introduced for genetic data. In the proposed approach, we first propagate the pairwise must-link and cannot-link constraints to all data points. Then the feature space is randomly split into several subspaces of unequal size. Using the updated pairwise constraints, semi-supervised spectral clustering is performed in each subspace independently. The results from the individual subspaces are then combined through ensemble learning into an adjacency matrix. Next, several search operators are applied in environments composed of different subspaces to obtain the best set of subspaces. Experimental validation on 15 high-dimensional genetic datasets demonstrates that EPLSC outperforms existing methods, achieving improvements of up to 18% in Normalized Mutual Information (NMI) and 15% in Adjusted Rand Index (ARI) compared to traditional semi-supervised techniques. This indicates that EPLSC not only improves clustering quality but also effectively addresses the particular challenges posed by genetic data.
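The pipeline described above can be illustrated with a minimal sketch. It is not the EPLSC implementation: it assumes only two clusters, uses the ordinary normalized graph Laplacian (the p = 2 case) in place of the graph p-Laplacian, propagates constraints by simple transitive closure, and omits the search over subspace sets. All function names and parameters (`propagate_constraints`, `fiedler_split`, `constrained_labels`, `ensemble_cluster`, `n_subspaces`, `eps`) are hypothetical and chosen for illustration.

```python
import numpy as np

def propagate_constraints(n, must, cannot):
    """Assumed propagation rule: must-links are closed transitively, and a
    cannot-link between two points extends to their whole must-link groups."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in must:
        parent[find(i)] = find(j)
    group = np.array([find(i) for i in range(n)])
    M = group[:, None] == group[None, :]           # propagated must-links
    Cn = np.zeros((n, n), dtype=bool)              # propagated cannot-links
    for i, j in cannot:
        block = np.outer(group == group[i], group == group[j])
        Cn |= block | block.T
    return M, Cn

def fiedler_split(W, eps=1e-3):
    """Two-way cut from the sign of the second eigenvector of the
    normalized Laplacian (standard p = 2 case, not the p-Laplacian)."""
    W = W + eps                                    # keeps the graph connected
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                    # eigenvalues in ascending order
    return (vecs[:, 1] * d_inv_sqrt > 0).astype(int)

def constrained_labels(X, M, Cn):
    """Semi-supervised spectral clustering in one subspace: a Gaussian
    affinity whose entries are overwritten by the pairwise constraints."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / sq.mean())                    # bandwidth: mean squared distance
    W[M] = 1.0                                     # must-link pairs fully similar
    W[Cn] = 0.0                                    # cannot-link pairs dissimilar
    return fiedler_split(W)

def ensemble_cluster(X, must, cannot, n_subspaces=10, seed=0):
    """Cluster in random subspaces of unequal size (assumes at least 4
    features), build a co-association matrix, then cut that matrix."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    M, Cn = propagate_constraints(n, must, cannot)
    co = np.zeros((n, n))
    for _ in range(n_subspaces):
        size = rng.integers(max(2, d // 4), max(3, d // 2 + 1))
        feats = rng.choice(d, size=size, replace=False)
        labels = constrained_labels(X[:, feats], M, Cn)
        co += labels[:, None] == labels[None, :]   # 1 where a pair is co-clustered
    co /= n_subspaces                              # ensemble adjacency matrix
    return fiedler_split(co), co
```

Here the co-association matrix `co` plays the role of the ensemble adjacency matrix: entry (i, j) is the fraction of subspaces in which points i and j were placed in the same cluster. Handling more than two clusters would require embedding with several eigenvectors followed by a step such as k-means, which this sketch leaves out.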
QA76.75-76.765, high-dimensional data, Mining engineering. Metallurgy, TN1-997, ensemble learning, random subspace, Computer software, semi-supervised, pairwise constraints, clustering
