
pmid: 32956065
In the unsupervised learning literature, clustering of microarray gene expression datasets has been studied extensively, with nonnegative matrix factorization (NMF), spectral clustering, k-means, and Gaussian mixture models (GMM) among the most widely used methods. However, few works use statistical analysis to measure the significance of the performance differences between these methods. This paper presents a statistical analysis of the performance differences between ten NMF, six spectral clustering, four GMM, and the standard k-means algorithms in clustering eleven publicly available microarray gene expression datasets, with the number of clusters ranging from two to ten. The experimental results show that, statistically, NMF and k-means perform similarly and both outperform spectral clustering. Because spectral clustering can uncover hidden manifold structures, the underperformance of the spectral methods leads us to question whether the datasets have manifold structures. Visual inspection using multidimensional scaling plots indicates that such structures do not exist. Moreover, as the plots indicate that clusters in some datasets have elliptical boundaries, GMM methods are also utilized. The experimental results show that GMM methods outperform the other methods to some degree, implying that the datasets follow Gaussian distributions.
Normal Distribution, Cluster Analysis, Algorithms
