publication . Preprint . 2019

Model-based clustering in very high dimensions via adaptive projections

Taschler, Bernd; Dondelinger, Frank; Mukherjee, Sach
Open Access English
  • Published: 22 Feb 2019
Abstract
Mixture models are a standard approach to dealing with heterogeneous data with non-i.i.d. structure. However, when the dimension $p$ is large relative to sample size $n$ and where either or both of means and covariances/graphical models may differ between the latent groups, mixture models face statistical and computational difficulties and currently available methods cannot realistically go beyond $p \! \sim \! 10^4$ or so. We propose an approach called Model-based Clustering via Adaptive Projections (MCAP). Instead of estimating mixtures in the original space, we work with a low-dimensional representation obtained by linear projection. The projection dimension ...
Subjects
free text keywords: Statistics - Machine Learning, Computer Science - Machine Learning

