publication . Article . Preprint . 2017

Stability of topic modeling via matrix factorization

Brian Mac Namee;
Open Access English
  • Published: 23 Feb 2017
  • Publisher: Elsevier
  • Country: Ireland
Abstract
The problem of the instability of standard topic modeling algorithms is investigated. Three new stability measures for topic models are proposed. Two new ensemble approaches for topic modeling with matrix factorization are proposed. A detailed evaluation of these approaches is performed on 10 text corpora. Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, whic...
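To make the notion of instability concrete, the following minimal sketch compares the topic descriptors (top-ranked terms) produced by two runs of a topic model that differ only in their random initialization. It is an illustration under stated assumptions, not one of the three stability measures proposed in the paper, which this record does not describe: the function names `jaccard` and `descriptor_agreement` and the toy term lists are hypothetical.

```python
from typing import List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity between two term lists, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty descriptors are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


def descriptor_agreement(run_a: List[Sequence[str]],
                         run_b: List[Sequence[str]]) -> float:
    """Average, over topics in run_a, of the best Jaccard match in run_b.

    Hypothetical illustrative score: 1.0 means every topic in run_a has an
    identical descriptor somewhere in run_b; lower values indicate that the
    two runs disagree about the topics.
    """
    if not run_a or not run_b:
        raise ValueError("both runs must contain at least one topic")
    best_matches = [max(jaccard(t, u) for u in run_b) for t in run_a]
    return sum(best_matches) / len(best_matches)


# Toy descriptors standing in for two runs with different random seeds.
run_1 = [["match", "goal", "league"], ["vote", "party", "election"]]
run_2 = [["party", "election", "vote"], ["goal", "league", "coach"]]

score = descriptor_agreement(run_1, run_2)
print(score)
```

Because each topic is matched to its best counterpart, the score does not depend on the order in which a run happens to emit its topics. Note that this simple version is not symmetric in its two arguments.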
Subjects
free text keywords: Topic modeling, Topic stability, LDA, NMF, Computer Science - Information Retrieval, Computer Science - Computation and Language, Computer Science - Learning, Statistics - Machine Learning