Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics

Publication: Article; 03 Jan 2024; English; Publisher: PeerJ; Journal: PeerJ Computer Science, volume 10, page e1758 (eissn: 2376-5992)

Authors: Sergei Koltcov; Anton Surkov; Vladimir Filippov; Vera Ignatenko

doi: 10.7717/peerj-cs.1758

pmid: 38196953

pmc: PMC10773852

Abstract

Topic modeling is a widely used instrument for the analysis of large text collections. In the last few years, neural topic models and models with word embeddings have been proposed to increase the quality of topic solutions. However, these models have not been extensively tested in terms of stability and interpretability. Moreover, selecting the number of topics (a model parameter) remains a challenging task. We aim to partially fill this gap by testing four well-known topic models that are available to a wide range of users: the embedded topic model (ETM), the Gaussian Softmax distribution model (GSM), Wasserstein autoencoders with Dirichlet prior (W-LDA), and Wasserstein autoencoders with Gaussian Mixture prior (WTM-GMM). We demonstrate that W-LDA, WTM-GMM, and GSM possess poor stability, which complicates their application in practice. The ETM model with additionally trained embeddings demonstrates high coherence and rather good stability for large datasets, but the question of the number of topics remains unsolved for this model. We also propose a new topic model based on granulated sampling with word embeddings (GLDAW), which demonstrates the highest stability and good coherence among the considered models. Moreover, the optimal number of topics in a dataset can be determined for this model.

Related Organizations

Scientific Research Engineering Institute, Russian Federation
National Research University Higher School of Economics, Russian Federation

Keywords

Optimal number of topics, Renyi entropy, Neural topic models, Electronic computers. Computer science, Data Mining and Machine Learning, QA75.5-76.95, Stability, Coherence, Topic modeling

Impact by BIP!

- Selected citations: 6. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.