An adaptive method for determining the optimal number of topics in topic modeling

Article, 28 Feb 2025, English

Publisher: PeerJ

Journal: PeerJ Computer Science, volume 11, page e2723 (eissn: 2376-5992)

Authors: Yang Xu; Yueyi Zhang; Yefang Sun; Hanting Zhou;

doi: 10.7717/peerj-cs.2723

pmid: 40062265

pmc: PMC11888909

Abstract

Topic models have been successfully applied to information classification and retrieval. The main difficulty in applying them is selecting an appropriate number of topics for a given corpus. Too few topics cause information loss and topic omission (underfitting); too many introduce noise and complexity (overfitting). This article therefore considers the inter-class distance and proposes a new method, named average inter-class distance change rate (AICDR), that determines the number of topics from clustering results. AICDR uses Ward's method to calculate inter-class distances, computes the average inter-class distance for each candidate number of topics, and determines the optimal number of topics from the rate of change of that average distance. Experiments show that the number of topics determined by AICDR is more consistent with the true classification of the datasets, with high inter-class distance and low inter-class similarity, which avoids topic overlap. AICDR selects the optimal number of topics from clustering results and adapts well to various topic models.
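The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the change-rate definition and the selection rule below are assumptions, and the function names (`ward_distance`, `average_inter_class_distance`, `change_rates`, `select_k`) are hypothetical. The only part taken from standard practice is the Ward merge cost, i.e. the increase in within-cluster sum of squares when two clusters are merged. Clusters are assumed to be given already, one list of vectors per class, for each candidate number of topics.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def ward_distance(a, b):
    """Ward merge cost between clusters a and b: the increase in
    within-cluster sum of squares caused by merging them."""
    ca, cb = centroid(a), centroid(b)
    squared_gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_gap


def average_inter_class_distance(clusters):
    """Mean Ward distance over all unordered pairs of clusters (needs >= 2)."""
    pairs = list(combinations(clusters, 2))
    return sum(ward_distance(a, b) for a, b in pairs) / len(pairs)


def change_rates(avg_by_k):
    """ASSUMPTION: relative change of the average distance between
    consecutive candidate topic counts; the paper's formula may differ."""
    ks = sorted(avg_by_k)
    return {
        k: (avg_by_k[k] - avg_by_k[prev]) / avg_by_k[prev]
        for prev, k in zip(ks, ks[1:])
    }


def select_k(avg_by_k):
    """ASSUMPTION (placeholder rule): pick the k with the largest
    relative increase in average inter-class distance."""
    rates = change_rates(avg_by_k)
    return max(rates, key=rates.get)
```

In use, one would fit a topic model for each candidate number of topics, group the resulting vectors into classes, call `average_inter_class_distance` on each grouping, and pass the resulting `{k: average}` mapping to `select_k`. The selection rule is the part most likely to differ from the published method and should be replaced with the criterion given in the full article.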

Related Organizations

China Jiliang University, China (People's Republic of)
Hangzhou Dianzi University, China (People's Republic of)

Keywords

Optimal number of topics, Algorithms and Analysis of Algorithms, AICDR, Inter-class distance, Electronic computers. Computer science, QA75.5-76.95, Topic modeling

Impact by BIP!

- Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator.
- Popularity: Average. Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
