An Experimental Assessment of the Efficacy of BERTopic

Topic modelling is an unsupervised machine-learning technique for finding abstract topics in a large collection of documents. It helps organize, understand, and summarize large collections of text by discovering the latent topics that vary among the documents in a corpus. Recently developed topic modelling algorithms, such as BERTopic, have attracted growing interest from researchers. This research sheds light on the efficacy of these advanced algorithms and emphasizes the technical skills needed to conduct meaningful investigations in this domain. Efficient, fast, and scalable implementations of these algorithms are essential for handling vast corpora of text data. To support meaningful comparisons among topic modelling approaches, proficiency in data analysis and data visualization is also imperative. Python, the programming language chosen for this study, provides the flexibility and robustness required for algorithmic implementations, while a solid foundation in statistical modelling and mathematics is indispensable for accurate calculation and prediction. Specifically, the main contribution of the study is to introduce NMI (Normalized Mutual Information) and modularity, two evaluation metrics used to assess the quality of the clusters or topics generated by clustering algorithms, including those used in BERTopic. In essence, this research explores the effectiveness of state-of-the-art topic modelling algorithms and underscores the importance of technical expertise in data analysis, data visualization, Python programming, and statistical modelling for comprehensive comparisons within the field of topic modelling.
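To make the two metrics named in the abstract concrete, the following is a minimal Python sketch, not the study's own implementation. It assumes NMI is normalized by the arithmetic mean of the two label entropies (the abstract does not say which normalization is used) and that modularity is the standard Newman form computed on an undirected, unweighted edge list. The function names `nmi` and `modularity`, and the idea of passing a document graph as a plain edge list, are illustrative assumptions; how such a graph would be built from topics is not stated in the abstract.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_ij / n) * log(n * n_ij / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both assignments put everything in a single cluster
    return mutual_information(a, b) / ((h_a + h_b) / 2)


def modularity(edges, community):
    """Newman modularity for an undirected, unweighted edge list.

    edges:     list of (u, v) pairs
    community: dict mapping each node to its community label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = Counter()  # edges with both endpoints in the same community
    degree = Counter()    # sum of node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


if __name__ == "__main__":
    # Hypothetical toy data: reference labels versus predicted topic labels.
    reference = [0, 0, 1, 1]
    predicted = [1, 1, 0, 0]
    print(nmi(reference, predicted))

    # Hypothetical toy graph: two triangles joined by a single edge.
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    print(modularity(edges, groups))
```

NMI compares a predicted topic assignment against reference labels and ignores how the labels are named, while modularity needs no reference labels but does need a graph over the documents, so the two metrics answer different questions about cluster quality.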

Country

Italy

Related Organizations

University of Padua
Italy

