A New Sentence-Based Interpretative Topic Modeling and Automatic Topic Labeling

Publication: Article, Other literature type
Date: 10 May 2021
Language: English
Publisher: MDPI AG
Journal: Symmetry, volume 13, page 837 (eISSN: 2073-8994)

Authors: Olzhas Kozbagarov; Rustam Mussabayev; Nenad Mladenovic

doi: 10.3390/sym13050837

Abstract

This article presents a new conceptual approach to the interpretative topic modeling problem. It uses sentences as the basic units of analysis, instead of the words or n-grams commonly used in standard approaches. The approach is distinguished by its use of sentence probability evaluations within the text corpus and by clustering of sentence embeddings. The topic model estimates discrete distributions of sentence occurrences within topics and discrete distributions of topic occurrences within the text. Our approach makes explicit interpretation of topics possible because sentences, unlike words, are more informative and form complete grammatical and semantic constructions. A method for automatic topic labeling is also provided. Contextual embeddings based on the BERT model are used to obtain the sentence embeddings for subsequent analysis. Moreover, our approach allows big data processing and shows that internal and external knowledge sources can be combined in the process of topic modeling. The internal knowledge source is the text corpus itself, which is often the only knowledge source in traditional topic modeling approaches. The external knowledge source is BERT, a machine learning model pre-trained on a large amount of textual data and used to generate the context-dependent sentence embeddings.
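
The pipeline outlined in the abstract (embed sentences, cluster the embeddings, then read off distributions and a label per topic) can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption: the toy corpus, the hand-made 2-D vectors standing in for BERT sentence embeddings, the use of Lloyd's k-means heuristic as the MSSC solver, the farthest-first initialization, the softmax-over-distance form of p(sentence | topic), the sentence-share form of p(topic | document), and the choice of the most probable sentence as the topic label.

```python
import math

# Hypothetical toy corpus: each sentence is paired with a hand-made 2-D vector
# standing in for a BERT sentence embedding (real ones have hundreds of dimensions).
corpus = {
    "doc_a": [
        ("Stocks rose sharply.", (0.0, 0.0)),
        ("Markets closed higher.", (0.2, 0.1)),
        ("The team won the final.", (5.0, 5.0)),
    ],
    "doc_b": [
        ("The striker scored twice.", (5.1, 4.9)),
        ("Fans celebrated the win.", (4.9, 5.2)),
        ("Investors bought bonds.", (0.1, 0.1)),
    ],
}


def sq_dist(a, b):
    """Squared Euclidean distance, the quantity summed in the MSSC objective."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's heuristic for MSSC with deterministic farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: no sentence changed topic
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Flatten the corpus so every sentence embedding is clustered together.
records = [(doc, text, vec) for doc, sents in corpus.items() for text, vec in sents]
points = [vec for _, _, vec in records]
assignment, centroids = kmeans(points, k=2)

# p(topic | document): share of the document's sentences in each topic (assumption).
topic_given_doc = {}
for doc in corpus:
    topics = [a for (d, _, _), a in zip(records, assignment) if d == doc]
    topic_given_doc[doc] = [topics.count(t) / len(topics) for t in range(len(centroids))]

# p(sentence | topic): softmax over negative squared distance to the centroid (assumption).
sentence_given_topic = {}
topic_labels = {}
for t, centroid in enumerate(centroids):
    members = [(text, vec) for (_, text, vec), a in zip(records, assignment) if a == t]
    weights = [math.exp(-sq_dist(vec, centroid)) for _, vec in members]
    total = sum(weights)
    sentence_given_topic[t] = {text: w / total for (text, _), w in zip(members, weights)}
    # Label the topic with its most probable sentence (assumption).
    topic_labels[t] = max(sentence_given_topic[t], key=sentence_given_topic[t].get)

for t, label in topic_labels.items():
    print(f"topic {t}: {label}")
for doc, dist in topic_given_doc.items():
    print(doc, [round(p, 2) for p in dist])
```

On this toy data the finance sentences and the sports sentences fall into separate clusters, so each topic is labeled by a full sentence rather than a list of keywords. In a realistic setting the hand-made vectors would be replaced by embeddings from a pre-trained BERT model, and the clustering step by an MSSC solver suited to large corpora.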

Related Organizations

Institute of Information and Computational Technologies
Kazakhstan
Khalifa University of Science and Technology
United Arab Emirates

Keywords

machine learning, big data, minimum sum-of-squares clustering (MSSC), topic modeling, automatic topic labeling, natural language processing, transfer learning, BERT

Impact by BIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	7
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Top 10%
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Top 10%