Topic model validation

Publication: Article, 01 Jan 2012, Italy, English

Publisher: Elsevier BV

Journal: Neurocomputing, volume 76, pages 125-133 (ISSN: 0925-2312)

Authors: Eduardo H. Ramírez; Ramón F. Brena; Davide Magatti; Fabio Stella;

doi: 10.1016/j.neucom.2011.04.032

handle: 10281/25061

Abstract

In this paper the problem of performing external validation of the semantic coherence of topic models is considered. The Fowlkes-Mallows index, a known clustering validation metric, is generalized for the case of overlapping partitions and multi-labeled collections, thus making it suitable for validating topic modeling algorithms. In addition, we propose new probabilistic metrics inspired by the concepts of recall and precision. The proposed metrics also have clear probabilistic interpretations and can be applied to validate and compare other soft and overlapping clustering algorithms. The approach is exemplified by using the Reuters-21578 multi-labeled collection to validate LDA models, then using Monte Carlo simulations to show the convergence to the correct results. Additional statistical evidence is provided to better understand the relation of the metrics presented.
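For orientation, the classic Fowlkes-Mallows index that the abstract takes as its starting point can be sketched by pair counting. This is a minimal illustration of the standard index for hard, non-overlapping partitions only; it is not the paper's generalization to overlapping partitions and multi-labeled collections, which the abstract does not spell out. The function name and the label-list inputs are assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    """Classic Fowlkes-Mallows index for two hard (non-overlapping) partitions.

    Illustrative sketch only: each item carries exactly one reference label
    and one predicted cluster label.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    tp = fp = fn = 0
    # Classify every unordered pair of items by whether the two items
    # share a reference label and/or share a predicted cluster.
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both partitions
        elif same_pred:
            fp += 1  # together only in the predicted partition
        elif same_true:
            fn += 1  # together only in the reference partition

    if tp == 0:
        return 0.0
    # Geometric mean of pairwise precision and pairwise recall.
    return sqrt((tp / (tp + fp)) * (tp / (tp + fn)))
```

Identical partitions score 1.0, and partitions that never place the same pair together score 0.0. A multi-labeled collection such as Reuters-21578 cannot be fed to this function directly, because a document may carry several labels at once.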

Country

Italy

Keywords

Fowlkes-Mallows index; Monte Carlo; Soft clustering; Topic models

Impact by BIP!

selected citations: 31

popularity: Top 10%

influence: Top 10%

impulse: Top 10%