Probit Normal Correlated Topic Model

Publication: Article, Preprint, 01 Jan 2014 (embargo end date: 01 Jan 2014), United States
Publisher: Scientific Research Publishing, Inc.
Journal: Open Journal of Statistics, volume 4, pages 879-888 (ISSN: 2161-718X, eISSN: 2161-7198)

Authors: Yu, Xingchen; Fokoué, Ernest

doi: 10.4236/ojs.2014.411083, 10.48550/arxiv.1410.0908

arXiv: 1410.0908

Abstract

The logistic normal distribution has recently been adapted via the transformation of multivariate Gaussian variables to model the topical distribution of documents in the presence of correlations among topics. In this paper, we propose a probit normal alternative approach to modelling correlated topical structures. Our use of the probit model in the context of topic discovery is novel, as many authors have so far concentrated solely on the logistic model, partly due to the formidable inefficiency of the multinomial probit model even in the case of very small topical spaces. We herein circumvent the inefficiency of multinomial probit estimation by using an adaptation of the diagonal orthant multinomial probit in the topic models context, resulting in the ability of our topic modelling scheme to handle corpora with a large number of latent topics. An additional and very important benefit of our method lies in the fact that, unlike the logistic normal model, whose non-conjugacy leads to the need for sophisticated sampling schemes, our approach exploits the natural conjugacy inherent in the auxiliary formulation of the probit model to achieve greater simplicity. The application of our proposed scheme to a well-known Associated Press corpus not only helps discover a large number of meaningful topics but also reveals compellingly intuitive correlations among certain topics. Besides, our proposed approach lends itself to even further scalability thanks to various existing high-performance algorithms and architectures capable of handling millions of documents.
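
To make the contrast between the two link functions concrete, the following is a minimal sketch, not the authors' implementation. It assumes the diagonal orthant probit takes the form in which topic k receives weight proportional to Φ(η_k) ∏_{j≠k} Φ(−η_j), where Φ is the standard normal cumulative distribution function and η is a draw from the multivariate Gaussian. The function names and the example vector `eta` are hypothetical, chosen only for illustration.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def do_probit_proportions(eta):
    """Map a real vector eta to topic proportions with an assumed
    diagonal orthant probit link.

    Topic k gets weight Phi(eta_k) * prod_{j != k} Phi(-eta_j);
    the weights are then normalized to sum to one.
    """
    weights = []
    for k in range(len(eta)):
        w = std_normal_cdf(eta[k])
        for j in range(len(eta)):
            if j != k:
                w *= std_normal_cdf(-eta[j])  # Phi(-x) equals 1 - Phi(x)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def logistic_normal_proportions(eta):
    """Softmax map used by the logistic normal model, for comparison."""
    m = max(eta)  # subtract the maximum for numerical stability
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    return [x / total for x in exps]


# Hypothetical Gaussian draw for a document with three topics.
eta = [0.8, -0.3, -1.2]
theta_probit = do_probit_proportions(eta)
theta_logit = logistic_normal_proportions(eta)
print("probit normal  :", [round(t, 3) for t in theta_probit])
print("logistic normal:", [round(t, 3) for t in theta_logit])
```

Both maps turn the same Gaussian vector into a point on the simplex and preserve the ordering of its components. Under the stated assumption, the probit version factorizes into independent univariate normal terms, which is the property that makes an auxiliary-variable sampler tractable.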

11 pages, 2 figures and 2 tables

Country

United States

Related Organizations

Rochester Institute of Technology
United States

Keywords

Auxiliary Variable, FOS: Computer and information sciences, Computer Science - Machine Learning, Dirichlet, Machine Learning (stat.ML), Correlation Structure, Bayesian, Vocabulary, Gibbs Sampler, Computer Science - Information Retrieval, Machine Learning (cs.LG), Statistics - Machine Learning, 62H25, 62H30, Orthant, Conjugate, Topic, Efficient Sampling, Cumulative Distribution Function, Gaussian, Logit, Probit, Information Retrieval (cs.IR)

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.