Text Categorization with Latent Dirichlet Allocation

ZLACKÝ Daniel; STAŠ Ján; JUHÁR Jozef; CIŽMÁR Anton

Journal of Electrical and Electronics Engineering

Article . 2014

Data sources: DOAJ

Publication: Article, 01 May 2014, English. Publisher: Editura Universităţii din Oradea. Journal: Journal of Electrical and Electronics Engineering (ISSN: 1844-6035).


Abstract

This paper focuses on categorizing Slovak text corpora with latent Dirichlet allocation (LDA). Our goal is to build text subcorpora that group similar documents, and to use these better-organized subcorpora to build more robust language models for speech recognition systems. Our previous research on text categorization showed that categorized text corpora yield better results. Here we divided the initial text corpus into 2, 5, 10, 20, or 100 subcorpora, using various numbers of iterations and save steps. Language models were built on these subcorpora and adapted to the judicial domain with linear interpolation. The experimental results showed that text categorization using latent Dirichlet allocation can improve an automatic speech recognition system by building the language models from organized text corpora.
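
The linear interpolation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes toy unigram models with add-one smoothing in place of real n-gram language models, and the sentences, the function names (unigram_model, interpolate, perplexity), and the weight lam = 0.5 are all invented for illustration. The interpolated probability is P(w) = lam * P_domain(w) + (1 - lam) * P_subcorpus(w).

```python
from collections import Counter
import math


def unigram_model(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary.

    Assumes every token in `tokens` is contained in `vocab`.
    """
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_domain, p_subcorpus, lam):
    """Linear interpolation: lam * P_domain(w) + (1 - lam) * P_subcorpus(w)."""
    return {w: lam * p_domain[w] + (1.0 - lam) * p_subcorpus[w] for w in p_domain}


def perplexity(model, tokens):
    """Perplexity of a unigram model on a token sequence."""
    log_sum = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))


# Toy data, invented for illustration only (not from the paper).
subcorpus = "the court heard the case and the judge ruled".split()
in_domain = "the judge ruled that the appeal was dismissed".split()
held_out = "the court dismissed the appeal".split()

# Shared vocabulary so that both models assign a probability to every word.
vocab = set(subcorpus) | set(in_domain) | set(held_out)

p_sub = unigram_model(subcorpus, vocab)   # model built on one subcorpus
p_dom = unigram_model(in_domain, vocab)   # model built on in-domain text
p_mix = interpolate(p_dom, p_sub, lam=0.5)

print("subcorpus model perplexity:", perplexity(p_sub, held_out))
print("interpolated perplexity:   ", perplexity(p_mix, held_out))
```

In practice the interpolation weight would typically be tuned on held-out in-domain text rather than fixed by hand; the value used here is only a placeholder.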

Keywords

language modeling, text categorization, speech recognition, latent Dirichlet allocation, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
