
Traditional language models rely on lexical units defined as entities separated from one another by word boundary markers. Because Thai has no such boundaries, alternative definitions of lexical units must be pursued. The problem is to find the optimal set of lexical units that constitutes the vocabulary of the language model and yields the best final result. The word is the traditional lexical unit recognized by Thai people and is used by most natural language processing systems, including automatic speech recognition systems. This paper discusses the problems with using words as lexical units and investigates other lexical units for a Thai large vocabulary continuous speech recognition (LVCSR) system. The pseudo-morpheme is introduced and shown to be unsuitable for direct use as a lexical unit. A technique that uses pseudo-morphemes to improve a system based on the traditional word model is then introduced, and it yields some improvement. Next, a new lexical unit for Thai, the compound pseudo-morpheme, is presented together with an algorithm for building compound pseudo-morphemes. The experimental results show that the system using compound pseudo-morphemes outperforms the other systems. Thus, the compound pseudo-morpheme is the most suitable lexical unit for a Thai LVCSR system.
