
This paper describes the automatic processing of a text corpus collected from the news feeds of several Internet sites, with the aim of building a probabilistic n-gram model of spoken Russian. A statistical analysis of the corpus is presented, together with counts of how often different word n-grams occur. A review of state-of-the-art statistical language models is also given.
LANGUAGE MODEL, RUSSIAN TEXT CORPUS, AUTOMATIC TEXT PROCESSING
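As a rough illustration of the word n-gram counting mentioned in the abstract, the following minimal sketch counts n-grams over a list of sentences and derives a relative-frequency bigram estimate. It is not the paper's pipeline: the function names (`tokenize`, `count_ngrams`, `bigram_probability`), the regex-based tokenization, and the absence of any smoothing or text normalization are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (hypothetical, simplistic tokenizer).

    The pattern is Unicode-aware, so Cyrillic words are matched as well.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def count_ngrams(sentences, n):
    """Count word n-grams; n-grams never span sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def bigram_probability(bigrams, unigrams, w1, w2):
    """Relative-frequency estimate P(w2 | w1) = c(w1, w2) / c(w1).

    Returns 0.0 when w1 was never seen; no smoothing is applied (an assumption).
    """
    c1 = unigrams[(w1,)]
    return bigrams[(w1, w2)] / c1 if c1 else 0.0
```

Calling `count_ngrams(sentences, 1)` and `count_ngrams(sentences, 2)` on the same sentence list yields the two tables that `bigram_probability` expects. A real language model built from a news corpus would normally add smoothing or back-off for unseen n-grams.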
