One billion word benchmark for measuring progress in statistical language modeling

Publication: Article, Preprint, Conference object. Date: 14 Sep 2014. Embargo end date: 01 Jan 2013. Publisher: ISCA. Journal: Interspeech 2014.

Authors: Ciprian Chelba; Tomás Mikolov; Mike Schuster; Qi Ge; Thorsten Brants; Philipp Koehn; Tony Robinson

doi: 10.21437/interspeech.2014-564, 10.48550/arxiv.1312.3005

arXiv: 1312.3005

Abstract

We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
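
The relationship between the perplexity and cross-entropy figures quoted in the abstract can be checked with a little arithmetic. The following is a minimal sketch, not code from the benchmark project: the function names are hypothetical, and the logarithm base of the released per-word log-probability values is not stated here, so it is left as a parameter.

```python
import math


def cross_entropy_bits(perplexity):
    """Cross-entropy in bits per word for a given perplexity (PPL = 2**H)."""
    return math.log2(perplexity)


def perplexity_from_log_probs(log_probs, base=10.0):
    """Perplexity from per-word log-probabilities.

    `base` is an assumption: the source does not say which logarithm base
    the released log-probability values use.
    """
    avg_log_prob = sum(log_probs) / len(log_probs)
    return base ** (-avg_log_prob)


# Figures taken from the abstract.
baseline_ppl = 67.6           # unpruned Kneser-Ney 5-gram
ppl_reduction = 0.35          # 35% reduction in perplexity

combined_ppl = baseline_ppl * (1.0 - ppl_reduction)

baseline_bits = cross_entropy_bits(baseline_ppl)   # roughly 6.08 bits
combined_bits = cross_entropy_bits(combined_ppl)   # roughly 5.46 bits

# Relative reduction in cross-entropy, roughly 0.10 (the abstract's "10%").
relative_bits_reduction = (baseline_bits - combined_bits) / baseline_bits

if __name__ == "__main__":
    print(f"baseline: PPL {baseline_ppl:.1f}, {baseline_bits:.2f} bits")
    print(f"combined: PPL {combined_ppl:.1f}, {combined_bits:.2f} bits")
    print(f"cross-entropy reduction: {relative_bits_reduction:.1%}")
```

Because cross-entropy is the logarithm of perplexity, a large relative drop in perplexity corresponds to a much smaller relative drop in bits, which is why the abstract reports both numbers.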

Accompanied by a code.google.com project allowing anyone to generate the benchmark data, and use it to compare their language model against the ones described in the paper

Keywords

FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)

Impact by BIP!

selected citations: 197. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 0.1%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 0.1%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open access route: Green

Fields of Science

natural sciences