Multinomial Loss on Held-out Data for the Sparse Non-negative Matrix Language Model

Preprint · English · Open Access
Chelba, Ciprian ; Pereira, Fernando (2015)
  • Subject: Computer Science - Computation and Language

We describe Sparse Non-negative Matrix (SNM) language model estimation using multinomial loss on held-out data. Being able to train on held-out data is important in practical situations where the training data is usually mismatched with the held-out/test data. It is also less constrained than the previous training algorithm, which used leave-one-out on training data: it allows the use of richer meta-features in the adjustment model, e.g., the diversity counts used by Kneser-Ney smoothing, which would be difficult to handle correctly in leave-one-out training. In experiments on the one-billion-word language modeling benchmark, we slightly improve on our previous results, which used a different loss function and employed leave-one-out training on a subset of the main training set. Surprisingly, an adjustment model with meta-features that discard all lexical information can perform as well as lexicalized meta-features. We find that fairly small amounts of held-out data (on the order of 30-70 thousand words) are sufficient for training the adjustment model. In a real-life scenario where the training data is a mix of data sources that are imbalanced in size and of varying relevance to the held-out and test data, taking into account the data source for a given skip-/n-gram feature and combining sources for best performance on held-out/test data improves over skip-/n-gram SNM models trained on pooled data by about 8% in the SMT setup, and by as much as 15% in the ASR/IME setup. The ability to mix various data sources based on how relevant they are to a mismatched held-out set is probably the most attractive feature of the new estimation method for SNM LMs.
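The core idea (reweighting training-data feature statistics by an adjustment model fit to minimize multinomial loss on held-out data) can be illustrated with a minimal sketch. Everything below is hypothetical toy data: a single scalar adjustment weight, one made-up meta-feature (a count bucket), and finite-difference gradient descent instead of the paper's actual SNM parameterization and optimizer.

```python
import math

VOCAB = ["a", "b", "c"]

# Relative frequencies P(w | h) estimated on (mismatched) training data.
rel_freq = {
    ("h1", "a"): 0.7, ("h1", "b"): 0.2, ("h1", "c"): 0.1,
    ("h2", "a"): 0.3, ("h2", "b"): 0.3, ("h2", "c"): 0.4,
}

# One toy meta-feature per (h, w): 1.0 if the raw relative frequency was "high".
meta = {k: (1.0 if v > 0.25 else 0.0) for k, v in rel_freq.items()}

def prob(h, w, theta):
    """p(w | h) proportional to rel_freq * exp(theta * meta), renormalized."""
    scores = {v: rel_freq[(h, v)] * math.exp(theta * meta[(h, v)]) for v in VOCAB}
    z = sum(scores.values())
    return scores[w] / z

def heldout_loss(data, theta):
    """Multinomial (cross-entropy) loss on held-out (h, w) pairs."""
    return -sum(math.log(prob(h, w, theta)) for h, w in data) / len(data)

# Held-out data drawn from a distribution that favors the low-count events,
# i.e. deliberately mismatched with the training counts above.
heldout = [("h1", "b"), ("h1", "c"), ("h2", "c"), ("h1", "b")]

# Plain gradient descent via finite differences, purely for illustration.
theta, lr, eps = 0.0, 0.5, 1e-5
for _ in range(200):
    g = (heldout_loss(heldout, theta + eps)
         - heldout_loss(heldout, theta - eps)) / (2 * eps)
    theta -= lr * g

# The learned weight is negative: high-count features get down-weighted,
# shifting probability mass toward the events the held-out data favors.
print(round(theta, 3))
```

Because the held-out set is mismatched with the training counts, minimizing the held-out loss drives the adjustment weight below zero, which is the mechanism the abstract describes: the adjustment model corrects raw training-data statistics toward the held-out distribution.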
