publication . Conference object . 2016

MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP

Bérard, Alexandre; Servan, Christophe; Pietquin, Olivier; Besacier, Laurent
Open Access English
  • Published: 23 May 2016
  • Publisher: HAL CCSD
Abstract
We present MultiVec, a new toolkit for computing continuous representations for text at different granularity levels (word-level or sequences of words). MultiVec includes Mikolov et al. [2013b]'s word2vec features, Le and Mikolov [2014]'s paragraph vector (batch and online) and Luong et al. [2015]'s model for bilingual distributed representations. MultiVec also includes different distance measures between words and sequences of words. The toolkit is written in C++ and is aimed at being fast (in the same order of magnitude as word2vec), easy to use, and easy to extend. It has been evaluated on several NLP tasks: the analogical reasoning ta...
Subjects
free text keywords: [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing, crosslingual document classification, bilingual word embeddings, word embeddings, paragraph vector

Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. The Journal of Machine Learning Research (JMLR), 3:1137-1155, 2003.

S. Gouws, Y. Bengio, and G. Corrado. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. In Proceedings of the International Conference on Machine Learning (ICML), 2015.

A. Klementiev, I. Titov, and B. Bhattarai. Inducing crosslingual distributed representations of words. In Proceedings of the International Conference on Computational Linguistics (COLING), 2012.

Q. V. Le and T. Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of the International Conference on Machine Learning (ICML), 2014.

T. Luong, H. Pham, and C. D. Manning. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015.

G. Mesnil, M. Ranzato, T. Mikolov, and Y. Bengio. Ensemble of generative and discriminative techniques for sentiment analysis of movie reviews. arXiv:1412.5335 [cs], 2014.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. In The Workshop Proceedings of the International Conference on Learning Representations (ICLR), May 2013a.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems (NIPS), 2013b.

A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems (NIPS), 2013.

R. Navigli and S. P. Ponzetto. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence, 193:217-250, 2012.

H. Pham, M.-T. Luong, and C. D. Manning. Learning Distributed Representations for Multilingual Text Sequences. In Proceedings of NAACL-HLT, 2015.

Princeton University. About WordNet. Technical report, Princeton University, 2012.

R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.

G. Sérasset. Dbnary: Wiktionary as a LMF based Multilingual RDF network. In Proceedings of the Language Resources and Evaluation Conference (LREC), 2012.
