Publication · Preprint · Conference object · 2020

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot
Open Access · English
  • Published: 05 Jul 2020
  • Country: France
Abstract
We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for several mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In pa...
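The two artifacts behind this abstract, the OSCAR corpus and ELMo checkpoints trained on it, can both be experimented with directly. The snippet below is a minimal illustrative sketch, not the authors' pipeline: it shows one way to stream a slice of OSCAR through the Hugging Face `datasets` hub loader and to query an ELMo checkpoint with AllenNLP's inference API. Bulgarian is used only as an example language, and the `elmo_bg_*` file paths are hypothetical placeholders for whatever checkpoint you have locally, not released URLs.

    from datasets import load_dataset
    from allennlp.modules.elmo import Elmo, batch_to_ids

    # Stream the deduplicated Bulgarian portion of OSCAR without
    # downloading the full corpus; each record is one crawled document.
    oscar_bg = load_dataset("oscar", "unshuffled_deduplicated_bg",
                            split="train", streaming=True)
    sample = next(iter(oscar_bg))["text"]

    # Query a trained ELMo checkpoint (placeholder file names; supply
    # your own options/weights produced by ELMo training).
    elmo = Elmo(options_file="elmo_bg_options.json",
                weight_file="elmo_bg_weights.hdf5",
                num_output_representations=1,
                dropout=0.0)
    char_ids = batch_to_ids([sample.split()[:20]])  # naive whitespace tokens
    vectors = elmo(char_ids)["elmo_representations"][0]
    print(vectors.shape)  # (1, num_tokens, embedding_dim)

Note that this covers only the inference side; training ELMo from scratch on a corpus such as OSCAR is typically done with the original bilm-tf implementation rather than through this API.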
Subjects
ACM Computing Classification System: Information Systems / Information Storage and Retrieval
free text keywords: Computer Science - Computation and Language, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], Parsing, Computer science, Artificial intelligence, Embedding, Natural language processing
Funded by
ANR | BASNUM: Digitization and analysis of the Dictionnaire universel by Basnage de Beauval: lexicography and scientific networks
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-18-CE38-0003
ANR | PRAIRIE: PaRis Artificial Intelligence Research InstitutE
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-19-P3IA-0001