Publication · Preprint · Conference object · 2020

CamemBERT: a Tasty French Language Model

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot
  • Access: Open Access
  • Language: English
  • Published: 05 Jul 2020
  • Publisher: HAL CCSD
  • Country: France
Abstract
Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4 GB) leads to results that are as good as those obtained using larger datasets (130+ GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.
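
For readers who want to experiment, the released model is distributed through the Hugging Face transformers library under the name "camembert-base". The snippet below is a minimal sketch, assuming transformers with a PyTorch backend is installed, that queries the model's masked-language-modelling head (the pretraining objective used by the paper); the example sentence is illustrative only.

# Minimal sketch: querying CamemBERT's masked-language-modelling head
# through the Hugging Face transformers pipeline API.
# Assumes: pip install transformers torch
from transformers import pipeline

# "camembert-base" is the publicly released checkpoint on the Hugging Face hub.
fill_mask = pipeline("fill-mask", model="camembert-base")

# CamemBERT uses "<mask>" as its mask token; the pipeline returns the
# most probable fillers for the masked position, with their scores.
for prediction in fill_mask("Le camembert est <mask> !"):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")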
Subjects
free text keywords: [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], Computer Science - Computation and Language, Computer science, Natural language inference, Concatenation, French, Language model, Natural language processing, Transformer (machine learning model), Artificial intelligence, Named-entity recognition, Dependency grammar
Funded by

ANR | PARSITI: Parsing the Impossible, Translating the Improbable
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-16-CE33-0021

ANR | PRAIRIE: PaRis Artificial Intelligence Research InstitutE
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-19-P3IA-0001

ANR | BASNUM: Digitization and analysis of the Dictionnaire universel by Basnage de Beauval: lexicography and scientific networks
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-18-CE38-0003

ANR | SoSweet: A sociolinguistics of Twitter: social links and linguistic variations
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-15-CE38-0011