publication . Conference object . 2020

CAMEMBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity

Martin, Louis; Muller, Benjamin; Ortiz Suárez, Pedro Javier; Dupont, Yoan; Romary, Laurent; Villemonte de la Clergerie, Eric; Sagot, Benoît; Seddah, Djamé
French
  • Published: 08 Jun 2020
  • Publisher: HAL CCSD
  • Country: France
Abstract
National audience; Contextual word embeddings have become ubiquitous in Natural Language Processing. Until recently, most available models were trained on English data or on the concatenation of corpora in multiple languages. This made the practical use of models in all languages except English very limited. The recent release of monolingual versions of BERT (Devlin et al., 2019) for French established a new state-of-the-art for all evaluated tasks. In this paper, based on experiments on CamemBERT (Martin et al., 2019), we show that pretraining such models on highly variable datasets leads to better downstream performance compared to models trained on more uniform data....
Subjects
free text keywords: CamemBERT, BERT, Dataset impact, Contextual language models, Impact jeu de données, Modèles de langue contextuels, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL]
Funded by
ANR| PRAIRIE
Project
PRAIRIE
PaRis Artificial Intelligence Research InstitutE
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-19-P3IA-0001
ANR| PARSITI
Project
PARSITI
Parsing the Impossible, Translating the Improbable
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-16-CE33-0021
ANR| BASNUM
Project
BASNUM
Digitization and analysis of the Dictionnaire universel by Basnage de Beauval: lexicography and scientific networks
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-18-CE38-0003
ANR| SoSweet
Project
SoSweet
A sociolinguistics of Twitter : social links and linguistic variations
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-15-CE38-0011