Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement (English: The CamemBERT contextual language models for French: impact of training data size and heterogeneity)

Authors: Louis Martin; Benjamin Muller; Pedro Ortiz Suarez; Yoan Dupont; Laurent Romary; Eric Villemonte de La Clergerie; Benoît Sagot; +1 Authors


Abstract

Contextual word embeddings have become ubiquitous in Natural Language Processing. Until recently, most available models were trained on English data or on the concatenation of corpora in multiple languages, which made the practical use of such models very limited in all languages except English. The recent release of monolingual versions of BERT (Devlin et al., 2019) for French established a new state of the art for all evaluated tasks. In this paper, based on experiments on CamemBERT (Martin et al., 2019), we show that pretraining such models on highly variable datasets leads to better downstream performance than pretraining on more uniform data. Moreover, we show that a relatively small amount of web-crawled data (4 GB) yields downstream performance as good as that of a model pretrained on a corpus two orders of magnitude larger (138 GB).
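The abstract's claims concern contextual embeddings, i.e. token representations that depend on the surrounding sentence. As a minimal illustration (not taken from the paper), the sketch below shows how such embeddings can be obtained from the publicly released camembert-base checkpoint with the Hugging Face transformers library; whether this checkpoint corresponds exactly to the variants compared in the paper (4 GB vs. 138 GB pretraining corpora) is an assumption.

    # Minimal sketch; assumes the public "camembert-base" checkpoint on the
    # Hugging Face Hub, which may differ from the paper's experimental variants.
    # Requires: pip install torch transformers sentencepiece
    import torch
    from transformers import CamembertModel, CamembertTokenizer

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    model = CamembertModel.from_pretrained("camembert-base")

    # Encode a French sentence; the tokenizer splits it into subword tokens.
    inputs = tokenizer("Le camembert est délicieux.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One 768-dimensional vector per subword token, conditioned on the whole
    # sentence, unlike static (non-contextual) word vectors.
    embeddings = outputs.last_hidden_state
    print(embeddings.shape)  # torch.Size([1, number_of_tokens, 768])

A checkpoint loaded this way can then be fine-tuned on downstream tasks, which is how pretraining corpora are compared in the paper's experiments.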

Country
France
Keywords

Impact jeu de données, Dataset impact, Modèles de langue contextuels, Contextual language models, CamemBERT, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], BERT

  • Impact by BIP!
    citations: 0 (an alternative to the "Influence" indicator; also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network, diachronically)
    popularity: Average (reflects the "current" impact/attention, the "hype", of an article in the research community at large, based on the underlying citation network)
    influence: Average (reflects the overall/total impact of an article in the research community at large, based on the underlying citation network, diachronically)
    impulse: Average (reflects the initial momentum of an article directly after its publication, based on the underlying citation network)
Funded by
  • ANR | SoSweet: A sociolinguistics of Twitter: social links and linguistic variations
    Funder: French National Research Agency (ANR); Project Code: ANR-15-CE38-0011
  • ANR | BASNUM: Digitization and analysis of the Dictionnaire universel by Basnage de Beauval: lexicography and scientific networks
    Funder: French National Research Agency (ANR); Project Code: ANR-18-CE38-0003
  • ANR | PRAIRIE: PaRis Artificial Intelligence Research InstitutE
    Funder: French National Research Agency (ANR); Project Code: ANR-19-P3IA-0001
  • ANR | PARSITI: Parsing the Impossible, Translating the Improbable
    Funder: French National Research Agency (ANR); Project Code: ANR-16-CE33-0021