publication . Preprint . Conference object . 2020

Establishing a New State-of-the-Art for French Named Entity Recognition

Ortiz Suárez, Pedro Javier; Dupont, Yoann; Muller, Benjamin; Romary, Laurent; Sagot, Benoît;
Open Access English
  • Published: 11 May 2020
  • Publisher: HAL CCSD
  • Country: France
Abstract
Due to the COVID-19 pandemic, the 12th edition of the conference was cancelled. The LREC 2020 Proceedings are available at http://www.lrec-conf.org/proceedings/lrec2020/index.html. International audience. The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information about named entities, which are among the most useful pieces of information for several natural language processing tasks and applications. Moreover, no large-scale French corpus with named entity annotations contains referential information, which complements the type and the span of each mention with an indication of the entity it refers to. We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step. We sketch the underlying annotation guidelines and provide a few figures about the resulting annotations.
Subjects
free text keywords: Named Entity Recognition, French, Language Modeling, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Computer Science - Computation and Language
Funded by
ANR| PRAIRIE
Project
PRAIRIE
PaRis Artificial Intelligence Research InstitutE
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-19-P3IA-0001
ANR| BASNUM
Project
BASNUM
Digitization and analysis of the Dictionnaire universel by Basnage de Beauval: lexicography and scientific networks
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-18-CE38-0003