Describing and Modelling Reference Chains: Tools for Corpus Annotation (including diachronic and comparative language studies) and Automatic Processing
Funder: French National Research Agency (ANR)
Project code: ANR-15-CE38-0008
Funder Contribution: 385,736 EUR
The DEMOCRAT project aims to advance linguistic research on French, in particular on questions of text structuring, through a detailed and contrastive analysis of reference chains (successive references to the same entity) in a varied corpus of texts covering the entire history of written French (9th-21st centuries). The project will make available to the scientific community: (i) an integrated, discourse-oriented model of reference and reference chains, (ii) an annotated corpus that can serve both as a reference corpus and as a training corpus for international evaluation campaigns on coreference, (iii) a tool for manual annotation, computer-aided annotation and annotated data management, and (iv) a system for the automatic identification of coreferences. The corpus to be annotated with reference chains will be one million words long, i.e. about 100,000 annotated units.

Motivations: (i) the need for an integrated model of referring expressions that allows the modelling of reference chains and that is both linguistically precise and formal enough to support computational applications, (ii) the need for attested linguistic data, diachronic in particular, that make it possible to assess variation in chain composition and that can serve as a reference corpus for French annotated with semantic data, not only morphological or syntactic data, (iii) the need for a unified platform for corpus management, from visualization to querying and statistical computation, including annotation of phenomena from various linguistic dimensions, and (iv) the need for a natural language processing tool for identifying reference chains in French.

Model and corpus. Although many descriptions of referring expressions exist, there is no integrated description suited to modelling coreference chains. 
Nor is there any prediction about their typologies or their textual behavior. Moreover, no corpus allows an assessment of the historical development of their composition, and none allows a cross-linguistic comparison of how they are composed. One corpus with anaphora annotations (ANCOR) exists for spoken French, but no annotated corpus is available for written French, i.e. for texts involving long reference chains. The project therefore aims to collect a working corpus that is relevant and diversified enough to account for the varied compositional modes of reference chains, and to provide theoretical hypotheses on the notion of reference chain. These hypotheses should make it possible to annotate the documents. This should also facilitate the improvement of existing annotation tools. Copyright-free databases of annotated Old French texts will be used and enriched: the Corpus Représentatif des Premiers Textes Français, the Base de Français Médiéval and the Syntactic Reference Corpus of Medieval French. For contemporary French, we will use extracts from the ANR ORFEO corpus.

Tool. The design and implementation of an annotation software platform, based on the TXM platform and enriched with ANALEC’s dynamic annotation functionalities, will yield a new, unified framework for efficient and ergonomic annotation and for running experiments on computer-aided annotation.

NLP system. To obtain a system for the automatic identification of reference chains, we will, on the one hand, use and optimize CROC (Coreference Resolution for Oral Corpus), a prototype designed and implemented at LATTICE using machine learning techniques, and, on the other hand, explore the design of hybrid systems that combine several kinds of machine learning techniques with knowledge-based rules such as those of RefGen, a tool designed at LILPA. 
DEMOCRAT will thus provide the first NLP system dedicated to the automatic detection of coreference chains for French. This system will take part in international evaluation campaigns.