Conference object . 2014
Data sources: Hal

BaTelÒc: a Text Base for the Occitan Language

Authors: Bras, Myriam; Vergez-Couret, Marianne

Abstract

Language documentation, as defined by Himmelmann (2006), aims at compiling and preserving linguistic data for studies in linguistics, literature, history, ethnology and sociology. This work is vital for endangered languages such as Occitan. Occitan is a Romance language spoken in southern France and in several valleys of Spain and Italy. The number of speakers is hard to estimate: according to several studies, it lies between 600,000 and 2,000,000 (Martel, 2007; Sibille, 2010). Occitan is not a unitary language: it comprises several varieties grouped into dialects and is not standardised as a whole. It has nevertheless been written since the Middle Ages, and a substantial body of literature has been produced. The documentation of a language concerns all its modalities, covering spoken and written language, various registers, and so on. Today, Occitan documentation consists mostly of linguistic atlas data (THESOC); virtual libraries covering the modern period (Bibliotheca Tholosana Occitana, 16th-18th centuries) and the contemporary period (CIEL d'Òc, 19th-21st centuries); and text bases for the medieval period (the Concordancier Occitan Médiéval (Peter Ricketts) and the Linguistic Corpus of Medieval Gascon (Thomas Field)). The BaTelÒc project (Bras, 2006; Bras and Thomas, 2011) is a text base for the modern and contemporary periods. It aims to build wide-coverage text collections by gathering written literary texts (prose, drama and poetry) and other genres such as technical texts and newspapers. One million words have already been gathered, and enough material is available to foresee a text base of hundreds of millions of words. Language documentation offers well-documented material that can serve linguistic analysis: from this material, linguists can extract a coherent corpus for their own specific studies (Cox, 2011). BaTelÒc is not only a project for documenting the Occitan language; it also aims to provide tools for querying texts.
BaTelÒc lets users select their own study corpus according to various criteria: the author's name, the book's title, the year of publication, the genre, the dialect of Occitan, and the spelling norm. It includes concordance tools that show forms in context (a word, part of a word, or a sequence of words), and it supports more complex queries through a regular-expression language. In the future, it should include tools for searching co-occurrences and calculating frequencies. For linguistic analysis, the logical second step is to enrich the corpora with annotations. Computational processing of an endangered language such as Occitan is very challenging. Existing models for resource-rich languages cannot be transposed directly, partly because of spelling and dialectal variation and the lack of standardisation. Corpora are a basis for developing dictionaries and lexicons; conversely, dictionaries and lexicons are needed to support the development of corpora and their annotation. We aim to provide corpora and lexicons in order to develop basic natural language processing tools, namely OCR (Urieli & Vergez-Couret, 2013) and a part-of-speech tagger that handles variation. This two-fold effort, combining language documentation and natural language processing (or language technology), is the path chosen by several researchers working on less-resourced languages in Europe, as shown by the TALARE 2013 workshop on Natural Language Processing of European Regional Languages, in which we are involved.
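
The querying features described in the abstract (selecting a study corpus by metadata, then showing regular-expression matches in context) can be illustrated with a minimal Python sketch. This is not BaTelÒc's implementation: the names `Text`, `select` and `concordance`, the metadata fields, and the two toy sentences are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Text:
    # Hypothetical metadata fields mirroring the selection criteria in the abstract.
    author: str
    title: str
    year: int
    genre: str
    dialect: str
    body: str


def select(texts, **criteria):
    """Keep only the texts whose metadata equal every given criterion."""
    return [t for t in texts
            if all(getattr(t, key) == value for key, value in criteria.items())]


def concordance(texts, pattern, width=30):
    """Return (title, left context, match, right context) for each regex match."""
    regex = re.compile(pattern, re.IGNORECASE)
    rows = []
    for t in texts:
        for m in regex.finditer(t.body):
            left = t.body[max(0, m.start() - width):m.start()]
            right = t.body[m.end():m.end() + width]
            rows.append((t.title, left, m.group(0), right))
    return rows


# Invented toy corpus (placeholder authors, titles and sentences, not BaTelÒc data).
corpus = [
    Text("Author A", "Toy text 1", 1900, "prose", "lengadocian",
         "Lo can es dins l'ostal."),
    Text("Author B", "Toy text 2", 1950, "prose", "gascon",
         "Lo can qu'ei dens l'ostau."),
]

# Search for word forms beginning with "ost" in the Lengadocian sub-corpus only.
for row in concordance(select(corpus, dialect="lengadocian"), r"\bost\w*"):
    print(row)
```

Searching with a pattern rather than a fixed string is one simple way to retrieve spelling and dialectal variants of a form in a single query, which matters for a language without a unified standard.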

Countries
United States, France
Keywords

text base, corpora for less-resourced languages, corpus linguistics, Occitan, Occitan language, language documentation, Occitan language documentation, [SCCO.LING] Cognitive science/Linguistics, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, 410, 400
