Linguatec Tolosa Treebank for Occitan

Linguatec Tolosa Treebank is the first dependency treebank for Occitan, developed as part of the EFA 227/16 LINGUATEC Project, financed by the POCTEFA Interreg European funds. The current version of the treebank contains 13K tokens annotated for PoS tags, lemmas and syntactic dependencies. Linguistic annotation follows the Universal Dependencies guidelines (https://universaldependencies.org/#language-u). A detailed corpus description is provided in the description file.

A subset of the texts was doubly annotated, and these annotations were adjudicated to produce the final annotation. These texts are therefore the best suited for use as test files in NLP experiments.

The corpus files are stored in the CoNLL-U format. Each sentence is preceded by a sentence ID and the original, non-tokenized text of the sentence. The annotation is provided in a column-based format defined as follows:

1. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens.
2. FORM: Word form or punctuation symbol.
3. LEMMA: Lemma or stem of word form.
4. UPOS: Universal part-of-speech tag.
5. XPOS: Language-specific part-of-speech tag; underscore if not available.
6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
7. HEAD: Head of the current word, which is either a value of ID or zero (0).
8. DEPREL: Universal dependency relation to the HEAD.
9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
10. MISC: Any other annotation.

The texts are distributed under the Creative Commons BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
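As an illustration of the ten-column layout described above, the following is a minimal Python sketch of a reader for this kind of file. It is not part of the treebank distribution. The function name parse_conllu, the field names in FIELDS, and the sample sentence (including its sentence ID and its annotation) are invented for this example and are not taken from the corpus. The sketch assumes tab-separated columns, comment lines starting with "#", and a blank line between sentences.

```python
# Column names follow the ten fields listed above (lowercased for use as keys).
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Split CoNLL-U text into sentences.

    Each sentence is a dict with 'comments' (key/value pairs taken from
    '# key = value' lines) and 'tokens' (one dict per annotation line).
    All values are kept as strings, because ID may be a range such as
    '1-2' for multiword tokens.
    """
    sentences = []
    comments, tokens = {}, []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = {}, []
        elif line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:
                comments[key.strip()] = value.strip()
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(dict(zip(FIELDS, cols)))
    if tokens:  # the last sentence may lack a trailing blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


# Hypothetical sample, written for this example only (not from the treebank).
ROWS = [
    ["1", "Lo", "lo", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "gat", "gat", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "dormís", "dormir", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = Lo gat dormís.\n"
    + "\n".join("\t".join(row) for row in ROWS)
    + "\n\n"
)

sentences = parse_conllu(SAMPLE)
# The root of a sentence is the token whose HEAD is zero.
roots = [t["form"] for t in sentences[0]["tokens"] if t["head"] == "0"]
```

Keeping every column as a string avoids losing information from range IDs and underscore placeholders; callers can convert HEAD to an integer where they need it.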