iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP

Keywords: annotated corpus, French corpus, contemporary, Portuguese corpus, complexity graded corpus, Spanish corpus

Research data: Dataset, 25 Jul 2024, French, Publisher: Zenodo

Authors: Pintard, Alice; François, Thomas; Justine, Nagant de Deuxchaisnes; Barbosa, Sílvia; Reis, Maria Leonor; Moutinho, Michell; Monteiro, Ricardo; +7 Authors

doi: 10.5281/zenodo.13127674, 10.5281/zenodo.12821882, 10.5281/zenodo.12821881

Abstract

The iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in xlsx format. These corpora were compiled, classified and annotated under the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills within the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, contributing to reducing skills gaps and facilitating access to information and culture (https://iread4skills.com/).

This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education Training (VET) Centres, as well as to adult students in AL and VET centres. This task was conducted via the Qualtrics platform.

The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a subset of texts was selected for classification and annotation. This classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The texts in each of the language corpora were selected taking into account the diversity of topics/domains, genres, and the reading preferences of the target audience of the iRead4Skills project.
This selection amounted to a total of 462 texts per language, which were divided by level of complexity, resulting in the following distribution:

- 140 Very Easy texts
- 140 Easy texts
- 140 Plain texts
- 42 More Complex texts

Trainers were asked to classify the texts according to the complexity levels of the project, here informally defined as:

- Very Easy: everyone can understand the text or most of the text.
- Easy: a person with less than the 9th year of schooling can understand the text or most of the text.
- Plain: a person with the 9th year of schooling can understand the text the first time he/she reads it.
- More Complex: a person with the 9th year of schooling cannot understand the text the first time he/she reads it.

They were also asked to annotate the parts of the texts considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.), according to the following categories:

Lexical/word-related features
- unknown word
- word too technical/specialized or archaic
- complex derived word
- points to a previous reference that is not obvious
- word (other)

Syntactic/sentence-level features
- unusual word order
- too much embedded secondary information
- too many connectors in the same sentence
- sentence (other)
- other (please specify)

The sets were divided into three parts in Qualtrics and, in each part, the texts were shown to the annotator in random order.

Students were asked to confirm that they could read without difficulty texts adequate to their literacy level. Each set contained texts from a given level, plus one text of the level immediately above.
Students were also asked to annotate words or sequences of words in the text that they did not understand, according to the following categories:

- difficult word
- difficult part of the text

The complete results and datasets are in TSV/Excel format, in pairs of files: one file with the results of the classification (trainers) or validation (students) task, and one file with the results of the annotation task. The complete datasets will be available under a Creative Commons CC BY-NC-ND 4.0 licence.
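As a reading aid, the level distribution and annotation categories described in the abstract can be written down as plain data structures. The following is a minimal Python sketch, not part of the dataset or its tooling: the names `Level`, `TEXTS_PER_LEVEL`, `TRAINER_CATEGORIES`, `STUDENT_CATEGORIES`, `total_texts_per_language` and `next_level` are hypothetical, and the actual column names and file layout of the TSV/Excel files are not assumed here.

```python
from enum import Enum
from typing import Optional


class Level(Enum):
    """Complexity levels named in the abstract, ordered from easiest to hardest."""
    VERY_EASY = "Very Easy"
    EASY = "Easy"
    PLAIN = "Plain"
    MORE_COMPLEX = "More Complex"


# Number of texts per language at each level, as stated in the abstract.
TEXTS_PER_LEVEL = {
    Level.VERY_EASY: 140,
    Level.EASY: 140,
    Level.PLAIN: 140,
    Level.MORE_COMPLEX: 42,
}

# Annotation categories offered to trainers, grouped as in the abstract.
# "other (please specify)" is listed last in the source; treating it as part
# of the syntactic group is an assumption of this sketch.
TRAINER_CATEGORIES = {
    "lexical": [
        "unknown word",
        "word too technical/specialized or archaic",
        "complex derived word",
        "points to a previous reference that is not obvious",
        "word (other)",
    ],
    "syntactic": [
        "unusual word order",
        "too much embedded secondary information",
        "too many connectors in the same sentence",
        "sentence (other)",
        "other (please specify)",
    ],
}

# Annotation categories offered to students.
STUDENT_CATEGORIES = ["difficult word", "difficult part of the text"]


def total_texts_per_language() -> int:
    """Sum the per-level counts for one language."""
    return sum(TEXTS_PER_LEVEL.values())


def next_level(level: Level) -> Optional[Level]:
    """Return the level immediately above, or None for the highest level."""
    order = list(Level)
    index = order.index(level)
    return order[index + 1] if index + 1 < len(order) else None


if __name__ == "__main__":
    print(total_texts_per_language())  # 462
    print(next_level(Level.PLAIN))     # Level.MORE_COMPLEX
```

The sum check reproduces the 462 texts per language given in the abstract, and `next_level` mirrors the rule that each student set also included one text of the level immediately above.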

Related Organizations

- Université Catholique de Louvain (Belgium)
- University of Santiago de Compostela (Spain)
- Autonomous University of Barcelona (Spain)

Impact by BIP!

- citations: 0. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
