ZENODO
Other literature type . 2023
License: CC BY
Data sources: ZENODO
ZENODO
Conference object . 2023
License: CC BY
Data sources: Datacite

This Research product is the result of merged Research products in OpenAIRE.

On treebank development for under-resourced languages with active annotation

Authors: Arampatzakis, Vasileios; Stamou, Vivian; Pavlidis, George; Markantonatou, Stella

Abstract

About half of the world's living languages and dialects, including the Greek ones, are endangered, and language loss is accelerating because of globalization and neocolonialism. In addition, only a few languages are endowed with the resources that would ensure their survival in the AI era. Saving and revitalizing the linguistic heritage of humanity has become important for maintaining global cultural diversity. Natural language processing (NLP) for endangered and under-resourced languages can support their preservation and documentation; documentation is a particular challenge because endangered languages typically lack written resources. Language technologies enable the development of digital archives and linguistic resources by automating technical processes, allowing language documentation practitioners to focus on the linguistic aspects. NLP can also provide insights into the unique linguistic features of endangered languages.

Against this background, the Philotis project has developed a workflow and platform to support the multimodal documentation of living languages. Advanced NLP technology is used to develop spoken and textual corpora (up to the level of a treebank) from raw documentation materials, with a workflow that automates the NLP processes. The Philotis workflow accommodates several documentation scenarios and was tested on Pomak, an endangered oral Slavic language of the Balkans. Several state-of-the-art NLP tools that can be integrated into different NLP pipelines suitable for under-resourced languages were evaluated under this framework.

Developing treebanks is a difficult and extremely time-consuming process. Active learning approaches have been proposed to automate part of the process and reduce the total annotation duration and cost. Crucially, under-resourced languages typically offer (severely) limited amounts of data and few, if any, experts for data development and annotation. A practical active annotation strategy for such a case can be implemented as an online learning approach, in which randomly selected sentences pass through a loop of annotation prediction, manual correction, and model retraining. To implement this approach in a realistic scenario, we used 300 annotated sentences from the Pomak corpus published by Philotis in the Universal Dependencies repository. Using a simple weighted summation of four potential annotation errors (lemma, part of speech, dependency pair, and dependency label), we ran several online annotation experiments, which revealed an underlying optimal strategy. Compared to manual annotation, this strategy reduced the total annotation duration by 54% and the total cost by 63%.
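
The loop described in the abstract (predict, correct manually, retrain) and the weighted error sum can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Token`, `correction_cost`, and `online_annotation`, the uniform weights, the batch size, and the mapping of "dependency pair" to a head index are all assumptions made for illustration, since the abstract does not give those details.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    lemma: str
    upos: str    # part of speech
    head: int    # index of the syntactic head (stands in for the "dependency pair")
    deprel: str  # dependency label


# Hypothetical uniform weights; the abstract does not state the values used.
WEIGHTS = {"lemma": 1.0, "upos": 1.0, "head": 1.0, "deprel": 1.0}


def correction_cost(predicted, gold, weights=WEIGHTS):
    """Weighted sum of per-field mismatches between predicted and corrected tokens."""
    if len(predicted) != len(gold):
        raise ValueError("sentences must have the same number of tokens")
    total = 0.0
    for p, g in zip(predicted, gold):
        for field, weight in weights.items():
            if getattr(p, field) != getattr(g, field):
                total += weight
    return total


def online_annotation(sentences, predict, correct, retrain, batch_size=10, seed=0):
    """Run the predict -> correct -> retrain loop over randomly ordered sentences.

    predict(sentence) returns model-predicted tokens, correct(sentence, predicted)
    returns the manually corrected tokens, and retrain(annotated) updates the model.
    All three are supplied by the caller.
    """
    pool = list(sentences)
    random.Random(seed).shuffle(pool)  # random selection, as in the abstract
    annotated, batch_costs = [], []
    for start in range(0, len(pool), batch_size):
        cost = 0.0
        for sentence in pool[start:start + batch_size]:
            predicted = predict(sentence)
            gold = correct(sentence, predicted)
            cost += correction_cost(predicted, gold)
            annotated.append((sentence, gold))
        batch_costs.append(cost)
        retrain(annotated)  # retrain on everything corrected so far
    return annotated, batch_costs
```

In this sketch the per-batch cost should fall as the model improves, and the batch size and weights are the kind of settings an experiment could vary to look for a cheaper annotation strategy.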
