ZENODO
Other literature type . 2024
License: CC BY
Data sources: ZENODO
ZENODO
Conference object . 2024
License: CC BY
Data sources: Datacite

Corpus annotation and dictionary linking using Wikibase

Authors: Lindemann, David

Abstract

This poster presents a data model and two initial use cases for representing text corpus data on Wikibase instances, including morphosyntactic, semantic and philological annotations as well as links to dictionary entries. Wikibase (cf. Diefenbach et al. 2021), an extension of MediaWiki, is the software that underlies Wikidata (Vrandečić & Krötzsch 2014), an exceptionally large, crowdsourced, queryable knowledge graph. Wikidata includes nodes for ontological concepts, on the one hand, and for lexemes, lexeme senses and lexeme forms, on the other, together with annotations on and relations between them.

The first use case, for which the model was proposed, is the set of documents belonging to the Basque Historical Corpus, although we claim that the model can serve in other contexts, too. That corpus contains literary texts written in Basque before 1900. Today it exists in several versions, stored in separate and mutually incompatible data silos (based on relational databases) and made available through different online front ends. One version displays historical documents in a standardized orthography; a second, based on the first, allows for lemma-based searches; and a third contains morphosyntactic annotations (part of speech, inflected form descriptions, and corresponding lemmata). Some texts are also published elsewhere, sometimes in an electronic format, together with philological annotations. The second use case is an experiment in linking a Serbian literary corpus in NIF format to a Serbian dictionary in Ontolex-Lemon.
Drawing on recent work in the field of Linguistic Linked Open Data, we model a corpus token as a node in a knowledge graph and link it (1) to the respective paragraph (Basque) or token (Serbian) in the source document; (2) to a lexeme node, which is annotated with the standard lemma; (3) to a lexical form associated with that lexeme, which is annotated with the grammatical features describing the form; (4) to a lexical sense associated with that lexeme, which is annotated with a sense gloss; (5) to an ontology concept representing the word sense; and (6) to a text chain containing philological annotations. Furthermore, we represent token spans as separate nodes; these are linked to the tokens they contain and to annotations that apply to the whole span.

We implement and populate the model on our own Wikibase instance hosted on Wikibase Cloud. The core classes and properties that a Wikibase uses by default for describing lexemes follow Ontolex-Lemon (McCrae et al. 2017), the W3C-recommended model for lexical data, so that the created datasets are compatible with the Linguistic Linked Open Data Cloud. We declare the properties that describe corpus tokens as equivalent to their counterparts in NIF, a standard for corpus annotation (Hellmann et al. 2013). We are currently populating the proposed model with tokens from a 1737 Basque manuscript, whose transcription was carried out on Wikisource. We are also adding annotations of the types described above, including philological annotations by Lakarra (1985), as well as direct links to the corresponding paragraph of the manuscript transcription on Wikisource.
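
The six token links and the separate span nodes described above can be pictured as a small graph of subject-property-value triples. The following is only a minimal sketch of that shape in plain Python, not the poster's actual Wikibase schema: the class names, the property labels (such as "lexeme" or "contains_token") and all node identifiers are hypothetical placeholders, and the sample token is illustrative rather than taken from the manuscript.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject node, property label, value)


@dataclass
class Token:
    """One corpus token as a graph node (all labels are hypothetical)."""
    node_id: str
    surface: str                             # token text as it appears in the source
    source_unit: str                         # (1) paragraph or token in the source document
    lexeme: Optional[str] = None             # (2) lexeme node carrying the standard lemma
    form: Optional[str] = None               # (3) lexical form with grammatical features
    sense: Optional[str] = None              # (4) lexical sense with a gloss
    concept: Optional[str] = None            # (5) ontology concept for the word sense
    philological_note: Optional[str] = None  # (6) philological annotation

    def triples(self) -> List[Triple]:
        links = [
            ("surface_text", self.surface),
            ("source_unit", self.source_unit),
            ("lexeme", self.lexeme),
            ("form", self.form),
            ("sense", self.sense),
            ("concept", self.concept),
            ("philological_note", self.philological_note),
        ]
        # Only emit links that have actually been annotated.
        return [(self.node_id, prop, value) for prop, value in links if value is not None]


@dataclass
class Span:
    """A token span as its own node, linked to its tokens and span-level annotations."""
    node_id: str
    tokens: List[Token]
    annotations: Dict[str, str] = field(default_factory=dict)

    def triples(self) -> List[Triple]:
        out = [(self.node_id, "contains_token", t.node_id) for t in self.tokens]
        out += [(self.node_id, prop, value) for prop, value in self.annotations.items()]
        return out


# Illustrative data only: identifiers are invented for this sketch.
token = Token("tok1", "etxea", source_unit="para1", lexeme="L1", form="L1-F1")
span = Span("span1", [token], {"span_note": "example annotation on the whole span"})

for triple in token.triples() + span.triples():
    print(triple)
```

Keeping each link optional mirrors the fact that a token can be partially annotated: a token may already point to its source paragraph and lexeme before a sense or ontology concept has been assigned. On an actual Wikibase, each triple would correspond to a statement on the token's or span's entity.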

Keywords

Corpus annotation, Wikibase, Linked Open Data
