Powered by OpenAIRE graph
Found an issue? Give us feedback
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/ ZENODOarrow_drop_down
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
ZENODO
Presentation . 2019
License: CC BY
Data sources: Datacite
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
ZENODO
Other literature type . 2019
License: CC BY
Data sources: ZENODO
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
ZENODO
Presentation . 2019
License: CC BY
Data sources: Datacite
versions View all 2 versions
addClaim

Compilation, Annotation and Analysis of Written Text Corpora: Introduction to Methods and Tools

Authors: Witt, Andreas;

Compilation, Annotation and Analysis of Written Text Corpora: Introduction to Methods and Tools

Abstract

Corpora, i.e. collections of linguistic data (texts or conversations), are a fundamental asset of digital humanities research. A ubiquitous task for linguists is to explore linguistic questions in corpora: Which forms of the verb *to be* occur in a given text? Is it common that number words are preceded by articles? What types of phrase do occur as direct object in philosophical texts? What kinds of speech acts occur in modern conversations? What words can occur as hesitation markers in spoken modern German? Are texts by women longer (or shorter) than texts by men? To answer such questions, it is necessary to select and prepare data. We will discuss different approaches to compilation and annotation of corpora. Linguistic questions can only be approached if an adequate selection of texts is available; for instance, one will not find such evidence about conversational data in parliament speeches or mathematical papers. Hence, we will first be concerned with criteria and methods for compiling corpora: selecting texts based on extra- and intralinguistic criteria, including property rights. Linguistic data must be described by metadata, so that one can find e.g. utterances by female native speakers of Southern German in the second half of the 20th century about political developments in an informal setting. Approaches to metadata will be explored. It is often useful to annotate linguistic data with respect to: pragmatic structure such as speech acts or rhetorical relations; semantic elements, e.g. named entities or points in time; syntactic structures, e.g. dependencies or phrases; or morphological relations, e.g. by assigning parts of speech (*verb*, *noun*) or lemmatizing (reducing *lemmatizing* to *lemmatize*). Some of these annotations can be carried out (partly) automatically. We will discuss what tools exist and are available. The course will be mainly concerned with textual corpora, but as searching on speech or multimodal corpora is generally carried out on the transcription and annotation layers, it will also be useful to researchers dealing with such data.

First held at the European Summer University of Cultures and Technology 2019, University of Leipzig

Keywords

Corpus analysis, Corpus linguistics, Corpus annotation

  • BIP!
    Impact byBIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    0
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average
    OpenAIRE UsageCounts
    Usage byUsageCounts
    visibility views 7
    download downloads 14
  • 7
    views
    14
    downloads
    Powered byOpenAIRE UsageCounts
Powered by OpenAIRE graph
Found an issue? Give us feedback
visibility
download
selected citations
These citations are derived from selected sources.
This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Citations provided by BIP!
popularity
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
BIP!Popularity provided by BIP!
influence
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Influence provided by BIP!
impulse
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
BIP!Impulse provided by BIP!
views
OpenAIRE UsageCountsViews provided by UsageCounts
downloads
OpenAIRE UsageCountsDownloads provided by UsageCounts
0
Average
Average
Average
7
14
Green