6 Research products, page 1 of 1

  • Publications
  • 2017-2021
  • Open Access
  • DE
  • CLARIN
  • Digital Humanities and Cultural Heritage

  • Open Access
    Authors: 
    Matthias Huck; Aleš Tamchyna; Ondrej Bojar; Alexander Fraser;
    Publisher: Association for Computational Linguistics
    Country: Czech Republic
    Project: EC | QT21 (645452), EC | HimL (644402), EC | DASMT (640550)

    Translating into morphologically rich languages is difficult. Although the coverage of lemmas may be reasonable, many morphological variants cannot be learned from the training data. We present a statistical translation system that is able to produce these inflected word forms. Different from most previous work, we do not separate morphological prediction from lexical choice into two consecutive steps. Our approach is novel in that it is integrated in decoding and takes advantage of context information from both the source language and the target language sides.

  • Open Access
    Authors: 
    Jan Auracher; Mathias Scharinger; Winfried Menninghaus;
    Country: Germany

    We tested the hypothesis that phonosemantic iconicity--i.e., a motivated resonance of sound and meaning--might not only be found on the level of individual words or entire texts, but also in word combinations such that the meaning of a target word is iconically expressed, or highlighted, in the phonetic properties of its immediate verbal context. To this end, we extracted single lines from German poems that all include a word designating high or low dominance, such as large or small, strong or weak, etc. Based on insights from previous studies, we expected to find more vowels with a relatively short distance between the first two formants (low formant dispersion) in the immediate context of words expressing high physical or social dominance than in the context of words expressing low dominance. Our findings support this hypothesis, suggesting that neighboring words can form iconic dyads in which the meaning of one word is sound-iconically reflected in the phonetic properties of adjacent words. The construct of a contiguity-based phonosemantic iconicity opens many avenues for future research well beyond lines extracted from poems.

  • Publication . Preprint . Article . 2020
    Open Access English
    Authors: 
    Kocmi, Tom; Limisiewicz, Tomasz; Stanovsky, Gabriel;
    Project: EC | Bergamot (825303)

    Gender bias in machine translation can manifest when choosing gender inflections based on spurious gender correlations, for example by always translating doctors as men and nurses as women. This can be particularly harmful as models become more popular and are deployed within commercial systems. Our work presents the largest evidence for the phenomenon in more than 19 systems submitted to the WMT over four diverse target languages: Czech, German, Polish, and Russian. To achieve this, we use WinoMT, a recent automatic test suite which examines gender coreference and bias when translating from English to languages with grammatical gender. We extend WinoMT to handle two new languages tested in WMT: Polish and Czech. We find that all systems consistently use spurious correlations in the data rather than meaningful contextual information. Accepted at WMT20.

  • Open Access English
    Authors: 
    Tiepmar, Jochen;
    Country: Germany

    One of the defining factors of modern societies is the ongoing digitization of information, resources and in many ways even life itself. This trend is also reflected in today's research environments and heavily influences the direction in which academic and industrial projects are headed. It is nearly impossible to set up a modern project without including digital aspects, and many projects are even set up for the sole purpose of digitizing a specific part of the world. One of the side effects of this trend is the emergence of new research fields at the intersection points between the analog world -- represented for example by the humanities -- and the digital world -- represented for example by computer science. One such set of research fields is the digital humanities, the area of interest for this work. In the process of this development, complex research questions, techniques, and principles that were developed independently of one another are brought together. A lot of work has to go into defining communication between these concepts to prevent misunderstandings and misconceptions on both sides. This bridge-building process is one of the major tasks of the newly emerging research fields. This work proposes such a bridge for the text-oriented digital humanities, based on a digital text reference system that was previously specified in the humanities and is in this work reinterpreted as a data communication protocol for computer science: the Canonical Text Service (CTS) protocol.

  • Publication . Article . Preprint . Conference object . 2019
    Open Access English
    Authors: 
    Dan Kondratyuk; Milan Straka;

    We present UDify, a multilingual multi-task model capable of accurately predicting universal part-of-speech, morphological features, lemmas, and dependency trees simultaneously for all 124 Universal Dependencies treebanks across 75 languages. By leveraging a multilingual BERT self-attention model pretrained on 104 languages, we found that fine-tuning it on all datasets concatenated together with simple softmax classifiers for each UD task can result in state-of-the-art UPOS, UFeats, Lemmas, UAS, and LAS scores, without requiring any recurrent or language-specific components. We evaluate UDify for multilingual learning, showing that low-resource languages benefit the most from cross-linguistic annotations. We also evaluate for zero-shot learning, with results suggesting that multilingual training provides strong UD predictions even for languages that neither UDify nor BERT have ever been trained on. Code for UDify is available at https://github.com/hyperparticle/udify. Accepted for publication at EMNLP 2019. 17 pages, 6 figures

  • Publication . Book . 2017
    Open Access German
    Authors: 
    Helbig, Kerstin; Fromm, Niels; Riesenweber, Christina; Schlegel, Birgit; Schobert, Dagmar; Voigt, Michaela; Winterhalter, Christian;
    Publisher: Humboldt-Universität zu Berlin, Universitätsbibliothek der Humboldt-Universität
    Country: Germany

    In late summer 2016, the open access teams of the Freie Universität, the Humboldt-Universität and the Technische Universität Berlin began planning for the international Open Access Week 2016. In a call for posters, open access projects from Berlin and Brandenburg were invited to present their activities in an exhibition. The publication documents the poster exhibition and panel discussion held during Open Access Week 2016. It contains 30 posters with descriptions and links to the original versions in print quality, supplemented by photos of an evening event at Wikimedia Deutschland. Not Reviewed
