Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification

Article English OPEN
Babych, B (2016)
  • Publisher: University of Latvia

This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings and applies it to the task of automated cognate identification from non-parallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries beyond the coverage of traditionally used aligned parallel corpora: cognate identification can be used to find translation equivalents for the ‘long tail’ of the Zipfian distribution, i.e. low-frequency and usually unambiguous lexical items in closely related languages (many of which are under-resourced). Graphonological Levenshtein edit distance relies on editing hierarchical representations of phonological features for graphemes (graphonological representations) and improves on the phonological edit distance proposed for measuring dialectological variation. Graphonological edit distance works directly with character strings and does not require an intermediate stage of phonological transcription, exploiting the advantages of the historical and morphological principles of orthography, which are obscured if only the phonetic principle is applied. Difficulties associated with plain feature representations (unstructured feature sets or vectors) are addressed by using a linguistically motivated feature hierarchy that restricts matching of lower-level graphonological features when higher-level features are not matched. The paper presents an evaluation of the graphonological edit distance in comparison with the traditional Levenshtein edit distance from the perspective of its usefulness for the task of automated cognate identification. It discusses the advantages of the proposed method, which can be used for morphology induction, for robust transliteration across different alphabets (Latin, Cyrillic, Arabic, etc.) and for robust identification of words with non-standard or distorted spelling, e.g. in user-generated content on the web such as social media posts, blogs and comments.
Software for calculating the modified feature-based Levenshtein distance and the corresponding graphonological feature representations (vectors and hierarchies of graphemes’ features) are released on the author’s webpage: http://corpus.leeds.ac.uk/bogdan/phonologylevenshtein/. Features are currently available for the Latin and Cyrillic alphabets and will be extended to other alphabets and languages.
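
The idea of a feature-based substitution cost with a hierarchy restriction can be illustrated with a minimal sketch. This is not the released software: the feature table, the cost formula (share of mismatched lower-level features) and the names GRAPHEME_FEATURES, substitution_cost and feature_levenshtein are all assumptions made for illustration. Lower-level features are compared only when the top-level feature (vowel vs. consonant) matches, and Cyrillic graphemes reuse the feature bundles of their assumed Latin counterparts.

```python
from typing import Callable, Dict, Tuple

Features = Tuple[str, Dict[str, str]]  # (top-level type, lower-level features)

# Hypothetical, heavily simplified feature table (illustration only).
GRAPHEME_FEATURES: Dict[str, Features] = {
    "p": ("consonant", {"manner": "plosive", "place": "labial", "voice": "voiceless"}),
    "b": ("consonant", {"manner": "plosive", "place": "labial", "voice": "voiced"}),
    "t": ("consonant", {"manner": "plosive", "place": "alveolar", "voice": "voiceless"}),
    "d": ("consonant", {"manner": "plosive", "place": "alveolar", "voice": "voiced"}),
    "r": ("consonant", {"manner": "trill", "place": "alveolar", "voice": "voiced"}),
    "a": ("vowel", {"height": "open", "backness": "central"}),
    "o": ("vowel", {"height": "mid", "backness": "back"}),
}

# Assumed Cyrillic-to-Latin correspondences, written as escapes to avoid
# confusion between look-alike letters: be, er, a, te, de, pe, o.
CYRILLIC_TO_LATIN = {
    "\u0431": "b", "\u0440": "r", "\u0430": "a", "\u0442": "t",
    "\u0434": "d", "\u043f": "p", "\u043e": "o",
}
for cyr, lat in CYRILLIC_TO_LATIN.items():
    GRAPHEME_FEATURES[cyr] = GRAPHEME_FEATURES[lat]


def substitution_cost(a: str, b: str) -> float:
    """Cost in [0, 1]: share of mismatched lower-level features."""
    if a == b:
        return 0.0
    fa, fb = GRAPHEME_FEATURES.get(a), GRAPHEME_FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # unknown grapheme: fall back to the plain Levenshtein cost
    if fa[0] != fb[0]:
        return 1.0  # top-level mismatch: lower-level features are not compared
    mismatches = sum(1 for key in fa[1] if fa[1][key] != fb[1].get(key))
    return mismatches / len(fa[1])


def plain_cost(a: str, b: str) -> float:
    """Substitution cost of the traditional Levenshtein distance."""
    return 0.0 if a == b else 1.0


def feature_levenshtein(
    s: str, t: str, sub_cost: Callable[[str, str], float] = substitution_cost
) -> float:
    """Levenshtein distance with unit insert/delete cost and a pluggable substitution cost."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,                   # deletion
                           cur[j - 1] + 1.0,                # insertion
                           prev[j - 1] + sub_cost(a, b)))   # substitution
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(feature_levenshtein("bad", "pat"))              # feature-based cost
    print(feature_levenshtein("bad", "pat", plain_cost))  # traditional cost
    print(feature_levenshtein("brat", "\u0431\u0440\u0430\u0442"))  # Latin vs Cyrillic
```

In this sketch, strings that differ only in voicing end up closer than under the traditional distance, and a Latin string compared with its Cyrillic counterpart has distance zero because the graphemes share feature bundles. A real feature hierarchy would have more levels and language-specific grapheme values.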