Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

Contribution for newspaper or weekly magazine, Preprint; English; Open Access
Bollmann, Marcel; Søgaard, Anders
(2016)
  • Subject: Computer Science - Computation and Language

Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network a...

  • Related Research Results (1)
    Software: norma on GitHub (inferred by OpenAIRE, 72%)