
There are several approaches to improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora for related language pairs can be used via parameter sharing or transfer learning in multilingual models; and subword segmentation and regularization techniques can be applied to ensure high vocabulary coverage. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the two target languages are related, one being very low-resource and the other higher-resource. We test various methods on three artificially restricted translation tasks -- English to Estonian (low-resource) and Finnish (high-resource), English to Slovak and Czech, English to Danish and Swedish -- and one real-world task, Norwegian to North Sámi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, the denoising autoencoder, and subword sampling.
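As a brief illustration of the subword sampling technique mentioned in the abstract (a generic sketch, not the authors' exact setup), the snippet below uses a SentencePiece unigram model to draw a different segmentation of the same target sentence on each call, which is the regularization effect referred to above. The model file name and the example sentence are placeholders.

```python
# Minimal sketch of subword sampling with a SentencePiece unigram model.
# Assumes a model has already been trained, e.g. with
#   spm.SentencePieceTrainer.train(input="target_corpus.txt",
#                                  model_prefix="unigram",
#                                  vocab_size=8000,
#                                  model_type="unigram")
# "unigram.model" and the example sentence are placeholders.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="unigram.model")

sentence = "ja kielitaito paranee harjoittelemalla"

# Deterministic (1-best) segmentation, as used at inference time.
print(sp.encode(sentence, out_type=str))

# Sampled segmentations for training-time regularization:
# nbest_size=-1 samples from the full lattice, alpha controls how
# sharply the sampling distribution follows the unigram model scores.
for _ in range(3):
    print(sp.encode(sentence, out_type=str,
                    enable_sampling=True, nbest_size=-1, alpha=0.1))
```

Drawing a fresh segmentation for each training example exposes the model to many consistent subword decompositions of the same words, which is particularly helpful when the low-resource target language contributes little data.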
24 pages, 12 tables, 7 figures. Accepted (Nov 2020) for publication in the Machine Translation journal Special Issue on Machine Translation for Low-Resource Languages (Springer)
FOS: Computer and information sciences, Computer Science - Computation and Language, Multilingual machine translation, Low-resource languages, Transfer learning, Denoising sequence autoencoder, Multi-task learning, Subword segmentation, Languages, Computation and Language (cs.CL)
| Indicator | Description | Value |
| --- | --- | --- |
| Selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 4 |
| Popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% |
| Influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| Impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
