Preprint · Conference object · 2019

Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade

Pino, Juan; Puzon, Liezl; Gu, Jiatao; Ma, Xutai; McCarthy, Arya D.; Gopinath, Deepak
Open Access · English · Published: 02 Nov 2019
Abstract
For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then translate with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and pretraining approaches for AST, comparing all on the same datasets. Simple data augmentation by translating ASR transcripts proves most effective on the English–French augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU, compared to a ver...
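As a minimal sketch of the transcript-translation augmentation the abstract describes: machine-translate the English transcripts of an ASR corpus to mint synthetic (audio, French translation) pairs, which are then mixed into the much smaller gold AST training set. The MT model here (a public Marian en-fr checkpoint loaded through HuggingFace transformers) and the (audio_path, transcript) corpus format are illustrative assumptions, not the MT system or data pipeline used in the paper.

```python
# Sketch of MT-based augmentation for end-to-end AST: translate the
# transcripts of an (audio, transcript) ASR corpus to obtain synthetic
# (audio, translation) training pairs.
from transformers import MarianMTModel, MarianTokenizer

# Assumption: any reasonable en->fr MT model will do; this public Marian
# checkpoint is an illustrative stand-in for the paper's own MT system.
MODEL_NAME = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate_batch(transcripts):
    """Translate a list of English transcripts into French."""
    batch = tokenizer(transcripts, return_tensors="pt",
                      padding=True, truncation=True)
    outputs = model.generate(**batch)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def _flush(buf):
    # Translate one buffered batch and pair each hypothesis
    # back with its source audio file.
    translations = translate_batch([text for _, text in buf])
    for (audio_path, _), hyp in zip(buf, translations):
        yield audio_path, hyp

def augment(asr_corpus, batch_size=32):
    """asr_corpus: iterable of (audio_path, transcript) pairs, e.g.
    LibriSpeech. Yields (audio_path, synthetic_translation) pairs to be
    mixed into the gold AST training data."""
    buf = []
    for pair in asr_corpus:
        buf.append(pair)
        if len(buf) == batch_size:
            yield from _flush(buf)
            buf = []
    if buf:
        yield from _flush(buf)

if __name__ == "__main__":
    toy_corpus = [("clip-0001.flac",
                   "the quick brown fox jumps over the lazy dog")]
    for audio_path, translation in augment(toy_corpus):
        # One synthetic AST training example per line: audio<TAB>target text.
        print(audio_path, translation, sep="\t")
```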
Subjects
free text keywords: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing