Back Translation for Speech-to-text Translation Without Transcripts

Name: Back Translation for Speech-to-text Translation Without Transcripts
Keywords: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), I.2.7, FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

Qingkai Fang; Yang Feng 0004

arXiv.org e-Print Archive

Preprint . 2023

Data sources: arXiv.org e-Print Archive

https://doi.org/10.18653/v1/20...

Article . 2023 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2023

License: CC BY NC ND

Data sources: Datacite

DBLP

Conference object

Data sources: DBLP

DBLP

Article

Data sources: DBLP

Publication type: Article, Preprint, Conference object
Date: 01 Jan 2023
Embargo end date: 01 Jan 2023
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Authors: Qingkai Fang; Yang Feng 0004

doi: 10.18653/v1/2023.acl-long.251 , 10.48550/arxiv.2305.08709

arXiv: 2305.08709

Abstract

The success of end-to-end speech-to-text translation (ST) is often achieved by utilizing source transcripts, e.g., by pre-training with automatic speech recognition (ASR) and machine translation (MT) tasks, or by introducing additional ASR and MT data. Unfortunately, transcripts are only sometimes available since numerous unwritten languages exist worldwide. In this paper, we aim to utilize large amounts of target-side monolingual data to enhance ST without transcripts. Motivated by the remarkable success of back translation in MT, we develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data. To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units and achieve back translation by cascading a target-to-unit model and a unit-to-speech model. With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets. More experiments show that our method is especially effective in low-resource scenarios.
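The cascade described in the abstract (target text to discrete units, then units to speech) can be illustrated with a minimal sketch. It is not the authors' implementation: `back_translate`, `PseudoSTPair`, and the two toy functions are hypothetical names introduced here, and the toy functions merely stand in for the trained target-to-unit and unit-to-speech models so that the data flow is runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Units = List[int]        # sequence of discrete self-supervised unit IDs
Waveform = List[float]   # synthetic source-language audio samples


@dataclass
class PseudoSTPair:
    """One synthetic ST training example: (source speech, target text)."""
    source_speech: Waveform
    target_text: str


def back_translate(
    target_sentences: Sequence[str],
    text_to_units: Callable[[str], Units],
    units_to_speech: Callable[[Units], Waveform],
) -> List[PseudoSTPair]:
    """Cascade the two models to turn monolingual target text into pseudo ST data."""
    pairs = []
    for sentence in target_sentences:
        units = text_to_units(sentence)    # stage 1: target text -> source units
        speech = units_to_speech(units)    # stage 2: source units -> source speech
        pairs.append(PseudoSTPair(source_speech=speech, target_text=sentence))
    return pairs


# Hypothetical stand-ins for the trained models (assumptions, not real models).
def toy_text_to_units(text: str) -> Units:
    # Map each character to a fake unit ID in [0, 100).
    return [ord(ch) % 100 for ch in text]


def toy_units_to_speech(units: Units) -> Waveform:
    # Emit two fake audio samples per unit.
    wave: Waveform = []
    for unit in units:
        wave.extend([unit / 100.0, unit / 100.0])
    return wave


pairs = back_translate(
    ["Hallo Welt", "Guten Morgen"], toy_text_to_units, toy_units_to_speech
)
```

In a real setting the resulting pairs would be mixed with the genuine ST data for training. Passing the two stages as callables keeps the cascade independent of any particular model toolkit.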

ACL 2023 main conference

Related Organizations

University of Chinese Academy of Sciences
China (People's Republic of)
Chinese Academy of Sciences
China (People's Republic of)
Institute Of Computing Technology
China (People's Republic of)
INSTITUTE OF COMPUTING TECHNOLOGY,CHINESE ACADEMY OF SCIENCES
China (People's Republic of)

6 Research products, page 1 of 1

sentencepiece software on GitHub
IsRelatedTo
whisper software on GitHub
IsRelatedTo
dvector software on GitHub
IsRelatedTo
fairseq software on GitHub
IsRelatedTo
fairseq software on GitHub
IsRelatedTo
sacrebleu software on GitHub
IsRelatedTo

Impact by BIP!

selected citations: 4. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
