SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

Publication: Article, Preprint
Date: 01 Jan 2022
Embargo end date: 01 Jan 2022
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Authors: Zhang, Ziqiang; Zhou, Long; Ao, Junyi; Liu, Shujie; Dai, Lirong; Li, Jinyu; Wei, Furu

doi: 10.18653/v1/2022.emnlp-main.108, 10.48550/arxiv.2210.03730

arXiv: 2210.03730

Abstract

The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden units as an interface to align speech and text, we can decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and unpaired text data, respectively. SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT achieves substantial improvements over strong baselines and state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. Detailed analyses are conducted to better understand the proposed SpeechUT. The code and pre-trained models are available at https://aka.ms/SpeechUT.
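The decomposition described in the abstract can be pictured as function composition: a speech-to-unit stage followed by a unit-to-text stage, with discrete units as the only interface between them. The sketch below is a toy illustration only, not the SpeechUT implementation. The centroid list, the nearest-centroid quantizer, the lookup table, and all function names are hypothetical stand-ins chosen for the example; the abstract does not say how units are derived or how either stage is modeled.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical unit inventory: three 1-D centroids (assumption for illustration).
CENTROIDS: List[float] = [0.0, 1.0, 2.0]

# Hypothetical unit-to-character table (assumption for illustration).
UNIT_TO_CHAR: Dict[int, str] = {0: "a", 1: "b", 2: "c"}


def speech_to_unit(frames: Sequence[float]) -> List[int]:
    """Toy speech-to-unit stage: map each frame to its nearest centroid index."""
    return [
        min(range(len(CENTROIDS)), key=lambda k: abs(frame - CENTROIDS[k]))
        for frame in frames
    ]


def unit_to_text(units: Sequence[int]) -> str:
    """Toy unit-to-text stage: look up one character per unit."""
    return "".join(UNIT_TO_CHAR[u] for u in units)


def compose(
    s2u: Callable[[Sequence[float]], List[int]],
    u2t: Callable[[Sequence[int]], str],
) -> Callable[[Sequence[float]], str]:
    """Build a speech-to-text function from the two stages."""
    def speech_to_text(frames: Sequence[float]) -> str:
        return u2t(s2u(frames))
    return speech_to_text


speech_to_text = compose(speech_to_unit, unit_to_text)

if __name__ == "__main__":
    frames = [0.1, 0.2, 0.9, 1.1, 2.2]
    print(speech_to_unit(frames))
    print(speech_to_text(frames))
```

Because the two stages communicate only through unit sequences, each can in principle be developed against its own data: the first needs speech and units, the second needs units and text. That separation is the point the abstract makes about using unpaired speech and unpaired text.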

14 pages, accepted by EMNLP 2022

Related Organizations

University of Science and Technology of China, China (People's Republic of)
Chinese University of Hong Kong, China (People's Republic of)
Chinese University of Hong Kong, Shenzhen, China (People's Republic of)
Microsoft (United States), United States
The Chinese University of Hong Kong, Hong Kong

Keywords

FOS: Computer and information sciences, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing

4 Research products, page 1 of 1

Stimulus configuration, classical conditioning, and hippocampal function. (1992, IsAmongTopNSimilarDocuments)
A Framework for pre-training hidden-unit conditional random fields and its extension to long short term memory networks (2017, IsAmongTopNSimilarDocuments)
Contribution Analysis: A Technique for Assigning Responsibilities to Hidden Units in Connectionist Networks (1989, IsAmongTopNSimilarDocuments)
SpeechT5 software on GitHub (IsRelatedTo)

Impact by BIP!

selected citations: 23. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Access route: Green

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
