Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

Keywords: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

Peikun Chen; Sining Sun; Changhao Shan; Qing Yang; Lei Xie


Data sources

arXiv.org e-Print Archive: Preprint, 2024
Crossref: Article, 2024, peer-reviewed (https://doi.org/10.21437/inter...)
Datacite: Article, 2024, license CC BY (https://dx.doi.org/10.48550/ar...)
DBLP: Article
DBLP: Conference object


Publication: Article, Preprint, Conference object, 01 Sep 2024
Embargo end date: 01 Jan 2024
Publisher: ISCA
Journal: Interspeech 2024


DOI: 10.21437/interspeech.2024-1853, 10.48550/arxiv.2406.18862

arXiv: 2406.18862


Abstract

Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. We therefore introduce a decoder-only model designed exclusively for streaming recognition, which incorporates a dedicated boundary token to facilitate streaming recognition and employs causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. Experiments on the AISHELL-1 and AISHELL-2 datasets demonstrate that our streaming approach performs competitively with non-streaming decoder-only counterparts.

Accepted for Interspeech 2024
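
The abstract names causal attention masking and right-chunk attention but does not define them in detail. As a rough illustration only, the sketch below builds a generic chunk-based lookahead mask of the kind often used in streaming ASR: each position sees everything up to the end of its own chunk plus a fixed number of following chunks. The function name chunk_lookahead_mask and the parameters chunk_size and right_chunks are hypothetical and are not taken from the paper; the authors' actual masking scheme may differ.

```python
def chunk_lookahead_mask(seq_len, chunk_size=1, right_chunks=0):
    """Build a boolean attention mask (illustrative, not the paper's definition).

    mask[i][j] is True when position i may attend to position j.
    Positions are grouped into chunks of `chunk_size`; position i sees every
    position up to the end of its own chunk plus `right_chunks` later chunks.
    With chunk_size=1 and right_chunks=0 this is the usual causal mask.
    """
    if seq_len < 0 or chunk_size < 1 or right_chunks < 0:
        raise ValueError("seq_len >= 0, chunk_size >= 1, right_chunks >= 0 required")
    mask = []
    for i in range(seq_len):
        # Exclusive end of the visible span: end of i's chunk plus the lookahead.
        visible_end = min(seq_len, (i // chunk_size + 1 + right_chunks) * chunk_size)
        mask.append([j < visible_end for j in range(seq_len)])
    return mask


if __name__ == "__main__":
    # Print a small mask as rows of 1/0 for inspection.
    for row in chunk_lookahead_mask(6, chunk_size=2, right_chunks=1):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, a larger right_chunks value gives each position more future context at the cost of higher latency, which is the general trade-off a streaming recognizer has to balance.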


Related research (1): wenet software on GitHub (relation: IsRelatedTo)

Impact by BIP!

selected citations: 0 (derived from selected sources; an alternative to the "Influence" indicator, which also reflects the overall impact of an article in the research community at large, based on the underlying citation network, diachronically)
popularity: Average (the "current" impact or attention of an article in the research community at large, based on the underlying citation network)
influence: Average (the overall impact of an article in the research community at large, based on the underlying citation network, diachronically)
impulse: Average (the initial momentum of an article directly after its publication, based on the underlying citation network)


Open access route: Green
