Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

Publication type: Article, Preprint; 18 Sep 2022; Embargo end date: 01 Jan 2021; Publisher: ISCA; Journal: Interspeech 2022; Funded by: EC | HAAWAII, EC | ATCO2

Authors: Kocour, Martin; Žmolíková, Kateřina; Ondel, Lucas; Švec, Ján; Delcroix, Marc; Ochiai, Tsubasa; Burget, Lukáš; +1 Authors

doi: 10.21437/interspeech.2022-10406, 10.48550/arxiv.2111.00009

arXiv: 2111.00009

Abstract

In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder, which is applied to each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers. We employ a joint decoder that can make use of this uncertainty together with higher-level language information. For this, we revisit decoding algorithms used in factorial generative models in early multi-talker speech recognition systems. In contrast with these early works, we replace the GMM acoustic model with a DNN, which provides greater modeling power and simplifies part of the inference. We demonstrate the advantage of joint decoding in proof-of-concept experiments on a mixed-TIDIGITS dataset.

submitted to Interspeech 2022
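
The idea of decoding all speakers jointly can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes two speakers, each modelled by a two-state HMM, and uses log joint posteriors directly as frame scores (a hybrid system would typically convert posteriors to scaled likelihoods first). The function names and all numbers are hypothetical. Joint decoding is done here as ordinary Viterbi over the product of the two state spaces.

```python
import numpy as np


def viterbi(log_init, log_trans, log_obs):
    """Single-chain Viterbi; log_obs has shape (T, S)."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # axes: (previous, current)
        back[t] = scores.argmax(axis=0)       # best predecessor per state
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]


def joint_viterbi(log_init_a, log_init_b, log_trans_a, log_trans_b, log_joint):
    """Viterbi over the product state space; log_joint has shape (T, Sa, Sb)."""
    T, Sa, Sb = log_joint.shape
    # Product-chain state k = i * Sb + j; the two speakers' transitions
    # are assumed independent, so their log-probabilities add.
    log_init = (log_init_a[:, None] + log_init_b[None, :]).reshape(Sa * Sb)
    log_trans = (
        log_trans_a[:, None, :, None] + log_trans_b[None, :, None, :]
    ).reshape(Sa * Sb, Sa * Sb)
    path = viterbi(log_init, log_trans, log_joint.reshape(T, Sa * Sb))
    return [divmod(k, Sb) for k in path]      # back to (state_a, state_b)


# Made-up transition model: each speaker tends to stay in its state.
sticky = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
uniform = np.log(np.array([0.5, 0.5]))

# Made-up joint posteriors p(state_a, state_b | frame), shape (3, 2, 2).
# In the middle frame the model is sure the speakers are in different
# states but unsure which speaker is in which state.
joint = np.array([
    [[0.01, 0.97], [0.01, 0.01]],
    [[0.01, 0.49], [0.49, 0.01]],
    [[0.01, 0.97], [0.01, 0.01]],
])

path = joint_viterbi(uniform, uniform, sticky, sticky, np.log(joint))
print(path)

# Per-speaker marginals of the middle frame are uninformative (about 0.5
# each), so the "different states" constraint is lost if each speaker is
# decoded from its own marginal stream.
print(joint[1].sum(axis=1), joint[1].sum(axis=0))
```

In this toy case the joint decoder resolves the ambiguous middle frame using the neighbouring frames and the transition model, while still respecting the constraint encoded in the joint posterior. The product state space grows multiplicatively with the number of speakers and states, so this brute-force construction is only meant for illustration.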

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)

Related research product: TIDIGITS_mix software on GitHub (relation: IsRelatedTo)

Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.