Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

Publication: Article, Preprint, 01 May 2018

Embargo end date: 01 Jan 2016

Country: France

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 40, pages 1,086-1,099 (issn: 0162-8828, eissn: 2160-9292)

Funded by: EC | VHIA

Authors: Gebru, Israel; Ba, Sileye; Li, Xiaofei; Horaud, Radu

doi: 10.1109/tpami.2017.2648793, 10.48550/arxiv.1603.09725

pmid: 28103192

arXiv: 1603.09725

Abstract

Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved with a novel audio-visual fusion method that proceeds in three steps: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it handles, in a principled way, speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each time slice, and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset is introduced that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue. The proposed method is thoroughly tested and benchmarked against several state-of-the-art diarization algorithms.
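The temporal inference mentioned in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it assumes that at most one person speaks per frame (the paper explicitly handles simultaneous speakers), it uses made-up per-frame association likelihoods, and the function names and the `stay_prob` transition parameter are hypothetical. Under those assumptions it shows exact forward filtering over a discrete diarization state, in plain Python.

```python
def forward_filter(likelihoods, stay_prob=0.9):
    """Exact forward filtering for a discrete-state temporal chain.

    likelihoods[t][k] is an assumed score p(observation_t | state k),
    where state 0 means "nobody speaks" and state k >= 1 means
    "person k speaks". Returns p(state_t | observations up to t) per frame.
    """
    n_states = len(likelihoods[0])
    # Assumed dynamics: keep the current state with probability stay_prob,
    # otherwise switch uniformly to one of the other states.
    switch_prob = (1.0 - stay_prob) / (n_states - 1)
    belief = [1.0 / n_states] * n_states  # uniform prior over states
    posteriors = []
    for frame in likelihoods:
        total = sum(belief)
        # Prediction step: propagate the belief through the transition model.
        predicted = [
            stay_prob * belief[k] + switch_prob * (total - belief[k])
            for k in range(n_states)
        ]
        # Update step: weight by the per-frame likelihoods, then normalize.
        unnormalized = [predicted[k] * frame[k] for k in range(n_states)]
        norm = sum(unnormalized)
        belief = [u / norm for u in unnormalized]
        posteriors.append(belief)
    return posteriors


def speech_turns(posteriors):
    """Pick the most probable state index for each frame."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]


# Toy input: five frames, two persons plus a "silence" state (index 0).
# These numbers are invented for illustration only.
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],  # ambiguous frame
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]
posteriors = forward_filter(frames)
labels = speech_turns(posteriors)
print(labels)
```

With these toy numbers the ambiguous third frame stays assigned to person 1, because the transition prior favors continuing the current speech turn; the label switches to person 2 only once the evidence becomes clear.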

14 pages, 6 figures, 5 tables

Country

France

Related Organizations

French Institute for Research in Computer Science and Automation, France
Grenoble INP - UGA, France
Institute of Electrical and Electronics Engineers, United States
Inria Grenoble Rhône-Alpes, France
Sciences Po, France

Keywords

FOS: Computer and information sciences, Sound (cs.SD), audio-visual tracking, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, [INFO.INFO-CV] Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], dynamic Bayesian network, [INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD], Computer Science - Sound, 004, 620, sound source localization, speaker diarization

Impact by BIP!

- Selected citations: 62. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.