Name: Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models
Keywords: [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], FOS: Computer and information sciences, Sound (cs.SD), Speech Representations, [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG], Speaker Recognition, [INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD], Machine Learning (cs.LG), Machine Learning, Self-Supervised Learning

Publication: Article, Preprint, Conference object
Date: 01 Sep 2024
Embargo end date: 01 Jan 2024
Publisher: ISCA
Journal: Interspeech 2024
Funded by: ANR | APATE

Authors: Miara, Victor; Lepage, Theo; Dehak, Reda

doi: 10.21437/interspeech.2024-486, 10.48550/arxiv.2406.02285

arXiv: http://arxiv.org/abs/2406.02285

Abstract

Recent advancements in Self-Supervised Learning (SSL) have shown promising results in Speaker Verification (SV). However, narrowing the performance gap with supervised systems remains an ongoing challenge. Several studies have observed that speech representations from large-scale ASR models contain valuable speaker information. This work explores the limitations of fine-tuning these models for SV using an SSL contrastive objective in an end-to-end approach. Then, we propose a framework to learn speaker representations in an SSL context by fine-tuning a pre-trained WavLM with a supervised loss using pseudo-labels. Initial pseudo-labels are derived from an SSL DINO-based model and are iteratively refined by clustering the model embeddings. Our method achieves 0.99% EER on VoxCeleb1-O, establishing the new state-of-the-art on self-supervised SV. As this performance is close to our supervised baseline of 0.94% EER, this contribution is a step towards supervised performance on SV with SSL.
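
The iterative pseudo-labeling idea described in the abstract can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration and not the authors' implementation: `kmeans` is a tiny k-means used as the clustering step (the abstract does not name the clustering algorithm), `toy_finetune` is a nearest-centroid classifier standing in for fine-tuning WavLM with a supervised loss, `initial_embed` stands in for the DINO-based model, and the "utterances" are plain feature vectors.

```python
from typing import Callable, List

Vector = List[float]
EmbedFn = Callable[[Vector], Vector]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, n_iter: int = 10) -> List[int]:
    """Tiny k-means seeded with the first k points; returns one cluster id per point."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def toy_finetune(utterances: List[Vector], labels: List[int], k: int) -> EmbedFn:
    """Hypothetical stand-in for supervised fine-tuning on pseudo-labels.

    Fits a nearest-centroid classifier and returns a function that maps an
    input to its per-class scores, used here as the new embedding.
    """
    dim = len(utterances[0])
    centroids = []
    for c in range(k):
        members = [u for u, l in zip(utterances, labels) if l == c]
        centroids.append(
            [sum(col) / len(members) for col in zip(*members)] if members else [0.0] * dim
        )
    return lambda u: [-sq_dist(u, cen) for cen in centroids]


def refine_pseudo_labels(
    utterances: List[Vector], initial_embed: EmbedFn, k: int, n_rounds: int = 3
) -> List[int]:
    """Cluster initial embeddings, then alternate training and re-clustering."""
    labels = kmeans([initial_embed(u) for u in utterances], k)
    for _ in range(n_rounds):
        embed = toy_finetune(utterances, labels, k)        # train on current pseudo-labels
        labels = kmeans([embed(u) for u in utterances], k)  # re-cluster the new embeddings
    return labels


# Toy usage: two well-separated groups of 2-D "utterance features".
utts = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
labels = refine_pseudo_labels(utts, initial_embed=lambda u: u, k=2)
```

In a real system the clustering would run over embeddings of a large unlabeled training set, and the number of clusters and refinement rounds would be tuning choices; the sketch only shows the control flow of alternating between clustering and training.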

Accepted at INTERSPEECH 2024.

Related Organizations

Sorbonne University, France
Graduate School of Computer Science and Advanced Technologies, France
Sorbonne Paris Cité, France

Keywords

[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], FOS: Computer and information sciences, Sound (cs.SD), Speech Representations, [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG], Speaker Recognition, [INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD], Machine Learning (cs.LG), Machine Learning, Self-Supervised Learning, Sound, ASR, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Audio and Speech Processing

Related research products (2)

wavlm_ssl_sv, software on GitHub (IsRelatedTo)
unilm, software on GitHub (IsRelatedTo)

Impact (by BIP!)

selected citations: 0
popularity: Average
influence: Average
impulse: Average
