
Recent advancements in Self-Supervised Learning (SSL) have shown promising results in Speaker Verification (SV). However, narrowing the performance gap with supervised systems remains an ongoing challenge. Several studies have observed that speech representations from large-scale ASR models contain valuable speaker information. This work explores the limitations of fine-tuning these models for SV using an SSL contrastive objective in an end-to-end approach. Then, we propose a framework to learn speaker representations in an SSL context by fine-tuning a pre-trained WavLM with a supervised loss using pseudo-labels. Initial pseudo-labels are derived from an SSL DINO-based model and are iteratively refined by clustering the model embeddings. Our method achieves 0.99% EER on VoxCeleb1-O, establishing the new state-of-the-art on self-supervised SV. As this performance is close to our supervised baseline of 0.94% EER, this contribution is a step towards supervised performance on SV with SSL.
accepted at INTERSPEECH 2024
[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], FOS: Computer and information sciences, Sound (cs.SD), Speech Representations, [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG], Speaker Recognition, [INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD], Machine Learning (cs.LG), Machine Learning, Self-Supervised Learning, Sound, ASR, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Audio and Speech Processing
[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], FOS: Computer and information sciences, Sound (cs.SD), Speech Representations, [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG], Speaker Recognition, [INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD], Machine Learning (cs.LG), Machine Learning, Self-Supervised Learning, Sound, ASR, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Audio and Speech Processing
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
