
Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector-quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER.
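The core idea above — masking a fraction of the discrete VQ-VAE token sequence and training the model to recover the hidden codebook indices — can be illustrated with a minimal sketch. All sizes (codebook size, sequence length, mask ratio) are hypothetical placeholders, not values from the paper, and random logits stand in for the transformer's predictions; the point is the masking and loss bookkeeping, not a faithful implementation of VQ-MAE-S.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only (not from the paper)
codebook_size = 512      # number of discrete codes in the VQ-VAE codebook
seq_len = 100            # number of VQ token frames for one utterance
mask_ratio = 0.5         # fraction of tokens hidden during pre-training
MASK_ID = codebook_size  # extra index reserved for the mask token

# Discrete token sequence, as produced by a pre-trained VQ-VAE encoder
tokens = rng.integers(0, codebook_size, size=seq_len)

# Randomly hide a subset of positions by overwriting them with MASK_ID
n_mask = int(mask_ratio * seq_len)
mask_pos = rng.choice(seq_len, size=n_mask, replace=False)
corrupted = tokens.copy()
corrupted[mask_pos] = MASK_ID

# The self-supervised objective is a cross-entropy over codebook indices,
# evaluated only at the masked positions. Random logits stand in for the
# MAE's predictions here, so this only demonstrates the loss computation.
logits = rng.standard_normal((seq_len, codebook_size))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask_pos, tokens[mask_pos]].mean()

print(int((corrupted == MASK_ID).sum()), float(loss) > 0)
```

Because the reconstruction target is a discrete index rather than raw spectrogram values, the loss is a classification objective over the codebook, which is one way the VQ-based MAE differs from an MAE trained directly on the spectrogram.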
https://samsad35.github.io/VQ-MAE-Speech/
Keywords: self-supervised learning, masked autoencoder, vector-quantized variational autoencoder, speech emotion recognition

Subjects: Sound (cs.SD), Machine Learning (cs.LG), Machine Learning (stat.ML), Artificial Intelligence (cs.AI), Audio and Speech Processing (eess.AS), Signal and Image Processing
| Indicator | Description | Value |
| --- | --- | --- |
| Selected citations | Citations derived from selected sources; an alternative to the "influence" indicator, which reflects the overall/total impact of an article via the underlying citation network (diachronically). | 19 |
| Popularity | The "current" impact/attention (the "hype") of the article in the research community at large, based on the underlying citation network. | Top 10% |
| Influence | The overall/total impact of the article in the research community at large, based on the underlying citation network (diachronically). | Top 10% |
| Impulse | The initial momentum of the article directly after its publication, based on the underlying citation network. | Top 10% |
