Name: Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
Keywords: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV)

Publication: Article, Preprint
Date: 27 May 2024
Embargo end date: 01 Jan 2025
Publisher: IEEE
Journal: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)

Authors: Bohy, Hugo; Tran, Minh; Haddad, Kevin El; Dutoit, Thierry; Soleymani, Mohammad

doi: 10.1109/fg59268.2024.10581940, 10.48550/arxiv.2508.17502

arXiv: 2508.17502



Abstract

Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on different social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter detection and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
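As a rough illustration of the masking step that masked autoencoders in general rely on, the sketch below randomly splits a sequence of token indices into a visible set (fed to the encoder) and a masked set (to be reconstructed). It is not taken from the Social-MAE code: the function name `random_mask`, the token counts (8 frames of 196 patches, 512 audio tokens), and the 75% mask ratio are all illustrative assumptions.

```python
import random


def random_mask(num_tokens, mask_ratio, seed=None):
    """Split token indices into (visible, masked) lists.

    Hypothetical helper for illustration only; not the authors' implementation.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)  # random permutation of token positions
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])    # positions hidden from the encoder
    visible = sorted(indices[num_masked:])   # positions the encoder sees
    return visible, masked


# Assumed sizes: 8 video frames x 196 patches each, and 512 audio tokens.
video_visible, video_masked = random_mask(num_tokens=8 * 196, mask_ratio=0.75, seed=0)
audio_visible, audio_masked = random_mask(num_tokens=512, mask_ratio=0.75, seed=1)

print(len(video_visible), len(video_masked))
print(len(audio_visible), len(audio_masked))
```

In a full model, only the visible tokens of each modality would be encoded, and a decoder would be trained to reconstruct the masked ones; the contrastive objective mentioned in the abstract is not shown here.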

5 pages, 3 figures, IEEE FG 2024 conference

Related Organizations

University of California System (United States)
University of Mons (Belgium)
USC Institute for Creative Technologies (United States)


Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
