Prompting Visual-Language Models for Dynamic Facial Expression Recognition

Publication: Article, Conference object, Preprint, 01 Jan 2023
Embargo end date: 01 Jan 2023
Publisher: Zenodo
Journal: CoRR, volume abs/2308.13382
Funded by: EC | AI4Media

Authors: Zengqun Zhao; Ioannis Patras

doi: 10.5281/zenodo.8364266, 10.48550/arxiv.2308.13382, 10.5281/zenodo.8364267

arXiv: 2308.13382

Abstract

This paper presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour that is related to the classes (facial expressions) that we are interested in recognising. Those descriptions are generated using large language models, like ChatGPT. In contrast to works that use only the class names, this more accurately captures the relationship between them. Alongside the textual description, we introduce a learnable token which helps the model learn relevant context information for each expression during training. Extensive experiments demonstrate the effectiveness of the proposed method and show that our DFER-CLIP also achieves state-of-the-art results compared with the current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks. Code is publicly available at https://github.com/zengqunzhao/DFER-CLIP.
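
The matching step implied by the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes the usual CLIP recipe: the video-level embedding is compared with one text embedding per class by cosine similarity, and the scaled similarities are turned into probabilities with a softmax. The descriptor strings, the `[CTX]` placeholder standing in for the learnable context tokens, the function names, and the temperature value are all hypothetical and chosen for illustration only; real embeddings would come from the visual and textual encoders.

```python
import math

# Hypothetical descriptors for illustration only; the paper generates its
# own descriptions with a large language model.
DESCRIPTORS = {
    "happiness": "raised cheeks and lip corners pulled upward",
    "sadness": "drooping eyelids and lip corners pulled downward",
    "surprise": "raised eyebrows, widened eyes and an open mouth",
}


def build_prompts(descriptors, n_context=4):
    """Prefix each description with placeholder context slots.

    "[CTX]" is a stand-in (an assumption of this sketch) for the learnable
    context tokens, which in a real model are trainable vectors, not text.
    """
    context = " ".join(["[CTX]"] * n_context)
    return {label: f"{context} {text}" for label, text in descriptors.items()}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(video_embedding, text_embeddings, temperature=0.01):
    """Return the best-matching label and a softmax over scaled similarities."""
    logits = {
        label: cosine(video_embedding, emb) / temperature
        for label, emb in text_embeddings.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    probs = {label: value / total for label, value in exps.items()}
    return max(probs, key=probs.get), probs


if __name__ == "__main__":
    prompts = build_prompts(DESCRIPTORS, n_context=2)
    # Toy 2-D vectors standing in for encoder outputs.
    toy_text_embeddings = {
        "happiness": [0.9, 0.1],
        "sadness": [-1.0, 0.0],
        "surprise": [0.0, 1.0],
    }
    label, probs = classify([1.0, 0.0], toy_text_embeddings)
    print(prompts[label])
    print(label)
```

In this sketch the text side is reduced to string prompts and fixed toy vectors. In a trained model, both the context tokens and the temporal "class" token would be learned jointly, which is the part the sketch deliberately leaves out.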

Accepted at BMVC 2023 (Camera-Ready Version)

Related Organizations

Queen Mary University of London
United Kingdom

Keywords

FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Impact by BIP!

- selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.