Facial Expression Recognition of Static and Dynamic Emotions: A Tutorial and Review

1Fudan University, 2Shanghai Ocean University, 3Beijing Institute of Technology

Taxonomy Overview

Taxonomy Overview Image.

Taxonomy of FER of static and dynamic emotions. We present a hierarchical taxonomy that categorizes existing FER models by input type, task challenges, and network structures within a systematic framework, aiming to provide a comprehensive overview of the current FER research landscape. First, we introduce datasets, metrics, and the overall workflow, and collect the related literature and code in a public GitHub repository (Sec. 1, 2, 3). Then, we show how image-based SFER (Sec. 4) and video-based DFER (Sec. 5) methods overcome their respective task challenges through various learning strategies and model designs. Next, we analyze recent advances of FER on benchmark datasets (Sec. 6). Finally, we discuss important issues and potential trends in FER, highlighting directions for future development (Sec. 7, 8, 9).

Abstract

Facial expression recognition (FER) is pivotal in analyzing emotional states from static images and dynamic sequences, leveraging AI technologies to enhance anthropomorphic communication among humans, robots, and digital avatars. As the field evolves from controlled laboratory environments to more complex in-the-wild scenarios, existing FER-related reviews fail to adequately address the task challenges encountered in these new contexts. This paper offers a comprehensive survey of both image-based static FER (SFER) and video-based dynamic FER (DFER) methods, shifting the analysis from model-oriented development to challenge-focused categorization.

We begin with a critical comparison of recent reviews, an introduction to common datasets and evaluation criteria, and an in-depth description of the FER workflow to establish a solid research foundation. We then systematically review representative approaches that address eight main challenges in SFER (such as expression disturbance, uncertainties, compound emotions, and cross-domain inconsistency) and seven main challenges in DFER (such as key frame sampling, expression intensity variations, and cross-modal alignment). Additionally, we analyze recent advancements, benchmark performances, major applications, and ethical considerations. Finally, we propose five promising future directions and development trends to guide ongoing research.

Comparisons with SOTA FER-related reviews

Comparisons with SOTA FER-related reviews Image.

Datasets

Datasets.

Image-based static facial frames (Above) and video-based dynamic facial sequences (Below) of seven basic emotions in the lab and wild. Samples are from (a) JAFFE [27], (b) CK+ [28], (c) SFEW [29], (d) ExpW [30], (e) RAF-DB [31], (f) AffectNet [32], (g) EmotioNet [33], (h) CK+ [28], (i) Oulu-CASIA [34], (j) DFEW [15], (k) FERV39k [35], and (l) MAFW [36].

Workflow of Generic Facial Expression Recognition

Workflow Image.

The workflow and main components of generic facial expression recognition.

Image-based Static FER

Image-based static facial expression recognition (SFER) extracts features from a single image, capturing complex spatial information related to facial expressions, such as landmarks and their geometric structures and relationships. In the following, we first introduce the general architecture of SFER and then elaborate on the specific designs of SFER methods from a challenge-solving perspective, covering disturbance-invariant SFER, 3D SFER, uncertainty-aware SFER, compound SFER, cross-domain SFER, weak-supervised SFER, and cross-modal SFER.

General SFER

General SFER Image.

The architecture of general SFER. Figure is reproduced based on (a) the CNN-based model, (b) the GCN-based model, and (c) the Transformer-based model.
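
To make the CNN-based pipeline concrete, here is a minimal PyTorch sketch of an image-based SFER classifier. The ResNet-18 backbone, the seven-class head, and the 224×224 input size are illustrative assumptions, not the configuration of any specific surveyed model.

```python
import torch
import torch.nn as nn
from torchvision import models

class SimpleSFER(nn.Module):
    """Minimal CNN-based SFER sketch: CNN backbone + linear emotion head."""
    def __init__(self, num_classes: int = 7):  # 7 basic emotions (assumed)
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # keep the 512-d pooled feature
        self.backbone = backbone
        self.head = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)   # (B, 512) global spatial feature
        return self.head(feat)    # (B, num_classes) emotion logits

logits = SimpleSFER()(torch.randn(2, 3, 224, 224))  # aligned face crops
print(logits.shape)  # torch.Size([2, 7])
```

A GCN-based variant would operate on landmark graphs instead of pixels, and a Transformer-based one on patch tokens; the classification head and training loss stay essentially the same.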

Disturbance-invariant SFER

Disturbance-invariant SFER Image.

The architecture of disturbance-invariant SFER. Figure is reproduced based on (a) Attention-based model (AMP-Net) and (b) Decomposition-based model.
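
As a hedged illustration of the attention idea behind such models (not the actual AMP-Net), the sketch below learns a per-region importance score so that occluded or disturbed facial regions contribute less to the pooled expression feature; the region count and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class RegionAttentionPool(nn.Module):
    """Toy attention pooling over facial-region features (illustrative only)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-region importance score

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, R, D) features of R facial regions
        attn = torch.softmax(self.score(region_feats), dim=1)  # (B, R, 1)
        return (attn * region_feats).sum(dim=1)  # (B, D) disturbance-suppressed feature

fused = RegionAttentionPool()(torch.randn(2, 5, 512))  # e.g., eyes, brows, nose, mouth, chin
print(fused.shape)  # torch.Size([2, 512])
```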

3D SFER

3D SFER Image.

The architecture of 3D SFER. Figure is reproduced based on (a) GAN-based learning (GAN-Int) and (b) Multi-view learning (MV-CNN).

Uncertainty-aware SFER

Uncertainty-aware SFER Image.

The architecture of uncertainty-aware SFER. Figure is reproduced based on (a) label-uncertainty learning (LA-Net) and (b) data-uncertainty learning (LNSU-Net).
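
A recurring ingredient of label-uncertainty learning (shown here as a generic sketch, not the exact LA-Net or LNSU-Net formulation) is down-weighting samples whose annotations the model deems unreliable. The auxiliary weighting branch and the regularization coefficient below are hypothetical.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_ce(logits, labels, weight_logits, reg: float = 0.1):
    """Cross-entropy where each sample is scaled by a learned reliability weight."""
    w = torch.sigmoid(weight_logits).squeeze(-1)           # (B,) reliability in (0, 1)
    ce = F.cross_entropy(logits, labels, reduction="none")
    # The -log(w) term keeps weights from collapsing to zero for all samples.
    return (w * ce).mean() - reg * torch.log(w + 1e-8).mean()

logits = torch.randn(4, 7, requires_grad=True)         # emotion predictions
weight_logits = torch.randn(4, 1, requires_grad=True)  # from an auxiliary branch (assumed)
loss = uncertainty_weighted_ce(logits, torch.tensor([0, 3, 6, 2]), weight_logits)
loss.backward()
```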

Compound SFER

Compound emotions are complex emotional states formed by the combination of at least two basic emotions; they are not independent, discrete categories but lie on a continuous, multi-dimensional emotional spectrum. Compared with discrete basic emotions or a few affective dimensions, compound emotions more accurately represent the diversity and continuity of complex human emotions.
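
To illustrate the label structure (a hedged sketch; the exact compound taxonomy varies by dataset), a compound emotion such as "happily surprised" can be encoded as a multi-hot vector over the basic emotions, which a model can then predict with a multi-label loss:

```python
import torch

BASIC = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def compound_to_multihot(components: list[str]) -> torch.Tensor:
    """Encode a compound emotion as a multi-hot vector over basic emotions."""
    vec = torch.zeros(len(BASIC))
    for name in components:
        vec[BASIC.index(name)] = 1.0
    return vec

# "happily surprised" = happiness + surprise
print(compound_to_multihot(["happiness", "surprise"]))
# tensor([0., 0., 0., 1., 0., 1., 0.])
```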

Cross-domain SFER

Cross-domain SFER Image.

The architecture of cross-domain SFER. Figure is reproduced based on (a) the transfer learning-based model (CSRL) and (b) the adaption learning-based model (AGRA).
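
As a generic illustration of adversarial domain adaptation (a common building block in this area, not the graph-based AGRA model itself), a gradient reversal layer lets a domain classifier push the shared encoder toward domain-invariant expression features:

```python
import torch
import torch.nn as nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lam)

feats = torch.randn(8, 512, requires_grad=True)   # shared expression features
domain_head = nn.Linear(512, 2)                   # source vs. target classifier
domain_logits = domain_head(grad_reverse(feats))  # encoder gradients are reversed
```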

Weak-supervised SFER

Weak-supervised SFER Image.

The architecture of weak-supervised SFER. Figure is reproduced based on Ada-CM.
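
Ada-CM adaptively learns a confidence margin; the sketch below shows only the generic fixed-threshold pseudo-labeling step that such semi-supervised methods refine (the 0.95 threshold is an assumption):

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(logits_weak, logits_strong, threshold: float = 0.95):
    """Let confident predictions on weak augmentations supervise strong ones."""
    probs = F.softmax(logits_weak.detach(), dim=1)
    conf, pseudo = probs.max(dim=1)
    mask = (conf >= threshold).float()  # keep only confident unlabeled samples
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * loss).mean()

loss = pseudo_label_loss(torch.randn(16, 7), torch.randn(16, 7, requires_grad=True))
loss.backward()
```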

Cross-modal SFER

Cross-modal SFER Image.

The architecture of cross-modal SFER. Figure is reproduced based on CEPrompt.

Video-based Dynamic Facial Expression Recognition

Video-based DFER involves analyzing facial expressions that change over time, necessitating a framework that effectively integrates spatial and temporal information. The core objective of DFER is to extract and learn the features of expression changes from video or image sequences. Due to the complexity and diversity of the input sequences, DFER faces various task challenges. Based on the different solution approaches, these challenges can be categorized into seven basic types: general DFER, sampling-based DFER, expression intensity-aware DFER, static-to-dynamic FER, multi-modal DFER, self-supervised DFER, and visual-language DFER.

General DFER

General DFER Image.

The architecture of general DFER. Figure is reproduced based on (a) the CNN-RNN-based model (SAANet) and (b) the Transformer-based model (EST).
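
A minimal CNN-RNN skeleton for DFER (a sketch in the spirit of, though much simpler than, SAANet): a shared CNN encodes each frame and a GRU aggregates the sequence. The backbone, hidden size, and clip length are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNRNNDFER(nn.Module):
    """Per-frame CNN features -> GRU temporal aggregation -> emotion logits."""
    def __init__(self, num_classes: int = 7, hidden: int = 256):
        super().__init__()
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Identity()
        self.cnn = cnn
        self.gru = nn.GRU(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W) sampled frame sequence
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # (B, T, 512)
        _, h = self.gru(feats)                               # h: (1, B, hidden)
        return self.head(h[-1])                              # (B, num_classes)

logits = CNNRNNDFER()(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```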

Sampling-based DFER

Sampling-based DFER Image.

The architecture of sampling-based DFER. Figure is reproduced based on explainable sampling (Freq-HD).
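
Sampling strategies decide which frames the model actually sees. The sketch below contrasts plain uniform sampling with a crude motion-based key-frame picker; it is a hedged stand-in for illustration, not the frequency-domain Freq-HD method.

```python
import torch

def uniform_sample(num_frames: int, clip_len: int = 8) -> torch.Tensor:
    """Evenly spaced frame indices across the whole video."""
    return torch.linspace(0, num_frames - 1, clip_len).long()

def motion_keyframes(frames: torch.Tensor, clip_len: int = 8) -> torch.Tensor:
    """Pick frames with the largest inter-frame change (a crude dynamics proxy)."""
    diff = (frames[1:] - frames[:-1]).abs().flatten(1).mean(dim=1)  # (T-1,)
    idx = diff.topk(min(clip_len, diff.numel())).indices + 1
    return idx.sort().values

frames = torch.randn(64, 3, 112, 112)  # a 64-frame face clip (toy data)
print(uniform_sample(64))  # tensor([ 0,  9, 18, 27, 36, 45, 54, 63])
print(motion_keyframes(frames))
```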

Expression Intensity-aware DFER

Facial expressions are inherently dynamic, with intensity either gradually shifting from neutral to peak and back or abruptly transitioning from peak to neutral, making the accurate capture of these fluctuations essential for understanding expression dynamics.
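
One simple way to expose intensity to a model (a hypothetical sketch rather than a specific surveyed method) is to score each frame by its feature distance from a neutral reference and weight the temporal pooling accordingly, so peak frames dominate the clip-level feature:

```python
import torch

def intensity_weighted_pool(frame_feats: torch.Tensor, neutral_feat: torch.Tensor):
    """Weight frames by their distance from a neutral-face feature before pooling."""
    # frame_feats: (T, D); neutral_feat: (D,)
    intensity = (frame_feats - neutral_feat).norm(dim=1)  # (T,) peak frames score high
    w = torch.softmax(intensity, dim=0).unsqueeze(1)      # (T, 1) normalized weights
    return (w * frame_feats).sum(dim=0)                   # (D,) intensity-aware feature

clip_feat = intensity_weighted_pool(torch.randn(16, 512), torch.randn(512))
print(clip_feat.shape)  # torch.Size([512])
```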

Static to Dynamic FER

Static-to-dynamic FER leverages high-performance SFER knowledge to explore appearance features and dynamic dependencies, as sketched below.
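
In practice this often amounts to initializing the per-frame encoder of a video model with weights from a trained SFER backbone; a minimal sketch, assuming both share a ResNet-18 architecture (the checkpoint workflow is hypothetical):

```python
import torch
from torchvision import models

# Hypothetical SFER backbone trained on static images in a first stage.
sfer_backbone = models.resnet18(weights=None)
sfer_backbone.fc = torch.nn.Identity()
# torch.save(sfer_backbone.state_dict(), "sfer_resnet18.pt")

# Reuse it as the frame encoder of a DFER model (cf. the CNN-RNN sketch above);
# the temporal module is then trained on top of the transferred features.
dfer_frame_encoder = models.resnet18(weights=None)
dfer_frame_encoder.fc = torch.nn.Identity()
dfer_frame_encoder.load_state_dict(sfer_backbone.state_dict())  # static -> dynamic init
```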

Multi-modal DFER

Multi-modal DFER Image.

The architecture of multi-modal DFER. Figure is reproduced based on the fusion-based model (T-MEP).
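
A common fusion pattern (shown generically here; T-MEP's actual design is more elaborate) is cross-attention in which visual tokens query audio or text tokens. The token counts and embedding size below are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Visual tokens attend to audio tokens via multi-head cross-attention."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, Tv, D) frame tokens; aud: (B, Ta, D) audio tokens
        fused, _ = self.attn(query=vis, key=aud, value=aud)
        return self.norm(vis + fused)  # residual keeps the visual stream primary

out = CrossModalFusion()(torch.randn(2, 16, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```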

Self-supervised DFER

Self-supervised DFER Image.

The architecture of self-supervised DFER. Figure is reproduced based on MAE-DFER.
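
The core of masked autoencoding is to reconstruct heavily masked spatiotemporal tokens from the visible ones. The sketch below shows plain random token masking for brevity (MAE-DFER itself uses structured masking and an efficient encoder); the token counts and 90% ratio follow common MAE practice but are assumptions here.

```python
import torch

def random_token_mask(tokens: torch.Tensor, mask_ratio: float = 0.9):
    """Keep a random subset of video tokens; the rest become reconstruction targets."""
    b, n, d = tokens.shape
    keep = max(1, int(n * (1 - mask_ratio)))
    perm = torch.rand(b, n).argsort(dim=1)  # a random permutation per sample
    visible_idx = perm[:, :keep]            # fed to the encoder
    masked_idx = perm[:, keep:]             # predicted by the decoder
    visible = torch.gather(tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, visible_idx, masked_idx

tokens = torch.randn(2, 1568, 768)  # patchified face-video tokens (assumed sizes)
visible, vis_idx, mask_idx = random_token_mask(tokens)
print(visible.shape)  # torch.Size([2, 156, 768]) with 90% masking
```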

Visual-Language DFER

Visual-Language DFER Image.

The architecture of visual-language DFER. Figure is reproduced based on DFER-CLIP.
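
DFER-CLIP builds on CLIP by pairing visual features with learnable textual descriptions of each emotion. The sketch below shows the simpler zero-shot CLIP recipe underlying that idea, using the Hugging Face transformers API; the prompt template and the blank placeholder image are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

emotions = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]
prompts = [f"a photo of a face expressing {e}" for e in emotions]

face = Image.new("RGB", (224, 224))  # stand-in for an aligned face crop
inputs = processor(text=prompts, images=face, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, 7) image-text similarity
print(emotions[logits.softmax(dim=1).argmax().item()])
```

For video, per-frame features are typically pooled before the similarity step, and the fixed prompts are replaced by learnable ones.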

BibTeX

@article{wang2024survey,
  author    = {Yan Wang and Shaoqi Yan and Yang Liu and Wei Song and Jing Liu and Yang Chang and Xinji Mai and Xiping Hu and Wenqiang Zhang and Zhongxue Gan},
  title     = {A Survey on Facial Expression Recognition of Static and Dynamic Emotions},
  journal   = {arXiv preprint arXiv:2408.15777},
  year      = {2024},
}