A Survey on Facial Expression Recognition of Static and Dynamic Emotions

This is the official repository for our paper entitled “A Survey on Facial Expression Recognition of Static and Dynamic Emotions”.

Yan Wang 1, Shaoqi Yan 1, Yang Liu 1, Wei Song 3, Jing Liu 1, Yang Chang 1, Xinji Mai 1, Xiping Hu 4, Wenqiang Zhang 1,2, Zhongxue Gan 1

1 Academy for Engineering & Technology, Fudan University 2 School of Computer Science, Fudan University 3 Shanghai Ocean University 4 Beijing Institute of Technology

Taxonomy Overview

images/taxonomy_overview.png

Taxonomy of FER for static and dynamic emotions. We present a hierarchical taxonomy that categorizes existing FER models by input type, task challenges, and network structures within a systematic framework, aiming to provide a comprehensive overview of the current FER research landscape. First, we introduce datasets, metrics, and the generic workflow, with the corresponding literature and code collected in this public GitHub repository. Then, we review how image-based SFER and video-based DFER overcome different task challenges through various learning strategies and model designs. Next, we analyze recent advances of FER on benchmark datasets. Finally, we discuss important open issues and potential trends in FER, highlighting directions for future development.

Comparisons with SOTA FER-related reviews

images/comparisons.png

Datasets

images/Datasets.jpg

Image-based static facial frames (Above) and video-based dynamic facial sequences (Below) of seven basic emotions in the lab and wild. Samples are from (a) JAFFE, (b) CK+, (c) SFEW, (d) ExpW, (e) RAF-DB, (f) AffectNet, (g) EmotioNet, (h) CK+, (i) Oulu-CASIA, (j) DFEW, (k) FERV39k, and (l) MAFW.

| Modality | Scene | Datasets | Year | ECT | Emotion | Training Numbers | Testing Numbers |
|---|---|---|---|---|---|---|---|
| Image-based SFER Datasets | Lab | JAFFE | 1998 | P | Sev | 213 | 213 |
| | Lab | CK+ | 2010 | P/I | Sev | 241 | 241 |
| | Lab | MMI | 2010 | P | Sev | 370 | 370 |
| | Lab | Oulu-CASIA | 2011 | P | Sev | 720 | 240 |
| | Lab | RaFD | 2010 | P | Sev, C | 1,448 | 160 |
| | Wild | FER-2013 | 2013 | P/I | Sev | 28,709 | 3,589 |
| | Wild | SFEW 2.0 | 2011 | P/I | Sev | 958 | 436 |
| | Wild | EmotioNet | 2016 | P/I | Sev, C | 80,000 | 20,000 |
| | Wild | RAF-DB | 2017 | P/I | Sev, Com | 12,271 | 3,068 |
| | Wild | AffectNet | 2017 | P/I | Sev, Con. | 283,901 | 3,500 |
| | Wild | ExpW | 2017 | P/I | Sev | 75,048 | 16,745 |
| | Lab (3D) | BU-3DFE | 2006 | P | Sev | 2,000 | 500 |
| | Lab (3D) | Bosphorus | 2008 | P | Sev | 2,326 | 2,326 |
| | Lab (3D) | 4DFAB | 2018 | P/I | Sev | 1,440k | 360k |
| Video-based DFER Datasets | Lab | CK+ | 2010 | P/I | Sev | 241 | 241 |
| | Lab | MMI | 2010 | P/I | Sev | 1,450 | 1,450 |
| | Lab | Oulu-CASIA | 2011 | P | Six | 2,160 | 720 |
| | Wild | AFEW 8.0 | 2011 | P/I | Sev | 773 | 383 |
| | Wild | CAER | 2019 | P/I | Sev | 9,240 | 2,640 |
| | Wild | DFEW | 2020 | P/I | Sev | 12,000 | 3,000 |
| | Wild | FERV39k | 2022 | P/I | Sev | 35,887 | 3,000 |
| | Wild | MAFW | 2022 | P/I | Sev, C, A, D, H, Com | 8,036 | 2,009 |

Summary of the in-the-lab and in-the-wild datasets with static and dynamic emotions for FER training and evaluation. ECT: Elicitation; P: Posed; I: Instinctive; Sev: Seven Basic Emotions (Happy, Angry, Surprise, Fear, Sad, Disgust, Neutral); Six: Six Basic Emotions (Sev without Neutral); C: Contempt; A: Anxiety; D: Disappointment; H: Helplessness; Com: Compound.

Workflow of Generic Facial Expression Recognition

images/Workflow.jpg

The workflow and main components of generic facial expression recognition.

Image-based Static FER

Image-based static facial expression recognition (SFER) extracts features from a single image, capturing the complex spatial information related to facial expressions, such as landmarks and their geometric structures and relationships. In the following, we first introduce the general architecture of SFER, and then elaborate on the specific designs of SFER methods from a challenge-solving perspective, covering disturbance-invariant SFER, 3D SFER, uncertainty-aware SFER, compound SFER, cross-domain SFER, weak-supervised SFER, and cross-modal SFER.

General SFER

images/gernearl_based_SFER_00.jpg

The architecture of general SFER. Figure is reproduced based on (a) CNN-based model, (b) GCN-based model, and (c) Transformer-based model.
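
For orientation, the sketch below shows the simplest form of the CNN-based pipeline in (a): a convolutional backbone pooled into a single feature vector, followed by a linear expression classifier. The ResNet-18 backbone and 7-class head are illustrative assumptions, not the exact configuration of any surveyed model.

```python
# Minimal sketch of a CNN-based SFER pipeline (assumed setup, not a surveyed model).
import torch
import torch.nn as nn
from torchvision import models

class SimpleSFER(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        backbone = models.resnet18(weights=None)   # ImageNet weights optional
        backbone.fc = nn.Identity()                # keep the 512-d pooled feature
        self.backbone = backbone
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 224, 224) aligned face crops
        feat = self.backbone(x)          # (B, 512) spatial feature
        return self.classifier(feat)     # (B, 7) expression logits

model = SimpleSFER()
logits = model(torch.randn(2, 3, 224, 224))  # -> torch.Size([2, 7])
```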

Disturbance-invariant SFER

images/Disturbance_invariant_SFER_00.jpg

The architecture of disturbance-invariant SFER. Figure is reproduced based on (a) Attention-based model (AMP-Net) and (b) Decomposition-based model.
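
The attention mechanism in (a) can be caricatured as learned confidence scores over local facial regions, so that occluded or otherwise disturbed regions are down-weighted before aggregation. The sketch below illustrates this idea only; it is not AMP-Net, and the region count and feature size are assumptions.

```python
# Hedged sketch of attention-based disturbance suppression over facial regions.
import torch
import torch.nn as nn

class RegionAttentionPool(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, R, D) features of R cropped facial regions
        w = self.score(region_feats)              # (B, R, 1) per-region confidence
        w = w / (w.sum(dim=1, keepdim=True) + 1e-6)
        return (w * region_feats).sum(dim=1)      # (B, D) disturbance-suppressed feature

pool = RegionAttentionPool()
fused = pool(torch.randn(2, 5, 512))  # 5 regions -> (2, 512)
```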

3D SFER

images/3D_SFER_00.jpg

The architecture of 3D SFER. Figure is reproduced based on (a) GAN-based learning (GAN-Int) and (b) Multi-view learning (MV-CNN).

Uncertainty-aware SFER

images/Uncertainy_aware_SFER_00.jpg

The architecture of uncertainty-aware SFER. Figure is reproduced based on (a) the label uncertainty learning (LA-Net) and (b) data uncertainty learning (LNSU-Net).
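
A recurring ingredient of uncertainty-aware SFER is a learned per-sample confidence that suppresses the gradient contribution of likely mislabeled faces. The sketch below shows one self-cure-style variant of this idea (logits scaled by a learned confidence before the cross-entropy loss); it is a simplification for illustration, not LA-Net or LNSU-Net.

```python
# Sketch of per-sample confidence weighting against noisy labels (simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyHead(nn.Module):
    def __init__(self, dim: int = 512, num_classes: int = 7):
        super().__init__()
        self.cls = nn.Linear(dim, num_classes)
        self.conf = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, target: torch.Tensor):
        logits = self.cls(feat)                    # (B, C) expression logits
        alpha = self.conf(feat).squeeze(-1)        # (B,) learned sample confidence
        # scale logits by confidence so uncertain (possibly mislabeled) samples
        # contribute smaller gradients to the cross-entropy loss
        loss = F.cross_entropy(logits * alpha.unsqueeze(-1), target)
        return logits, alpha, loss

head = UncertaintyHead()
logits, alpha, loss = head(torch.randn(8, 512), torch.randint(0, 7, (8,)))
```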

Compound SFER

Compound emotions are complex emotional states formed by the combination of at least two basic emotions; they are not independent, discrete categories but lie within a continuous emotional spectrum composed of multiple dimensions. Compared with discrete "basic" emotions or a few affective dimensions, compound emotions represent the diversity and continuity of complex human emotions more faithfully.
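
As a toy illustration of this label structure, a compound category can be encoded as a multi-hot vector over the basic emotions. The compound names below are common examples, not any dataset's official list.

```python
# Toy encoding of compound emotions as combinations of basic emotions
# (label names are illustrative examples).
BASIC = ["happy", "angry", "surprise", "fear", "sad", "disgust", "neutral"]

COMPOUND = {
    "happily surprised": {"happy", "surprise"},
    "sadly angry": {"sad", "angry"},
    "fearfully surprised": {"fear", "surprise"},
}

def to_multi_hot(label: str) -> list[int]:
    # a compound label becomes a multi-hot vector over the basic emotions
    parts = COMPOUND.get(label, {label})
    return [1 if b in parts else 0 for b in BASIC]

print(to_multi_hot("happily surprised"))  # [1, 0, 1, 0, 0, 0, 0]
```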

Cross-domain SFER

images/crossdomain_sfer.jpg

The architecture of cross-domain SFER. Figure is reproduced based on (a) the transfer learning-based model (CSRL) and (b) the adaptation learning-based model (AGRA).
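
A common adaptation-learning ingredient is adversarial alignment of source and target feature distributions via a gradient reversal layer. The DANN-style sketch below shows that mechanism in isolation; AGRA and CSRL build considerably more structure (graph reasoning, semantic-aware representations) on top of such alignment.

```python
# DANN-style gradient reversal layer: the feature extractor is trained to
# fool a source/target domain classifier (generic sketch, not AGRA itself).
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # reverse (and scale) gradients flowing back into the feature extractor
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

feat = torch.randn(4, 512, requires_grad=True)
domain_logits = grad_reverse(feat, 0.5).sum()   # stand-in for a domain head
domain_logits.backward()                        # feat.grad is negated and scaled
```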

Weak-supervised SFER

images/Weak-supervised.jpg

The architecture of weak-supervised SFER. Figure is reproduced based on the Ada-CM.
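
Semi-supervised SFER methods of this family typically pseudo-label unlabeled faces from a weakly augmented view and train on a strongly augmented view, keeping only confident samples. The sketch below shows that generic mechanism with a fixed threshold; Ada-CM's contribution is precisely to adapt this confidence margin during training, which is omitted here.

```python
# Confidence-thresholded pseudo-labeling for unlabeled faces (generic sketch;
# the fixed threshold stands in for Ada-CM's adaptive margin).
import torch
import torch.nn.functional as F

def pseudo_label_loss(logits_weak, logits_strong, threshold: float = 0.95):
    # logits_weak/strong: (B, C) predictions on weakly/strongly augmented views
    probs = F.softmax(logits_weak.detach(), dim=-1)
    conf, pseudo = probs.max(dim=-1)              # per-sample confidence + label
    mask = (conf >= threshold).float()            # keep only confident samples
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * loss).mean()

loss = pseudo_label_loss(torch.randn(8, 7), torch.randn(8, 7))
```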

Cross-modal SFER

images/Cross-modal.png

The architecture of cross-modal SFER. Figure is reproduced based on CEPrompt.
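
As a point of reference, the snippet below runs zero-shot expression classification with a public CLIP checkpoint and hand-written prompts. CEPrompt goes beyond this baseline with learnable, expression-aware prompting; the checkpoint name, prompt template, and image path here are illustrative choices.

```python
# Zero-shot expression classification with CLIP via Hugging Face transformers
# (a baseline sketch; CEPrompt adds learnable prompting on top of this idea).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

emotions = ["happy", "angry", "surprised", "fearful", "sad", "disgusted", "neutral"]
prompts = [f"a photo of a {e} face" for e in emotions]

image = Image.open("face.jpg")  # an aligned face crop (path is illustrative)
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)  # (1, 7) expression probabilities
print(emotions[probs.argmax().item()])
```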

Video-based Dynamic Facial Expression Recognition

Video-based dynamic facial expression recognition (DFER) analyzes facial expressions that change over time, which requires a framework that effectively integrates spatial and temporal information. The core objective of DFER is to extract and learn features of expression changes from video or image sequences. Owing to the complexity and diversity of such inputs, DFER faces various task challenges. Based on the solution approaches, these challenges can be categorized into seven basic types: general DFER, sampling-based DFER, expression intensity-aware DFER, multi-modal DFER, static to dynamic FER, self-supervised DFER, and visual-language DFER.

General DFER

images/generaldfer.png

The architecture of general DFER. Figure is reproduced based on (a) CNN-RNN based model (SAANet) and (b) the transformer-based model (EST).
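
A minimal version of the CNN-RNN design in (a) extracts a feature per frame with a shared CNN and summarizes the sequence with a recurrent layer. The sketch below assumes a ResNet-18 backbone and a single GRU; SAANet adds self- and spatial-attention on top of this skeleton.

```python
# Minimal CNN-RNN sketch for DFER: shared per-frame CNN + GRU over time.
import torch
import torch.nn as nn
from torchvision import models

class CnnRnnDFER(nn.Module):
    def __init__(self, num_classes: int = 7, hidden: int = 256):
        super().__init__()
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Identity()                     # 512-d per-frame feature
        self.cnn = cnn
        self.rnn = nn.GRU(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W) frame sequence
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # (B, T, 512)
        _, h = self.rnn(feats)                               # h: (1, B, hidden)
        return self.head(h[-1])                              # (B, 7) clip logits

logits = CnnRnnDFER()(torch.randn(2, 8, 3, 112, 112))
```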

Sampling-based DFER

images/fig9-Sampling-based_dfer_00.jpg

The architecture of sampling-based DFER. Figure is reproduced based on explainable sampling (Freq-HD).
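
Most methods in the benchmark tables below rely on segment-based dynamic sampling (DS): the clip is split into K equal segments and one frame is drawn per segment, giving a fixed-length input that still spans the whole video. A minimal sketch:

```python
# Segment-based dynamic sampling: one frame per segment, random in training,
# segment centers in evaluation (a common strategy, sketched here).
import random

def dynamic_sample(num_frames: int, k: int = 8, train: bool = True) -> list[int]:
    edges = [int(i * num_frames / k) for i in range(k + 1)]
    if train:   # random frame per segment acts as temporal augmentation
        return [random.randrange(a, max(a + 1, b)) for a, b in zip(edges, edges[1:])]
    return [(a + max(a, b - 1)) // 2 for a, b in zip(edges, edges[1:])]  # centers

print(dynamic_sample(67, k=8, train=False))  # 8 evenly spread frame indices
```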

Expression Intensity-aware DFER

Facial expressions are inherently dynamic, with intensity either gradually shifting from neutral to peak and back or abruptly transitioning from peak to neutral, making the accurate capture of these fluctuations essential for understanding expression dynamics.
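
One simple way to make a model intensity-aware is to estimate per-frame intensity in feature space and let high-intensity frames dominate the clip representation. The sketch below uses the distance from the first frame as a neutral reference, which is an assumption for illustration; published designs such as GCA+IAL learn this weighting instead.

```python
# Illustrative intensity-weighted temporal pooling (simplified assumption:
# frame 0 serves as the neutral reference).
import torch

def intensity_weighted_pool(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (B, T, D) per-frame features of a clip
    dist = (frame_feats - frame_feats[:, :1]).norm(dim=-1)   # (B, T) intensity proxy
    w = torch.softmax(dist, dim=-1).unsqueeze(-1)            # peak frames get weight
    return (w * frame_feats).sum(dim=1)                      # (B, D) clip feature

pooled = intensity_weighted_pool(torch.randn(2, 16, 512))
```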

Static to Dynamic FER

Static to dynamic FER leverages high-performance SFER knowledge to explore appearance features and dynamic dependencies.
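
In its simplest form, this means loading a pretrained static FER backbone, freezing it, and training only a lightweight temporal head on video data. The sketch below assumes a ResNet-18 SFER checkpoint (path hypothetical) and mean pooling over time; S2D instead inserts temporal adapters into a ViT.

```python
# Static-to-dynamic sketch: frozen SFER backbone + trainable temporal head.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()
# backbone.load_state_dict(torch.load("sfer_pretrained.pth"))  # hypothetical SFER weights
for p in backbone.parameters():
    p.requires_grad = False                     # reuse static knowledge as-is

temporal_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7))

clip = torch.randn(2, 8, 3, 112, 112)           # (B, T, C, H, W) face clip
with torch.no_grad():
    feats = backbone(clip.flatten(0, 1)).view(2, 8, -1)  # (B, T, 512)
logits = temporal_head(feats.mean(dim=1))       # mean pooling over time -> (B, 7)
```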

Multi-modal DFER

images/multi_modal_fusion_dfer.png

The architecture of multi-modal DFER. Figure is reproduced based on the fusion-based model (T-MEP).
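
Fusion-based designs commonly let one modality attend to another with cross-attention. The sketch below shows a single audio-to-visual fusion block; T-MEP fuses audio, visual, and textual streams with transformer-based modules, so this is an illustrative reduction, with token counts and dimensions assumed.

```python
# One directed cross-attention fusion block: visual tokens attend to audio.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, Tv, D) frame tokens; audio: (B, Ta, D) audio tokens
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + fused)      # audio-conditioned visual tokens

fusion = CrossModalFusion()
out = fusion(torch.randn(2, 16, 256), torch.randn(2, 50, 256))
```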

Self-supervised DFER

images/Self_supervised_DFER_00.jpg

The architecture of self-supervised DFER. Figure is reproduced based on MAE-DFER.
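
The pretext task is to mask most spatiotemporal tokens of a face video and reconstruct them from the visible remainder. The sketch below keeps only the masking-and-reconstruction logic, with stand-in linear encoder/decoder modules; MAE-DFER's efficient video encoder and decoder are far more capable.

```python
# Conceptual masked-autoencoding step: hide ~90% of tokens, reconstruct all.
import torch
import torch.nn as nn

def mae_step(tokens: torch.Tensor, encoder: nn.Module, decoder: nn.Module,
             mask_ratio: float = 0.9) -> torch.Tensor:
    # tokens: (B, N, D) patch/tube embeddings of a face video
    b, n, d = tokens.shape
    keep = max(1, int(n * (1 - mask_ratio)))
    idx = torch.rand(b, n).argsort(dim=1)[:, :keep]          # random visible set
    visible = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
    latent = encoder(visible)                                # (B, keep, D)
    recon = decoder(latent.mean(dim=1))                      # (B, N*D), kept tiny
    return nn.functional.mse_loss(recon, tokens.flatten(1))  # reconstruction loss

enc = nn.Linear(128, 128)        # stand-ins for transformer encoder/decoder
dec = nn.Linear(128, 64 * 128)
loss = mae_step(torch.randn(2, 64, 128), enc, dec)
```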

Visual-Language DFER

images/DFER_CLIP.png

The architecture of visual-language DFER. Figure is reproduced based on DFER-CLIP.
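
Instead of hand-written prompts, DFER-CLIP-style methods learn continuous context vectors that pass through a frozen CLIP text encoder together with each class name. The sketch below shows only the learnable-prompt parameterization; the frozen encoders are stand-in comments, and the dimensions are assumptions.

```python
# Learnable-prompt parameterization (CoOp-style sketch of the idea in
# DFER-CLIP; frozen CLIP encoders are represented by commented stand-ins).
import torch
import torch.nn as nn

class LearnablePrompts(nn.Module):
    def __init__(self, num_classes: int = 7, ctx_len: int = 8, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)      # shared context
        self.cls_emb = nn.Parameter(torch.randn(num_classes, 1, dim) * 0.02)

    def forward(self) -> torch.Tensor:
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)  # (C, ctx_len + 1, dim)

prompts = LearnablePrompts()()                        # only these params train
# text_feats = frozen_text_encoder(prompts)           # (C, dim), stand-in call
# logits = video_feats @ text_feats.t() / temperature
```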

Recent Advances of FER on Benchmark Datasets

Performance (WAR) of image-based SFER and video-based DFER methods on three in-the-lab datasets:

| Method | Year | Type | Backbone | MMI | CK+ | Oulu-CASIA |
|---|---|---|---|---|---|---|
| IL-VGG | 2018 | Static | VGG-16 | 74.68 | 91.64 | 84.58 |
| FMPN | 2019 | Static | CNNs | 82.74 | 98.60 | - |
| LDL-ALSG | 2020 | Static | ResNet-50 | 70.03 | 93.08 | 63.94 |
| IE-DBN | 2021 | Static | VGG-16 | - | 96.02 | 85.21 |
| im-cGAN | 2023 | Static | GAN | - | 98.10 | 93.34 |
| Mul-DML | 2024 | Static | ResNet-18 | 81.57 | 98.47 | - |
| STC-NLSTM | 2018 | Dynamic | 3DCNN | 84.53 | 99.80 | 93.45 |
| SAANet | 2020 | Dynamic | VGG-16 | - | 97.38 | 82.41 |
| MGLN | 2020 | Dynamic | VGG-16 | - | 98.77 | 90.40 |
| MSDmodel | 2021 | Dynamic | CNN | 89.99 | 99.10 | 87.33 |
| DPCNet | 2022 | Dynamic | CNN | - | 99.70 | - |
| STACM | 2023 | Dynamic | CNN | 82.71 | 99.08 | 91.25 |
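
Two metrics recur in these benchmark tables: WAR (weighted average recall, i.e., overall accuracy, which favors frequent classes) and UAR (unweighted average recall, the mean of per-class recalls, which treats all classes equally; reported below for the video-based DFER methods). A small self-contained sketch of both:

```python
# Computing WAR and UAR from predicted and ground-truth labels.
import numpy as np

def war_uar(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    classes = np.unique(y_true)
    recalls = [(y_pred[y_true == c] == c).mean() for c in classes]
    war = (y_true == y_pred).mean()      # weighted average recall = accuracy
    uar = float(np.mean(recalls))        # unweighted mean of per-class recalls
    return float(war), uar

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])
print(war_uar(y_true, y_pred))  # (0.666..., 0.555...) on this toy labeling
```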

Performance (WAR) of image-based SFER methods on three in-the-wild datasets:

| Task Challenges | Method | Year | Backbone | SFEW | RAF-DB | AffectNet |
|---|---|---|---|---|---|---|
| General SFER | IFSL | 2020 | VGG16 | 46.50 | 76.90 | - |
| | OAENet | 2021 | VGG16 | - | 86.50 | 58.70 |
| | MA-Net | 2021 | ResNet18 | - | 88.40 | 64.53 |
| | D³Net | 2021 | ResNet18 | 62.16 | 88.79 | - |
| | TransFER | 2021 | ResNet50 | - | 90.91 | 66.23 |
| | VTFF | 2023 | Transformer | - | 88.14 | 61.85 |
| | HASs | 2023 | ResNet50 | 65.14 | 91.04 | - |
| | APViT | 2023 | Transformer | 61.92 | 91.98 | 66.91 |
| | POSTER | 2023 | CNN-IR50 | - | 92.05 | 67.31 |
| | MGR³Net | 2024 | ResNet50 | - | 91.05 | 66.36 |
| Disturbance-invariant SFER | PG-Unit | 2018 | VGG16 | - | 83.27 | 55.33 |
| | IDFL | 2021 | ResNet50 | - | 86.96 | 59.20 |
| | FDRL | 2021 | ResNet18 | 62.16 | 89.47 | - |
| | AMP-Net | 2022 | ResNet50 | - | 88.06 | 63.23 |
| | PACVT | 2023 | ResNet18 | - | 88.21 | 60.68 |
| | IPD-FER | 2023 | ResNet18 | 58.43 | 88.89 | - |
| | Latent-OFER | 2023 | ResNet18 | - | 89.60 | - |
| | RAC+RSL | 2023 | ResNet18 | - | 89.77 | 62.16 |
| Uncertainty-aware SFER | SCN | 2020 | ResNet18 | - | 87.03 | 60.23 |
| | DMUE | 2021 | ResNet18 | 57.12 | 88.76 | 62.84 |
| | RUL | 2021 | ResNet18 | - | 88.98 | - |
| | EASE | 2022 | VGG16 | 60.12 | 89.56 | 61.82 |
| | EAC | 2022 | ResNet18 | - | 89.99 | 65.32 |
| | LA-Net | 2023 | ResNet18 | - | 91.56 | 64.54 |
| | LNSU-Net | 2024 | ResNet18 | - | 89.77 | 65.73 |
| Weak-supervised SFER | Ada-CM | 2022 | ResNet18 | 52.43 | 84.42 | 57.42 |
| | E2E-WS | 2022 | ResNet18 | 54.56 | 88.89 | 60.04 |
| | DR-FER | 2023 | ResNet50 | - | 90.53 | 66.85 |
| | WSCFER | 2023 | IResNet | - | 91.72 | 67.71 |
| Cross-modal SFER | CLEF | 2023 | CLIP | - | 90.09 | 65.66 |
| | VTA-Net | 2024 | ResNet-18 | - | 72.17 | - |
| | CEPrompt | 2024 | ViT-B/16 | - | 92.43 | 67.29 |

Performance (Accuracy) of 3D SFER methods on the BU-3DFE and Bosphorus datasets. Modality and accuracy cells are paired (e.g., a 2D+3D/3D method reports its 2D+3D accuracy followed by its 3D-only accuracy):

| Method | Year | Backbone | Modality | BU-3DFE | Bosphorus |
|---|---|---|---|---|---|
| JPE-GAN | 2018 | CNN | 2D/- | 81.20/- | -/- |
| DA-CNN | 2019 | ResNet50 | -/3D | -/87.69 | -/- |
| GAN-Int | 2021 | VGGNet16 | 2D+3D/3D | 88.47/83.20 | -/- |
| FFNet-M | 2021 | VGGNet16 | 2D+3D/3D | 89.82/87.28 | 87.65/82.86 |
| CMANet | 2022 | VGGNet16 | 2D+3D/3D | 90.24/84.03 | 89.36/81.25 |
| DrFER | 2024 | ResNet18 | -/3D | -/89.15 | -/86.77 |

Performance (WAR) of cross-domain SFER methods on four widely-used datasets:

| Method | Year | Backbone | Source Dataset | JAFFE | CK+ | FER-2013 | AffectNet |
|---|---|---|---|---|---|---|---|
| ECAN | 2022 | ResNet50 | RAF-DB | 57.28 | 79.77 | 56.46 | - |
| AGRA | 2022 | ResNet50 | RAF-DB | 61.50 | 85.27 | 58.95 | - |
| PASM | 2022 | VGGNet16 | RAF-DB | - | 79.65 | 54.78 | - |
| CWCST | 2023 | VGGNet16 | RAF-DB 2.0 | 69.01 | 89.64 | 57.44 | 52.66 |
| DMSRL | 2023 | VGGNet16 | RAF-DB 2.0 | 69.48 | 91.26 | 56.16 | 50.94 |
| CSRL | 2023 | ResNet18 | RAF-DB | 66.67 | 88.37 | 55.53 | - |

Performance (WAR/UAR) of video-based DFER methods on four widely-used datasets. TI: Time Interpolation; DS: Dynamic Sampling; GWS: Group-weighted Sampling. *: Tunable Param (M):

| Task Challenges | Method | Year | Sample Strategies | Backbone | Complexity (GFLOPs) | AFEW (WAR/UAR) | DFEW (WAR/UAR) | FERV39k (WAR/UAR) | MAFW (WAR/UAR) |
|---|---|---|---|---|---|---|---|---|---|
| General DFER | TFEN | 2021 | TI | ResNet-18 | - | - | 56.60/45.57 | - | - |
| | FormerDFER | 2021 | DS | Transformer | 9.1G | 50.92/47.42 | 65.70/53.69 | - | 43.27/31.16 |
| | EST | 2023 | DS | ResNet-18 | N/A | 54.26/49.57 | 65.85/53.94 | - | - |
| | LOGO-Former | 2023 | DS | ResNet-18 | 10.27G | - | 66.98/54.21 | 48.13/38.22 | - |
| | MSCM | 2023 | DS | ResNet-18 | 8.11G | 56.40/52.30 | 70.16/58.49 | - | - |
| | SFT | 2024 | DS | ResNet-18 | 17.52G | 55.00/50.14 | - | 47.80/35.16 | 47.44/33.39 |
| | CDGT | 2024 | DS | Transformer | 8.3G | 55.68/51.57 | 70.07/59.16 | 50.80/41.34 | - |
| | LSGTNet | 2024 | DS | ResNet-18 | - | - | 72.34/61.33 | 51.31/41.30 | - |
| Sampling-based DFER | EC-STFL | 2020 | TI | ResNet-18 | 8.32G | 53.26/- | 54.72/43.60 | - | - |
| | DPCNet | 2022 | GWS | ResNet-50 | 9.52G | 51.67/47.86 | 66.32/57.11 | - | - |
| | Freq-HD | 2023 | Freq-HD | ResNet-18 | - | - | 54.98/44.24 | 43.93/32.24 | - |
| | M3DFEL | 2023 | DS | R3D18 | 1.66G | - | 69.25/56.10 | 47.67/35.94 | - |
| Expression Intensity-aware DFER | CEFL-Net | 2022 | Clip-based | ResNet-18 | - | 53.98/- | 65.35/- | - | - |
| | NR-DFERnet | 2023 | DS | ResNet-18 | 6.33G | 53.54/48.37 | 68.19/54.21 | - | - |
| | GCA+IAL | 2023 | DS | ResNet-18 | 9.63G | - | 69.24/55.71 | 48.54/35.82 | - |
| Static to Dynamic FER | S2D | 2023 | DS | ViT-B/16 | - | - | 76.03/61.82 | 52.56/41.28 | 57.37/41.86 |
| | AEN | 2023 | DS | Transformer | - | 54.64/50.88 | 69.37/56.66 | 47.88/38.18 | - |
| Multi-modal DFER | T-ESFL | 2022 | DS | Transformer | - | - | - | - | 48.18/33.28 |
| | T-MEP | 2023 | DS | - | 6G | 52.96/50.22 | 68.85/57.16 | - | 52.85/39.37 |
| | OUS | 2024 | DS | CLIP | - | 52.96/50.22 | 68.85/57.16 | - | 52.85/39.37 |
| | MMA-DFER | 2024 | DS | Transformer | - | - | 77.51/67.01 | - | 58.52/44.11 |
| Self-supervised DFER | MAE-DFER | 2023 | DS | ResNet-18 | 50G | - | 74.43/63.41 | 52.07/43.12 | 54.31/41.62 |
| | HiCMAE | 2024 | DS | ResNet-18 | 32G | - | 73.10/61.92 | - | 54.84/42.10 |
| Visual-Language DFER | CLIPER | 2023 | DS | CLIP-ViT-B/16 | 88M* | 56.43/52.00 | 70.84/57.56 | 51.34/41.23 | - |
| | DFER-CLIP | 2023 | DS | CLIP-ViT-B/32 | 92G | - | 71.25/59.61 | 51.65/41.27 | 52.55/39.89 |
| | EmoCLIP | 2024 | DS | CLIP-ViT-B/32 | - | - | 62.12/58.04 | 36.18/31.41 | 41.46/34.24 |
| | A³lign-DFER | 2024 | DS | CLIP-ViT-L/14 | - | - | 74.20/64.09 | 51.77/41.87 | 53.22/42.07 |
| | UMBEnet | 2024 | DS | CLIP | - | - | 73.93/64.55 | 52.10/44.01 | 57.25/46.92 |
| | FineCLIPER | 2024 | DS | CLIP-ViT-B/16 | 20M* | - | 76.21/65.98 | 53.98/45.22 | 56.91/45.01 |

Citation

If you find our work useful, please cite our paper:

@article{wang_surveyfer_2024,
       title = {A Survey on Facial Expression Recognition of Static and Dynamic Emotions},
       author = {Wang, Yan and Yan, Shaoqi and Liu, Yang and Song, Wei and Liu, Jing and Chang, Yang and Mai, Xinji and Hu, Xiping and Zhang, Wenqiang and Gan, Zhongxue},
       journal = {arXiv preprint arXiv:2408.15777},
       year = {2024}
}