HTNet for micro-expression recognition

Official implementation of our paper:
HTNet for micro-expression recognition
Zhifeng Wang, Kaihao Zhang, Wenhan Luo, Ramesh Sankaranarayana [paper]

A Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement.

Facial expression is related to facial muscle contractions, and different muscle movements correspond to different emotional states. For micro-expression recognition, the muscle movements are usually subtle, which has a negative impact on the performance of current facial emotion recognition algorithms. Most existing methods use self-attention mechanisms to capture relationships between tokens in a sequence, but they do not take into account the inherent spatial relationships between facial landmarks. This can result in sub-optimal performance on micro-expression recognition tasks. Therefore, learning to recognize facial muscle movements is a key challenge in the area of micro-expression recognition. In this paper, we propose a Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement. HTNet includes two major components: a transformer layer that leverages local temporal features and an aggregation layer that extracts local and global semantic facial features. Specifically, HTNet divides the face into four facial areas: the left lip area, left eye area, right eye area and right lip area. The transformer layer focuses on representing local minor muscle movements with local self-attention in each area, while the aggregation layer learns the interactions between the eye areas and lip areas. Experiments on four publicly available micro-expression datasets show that the proposed approach outperforms previous methods by a large margin.

HTNet: overall architecture of the hierarchical transformer network for micro-expression recognition. Low-level self-attention in the transformer layers captures fine-grained features in local regions. High-level self-attention in the transformer layers captures coarse-grained features in global regions. An aggregation block is proposed to create interactions between different blocks at the same level.
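
To make the hierarchy concrete, here is a rough PyTorch sketch of the idea, not the repository's actual implementation: local self-attention runs inside each of the four facial regions, and an aggregation layer then lets the pooled region features interact. The input size (56×56), embedding width, head count, and quadrant-style region split are all illustrative assumptions.

```python
# Illustrative sketch of a hierarchical transformer over facial regions.
# Sizes and the region split are assumptions, not the paper's exact settings.
import torch
import torch.nn as nn


class LocalAttentionBlock(nn.Module):
    """Self-attention over the patch tokens of one facial region."""

    def __init__(self, dim, heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x):                       # x: (batch, tokens, dim)
        h = self.norm(x)
        x = x + self.attn(h, h, h)[0]           # local self-attention
        return x + self.mlp(self.norm(x))


class HTNetSketch(nn.Module):
    def __init__(self, dim=64, heads=4, num_classes=3):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=7, stride=7)  # 28x28 region -> 4x4 tokens
        self.local = nn.ModuleList([LocalAttentionBlock(dim, heads) for _ in range(4)])
        self.aggregate = LocalAttentionBlock(dim, heads)          # interactions across regions
        self.head = nn.Linear(dim, num_classes)

    def forward(self, face):  # face: (batch, 3, 56, 56), split into four 28x28 regions
        regions = [face[:, :, :28, :28], face[:, :, :28, 28:],   # left / right eye areas
                   face[:, :, 28:, :28], face[:, :, 28:, 28:]]   # left / right lip areas
        feats = []
        for region, block in zip(regions, self.local):
            tokens = self.embed(region).flatten(2).transpose(1, 2)  # (b, 16, dim)
            feats.append(block(tokens).mean(dim=1))                 # pooled region feature
        fused = self.aggregate(torch.stack(feats, dim=1))           # (b, 4, dim)
        return self.head(fused.mean(dim=1))


logits = HTNetSketch()(torch.randn(2, 3, 56, 56))  # -> shape (2, 3)
```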

The experiments are implemented on the SAMM [32], SMIC [33], CASME II [34] and CASME III [35] databases. SAMM, SMIC, and CASME II are merged into one composite dataset, and the labels shared by these three datasets are adopted for the micro-expression tasks. In these datasets, the “positive” emotion category includes the “happiness” emotion class, the “negative” emotion category includes the “sadness”, “disgust”, “contempt”, “fear” and “anger” emotion classes, and the “surprise” emotion category only includes the “surprise” class:
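
In code, this relabeling amounts to a simple lookup. A minimal sketch follows; the dictionary layout and function name are ours, only the class names come from the text above:

```python
# Hypothetical helper for the three-class composite mapping described above;
# not part of the repository's code.
EMOTION_TO_COMPOSITE = {
    "happiness": "positive",
    "sadness": "negative",
    "disgust": "negative",
    "contempt": "negative",
    "fear": "negative",
    "anger": "negative",
    "surprise": "surprise",
}

def composite_label(emotion: str) -> str:
    """Map an original emotion class to its composite category."""
    return EMOTION_TO_COMPOSITE[emotion.lower()]
```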

The Unweighted F1-score (UF1) and Unweighted Average Recall (UAR) performance of handcrafted methods, deep learning methods and our HTNet method under the LOSO protocol on the composite (Full), SMIC, CASME II and SAMM datasets. Bold text indicates the best result.
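
For reference, UF1 is the F1-score macro-averaged over classes, and UAR is the macro-averaged per-class recall; both weight every class equally regardless of its sample count. A minimal NumPy sketch of the two metrics, assuming integer labels in {0, ..., C-1}:

```python
# Minimal sketch of UF1 / UAR; label encoding is an assumption.
import numpy as np

def uf1_uar(y_true, y_pred, num_classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return np.mean(f1s), np.mean(recalls)  # (UF1, UAR)

uf1, uar = uf1_uar([0, 1, 2, 1], [0, 1, 1, 1], num_classes=3)
```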

We investigate the effect of the transformer layer's head count on accuracy on the composite datasets SMIC, SAMM and CASME II. The composite datasets' Unweighted F1-score (UF1) and Unweighted Average Recall (UAR) performance are reported.

Python code

Step 1) Please download the dataset and put it in /datasets

Step 2) Place the files in the following structure:

```
├─datasets
│  ├─three_norm_u_v_os
│  ├─combined_datasets_whole
├─main_HTNet.py
├─requirements.txt
```

Step 3) Installation of packages using pip

```
pip install -r requirements.txt
```

Step 4) Training and Evaluation

```
python main_HTNet.py --train True
```
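
As described above, evaluation follows the leave-one-subject-out (LOSO) protocol: each fold holds out all samples of one subject for testing and trains on the rest, and the metrics are computed over the pooled predictions. The toy sketch below illustrates the idea only; the data and helpers are dummies, not the actual API of main_HTNet.py, and it reuses the uf1_uar helper from the metric sketch above.

```python
# Toy LOSO loop with dummy data; illustrative only, not main_HTNet.py's API.
import random

def train_model(train_set):
    # Dummy "training": returns a model that always predicts class 0.
    return lambda sample: 0

# Three fake subjects, five (sample, label) pairs each.
subjects = {f"s{i:02d}": [(None, random.randrange(3)) for _ in range(5)]
            for i in range(3)}

all_true, all_pred = [], []
for held_out in subjects:  # one fold per subject
    train_set = [s for name, samples in subjects.items()
                 if name != held_out for s in samples]
    model = train_model(train_set)
    for sample, label in subjects[held_out]:
        all_true.append(label)
        all_pred.append(model(sample))

uf1, uar = uf1_uar(all_true, all_pred, num_classes=3)  # helper defined above
```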

Citation

If you find our work useful for your project, please consider citing the paper

@article{wang2024htnet,
  title={HTNet for micro-expression recognition},
  author={Wang, Zhifeng and Zhang, Kaihao and Luo, Wenhan and Sankaranarayana, Ramesh},
  journal={Neurocomputing},
  volume={602},
  pages={128196},
  year={2024},
  publisher={Elsevier}
}

Thank you for your interest and support.