awesome-body-language

This repo records and tracks Multi-modal Body Language research. In this work, we present the first detailed survey on Multi-modal Body Language research, covering 2 directions (Recognition and Generation) and 4 parts (Cued Speech, Co-speech, Sign Language, Talking Head).


A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation

arXiv, 2023
Li Liu · Lufei Gao · Wentao Lei · Fengji Ma · Xiaotian Lin
Jinting Wang



This repository records and tracks Multi-modal Body Language research as a supplement to our survey.
If you find any work missing or have any suggestions (papers, implementations, and other resources), please don't hesitate to open an issue or pull request, or just contact us by e-mail. We will check the problems and add the missing papers to this repo ASAP.

🔥News

[2023.8.17] The first draft is available on arXiv.

🔥Highlights!!

[1] We revisit and group the existing Body Language research from a Multi-modal perspective.

[2] We survey the research in 4 parts: Cued Speech, Co-speech, Sign Language, Talking Head.

[3] We survey the research in 2 directions: Recognition and Generation.

[4] Some new insights into these directions are discussed.

Introduction

We present the first detailed survey on Multi-modal Body Language research.


Summary of Contents

Paper-List

Cued Speech Recognition

Year Venue Acronym Paper Title Code/Project
2010 Speech Communication Heracleous et al. Cued speech automatic recognition in normal-hearing and deaf subjects N/A
2012 EUSIPCO Heracleous et al. Continuous phoneme recognition in Cued Speech for French N/A
2018 Interspeech Liu et al. Visual Recognition of Continuous Cued Speech Using a Tandem CNN-HMM Approach N/A
2020 IEEE Transactions on Multimedia Liu et al. Re-synchronization using the hand preceding model for multi-modal fusion in automatic continuous cued speech recognition N/A
2021 EUSIPCO Papadimitriou et al. A Fully Convolutional Sequence Learning Approach for Cued Speech Recognition from Videos N/A
2021 HCII Papadimitriou et al. Multimodal Fusion and Sequence Learning for Cued Speech Recognition from Videos N/A
2021 arXiv preprint Wang et al. Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition N/A
2021 arXiv preprint Wang et al. An Attention Self-supervised Contrastive Learning based Three-stage Model for Hand Shape Feature Representation in Cued Speech N/A
2022 ICASSP Sankar et al. Multistream Neural Architectures for Cued Speech Recognition Using a Pre-Trained Visual Feature Extractor and Constrained CTC Decoding N/A
2022 ISCSLP Liu et al. Objective Hand Complexity Comparison between Two Mandarin Chinese Cued Speech Systems N/A
2023 ICASSP Liu et al. Cross-Modal Mutual Learning for Cued Speech Recognition N/A

Co-speech Recognition

Year Venue Acronym Paper Title Code/Project
2014 MA3HMI Bhattacharya et al. Disposition Recognition from Spontaneous Speech Towards a Combination with Co-speech Gestures N/A
2021 ACM MM Bhattacharya et al. Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning N/A

Sign Language Recognition

Year Venue Acronym Paper Title Code/Project
2019 ICIP Zhang et al. Continuous Sign Language Recognition via Reinforcement Learning N/A
2020 ECAI Zhou et al. Self-Attention-based Fully-Inception Networks for Continuous Sign Language Recognition N/A
2020 ICASSP Li et al. Key Action and Joint CTC-Attention based Sign Language Recognition N/A
2020 ECCV Cheng et al. Fully Convolutional Networks for Continuous Sign Language Recognition N/A
2020 ECCV Niu et al. Stochastic Fine-Grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition N/A
2021 ICPR Koishybay et al. Continuous Sign Language Recognition with Iterative Spatiotemporal Fine-tuning N/A
2022 CVPR Zuo et al. C2SLR: Consistency-Enhanced Continuous Sign Language Recognition N/A
2022 IEEE Transactions on Multimedia Zhou et al. Spatial-Temporal Multi-Cue Network for Sign Language Recognition and Translation N/A
2022 NeurIPS Chen et al. Two-Stream Network for Sign Language Recognition and Translation N/A
2023 CVPR Zheng et al. CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition With Variational Alignment Code
2023 TPAMI Bilge et al. Towards Zero-Shot Sign Language Recognition N/A
2023 AAAI Hu et al. Self-Emphasizing Network for Continuous Sign Language Recognition Code
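Several of the recognition works listed above (e.g., the joint CTC-Attention and constrained CTC decoding papers) build on CTC-style sequence decoding. As a minimal, generic illustration only (not the method of any specific paper above), a greedy CTC decoder takes the most likely label per frame, collapses consecutive repeats, and removes the blank token:

```python
import numpy as np

def ctc_greedy_decode(logits, blank=0):
    """Greedy (best-path) CTC decoding.

    logits: (T, C) array of per-frame class scores.
    Collapses consecutive repeated labels, then drops blanks.
    """
    best_path = np.argmax(logits, axis=-1)  # most likely label per frame
    decoded, prev = [], None
    for token in best_path:
        if token != prev and token != blank:  # collapse repeats, skip blanks
            decoded.append(int(token))
        prev = token
    return decoded

# Frame-wise predictions 1,1,blank,2,2,2,blank,1 decode to the gloss sequence [1, 2, 1]
frames = np.eye(3)[[1, 1, 0, 2, 2, 2, 0, 1]]
print(ctc_greedy_decode(frames))  # → [1, 2, 1]
```

In the continuous sign language recognition setting, each class index would correspond to a sign gloss; the surveyed papers refine this basic scheme with attention, iterative fine-tuning, or constrained decoding.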

Cued Speech Generation

Year Venue Acronym Paper Title Code/Project
1998 ISCA Paul et al. Automatic Generation of Cued Speech for The Deaf: Status and Outlook N/A
2008 AVSP Gérard et al. Retargeting cued speech hand gestures for different talking heads and speakers N/A

Co-speech Generation

Year Venue Acronym Paper Title Code/Project
2015 IVA DCNF Predicting co-verbal gestures: A deep and temporal modeling approach N/A
2019 CVPR S2G Learning individual styles of conversational gesture Code
2020 EUROGRAPHICS StyleGestures Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows Code
2021 ICCV A2G Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders Code
2021 IEEE VR Text2Gestures Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents Code
2022 Computer Graphics Forum ZeroEGGS ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech Code
2022 CVPR DiffGAN Low-Resource Adaptation for Personalized Co-Speech Gesture Generation N/A
2022 SIGGRAPH Asia RG Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings Code

Sign Language Generation

Year Venue Acronym Paper Title Code/Project
2016 Universal Access in the Information Society Sign3D Interactive editing in French Sign Language dedicated to virtual signers: requirements and challenges N/A
2018 AAAI HLSTM Hierarchical LSTM for Sign Language Translation N/A
2020 IJCV Text2Sign Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks N/A
2020 CVPR ESN Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video N/A
2020 BMVC Saunders et al. Adversarial Training for Multi-Channel Sign Language Production N/A
2022 ACL DSM Modeling Intensification for Sign Language Generation: A Computational Approach Code
2022 CVPR SignGAN Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production N/A
2023 CVPR PoseVQ-Diffusion Vector Quantized Diffusion Model with CodeUnet for Text-to-Sign Pose Sequences Generation Code

Talking Head Generation

Year Venue Acronym Paper Title Code/Project
2018 ECCV X2Face X2Face: A network for controlling face generation using images, audio, and pose codes N/A
2018 ECCV Chen et al. Lip Movements Generation at a Glance Code
2019 NeurIPS Wen et al. Face Reconstruction from Voice using Generative Adversarial Networks Code
2019 CVPR Speech2Face Speech2Face: Learning the Face Behind a Voice N/A
2019 ICASSP Wav2Pix WAV2PIX: Speech-conditioned Face Generation using Generative Adversarial Networks Code
2019 IJCV Jamaludin et al. You Said That?: Synthesising Talking Faces from Audio N/A
2019 IJCAI Song et al. Talking Face Generation by Conditional Recurrent Adversarial Network N/A
2019 AAAI Zhou et al. Talking face generation by adversarially disentangled audio-visual representation N/A
2019 CVPR Kefalas et al. End-to-End Speech-Driven Realistic Facial Animation with Temporal GANs N/A
2020 ICASSP Kefalas et al. Speech-Driven Facial Animation Using Polynomial Fusion of Features N/A
2020 ICASSP Eskimez et al. End-To-End Generation of Talking Faces from Noisy Speech N/A
2020 IJCNN Sinha et al. Identity-Preserving Realistic Talking Face Generation N/A
2020 INTERSPEECH Wang et al. Speech Driven Talking Head Generation via Attentional Landmarks Based Representation N/A
2020 ACM MM Wav2Lip A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild N/A
2020 arXiv preprint Yi et al. Audio-driven talking face video generation with learning-based personalized head pose N/A
2020 ECCV Chen et al. Talking-Head Generation with Rhythmic Head Motion Code
2020 WACV Mittal et al. Animating Face using Disentangled Audio Representations N/A
2020 ECCV MEAD MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation Code
2020 TVCG Wen et al. Photorealistic Audio-driven Video Portraits Code
2021 CVPR LipSync3D LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces From Video Using Pose and Lighting Normalization N/A
2021 The Visual Computer Fang et al. Facial expression GAN for voice-driven face generation N/A
2021 IJCAI Zhu et al. Arbitrary talking face generation via attentional audio-visual coherence learning N/A
2021 IJCAI Audio2Head Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion N/A
2021 ACM TOG Lu et al. Live speech portraits: real-time photorealistic talking-head animation N/A
2021 ICCV FACIAL FACIAL: Synthesizing Dynamic Talking Face With Implicit Attribute Learning N/A
2021 ICCV AD-NeRF AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis Code
2021 CVPR HDTF Flow-Guided One-Shot Talking Face Generation With a High-Resolution Audio-Visual Dataset Code
2021 arXiv preprint Si et al. Speech2Video: Cross-Modal Distillation for Speech to Video Generation N/A
2021 arXiv preprint Chen et al. Talking Head Generation with Audio and Speech Related Facial Action Units N/A
2021 CVPR PC-AVS Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation Code
2022 CVPR GC-VAT Expressive Talking Head Generation With Granular Audio-Visual Control N/A
2022 AAAI Wang et al. One-Shot Talking Face Generation from Single-Speaker Audio-Visual Correlation Learning N/A
2022 ACM SIGGRAPH EAMM EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model N/A
2022 arXiv preprint SPACE SPACE: Speech-driven Portrait Animation with Controllable Expression N/A
2022 arXiv preprint DFA-NeRF DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering N/A
2022 arXiv preprint Yu et al. Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors N/A
2022 ACCESS Bigioi et al. Pose-Aware Speech Driven Facial Landmark Animation Pipeline for Automated Dubbing N/A
2022 ECCV DFRF Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis Code
2022 ECCV SSP-NeRF Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation Code
2023 arXiv preprint DIRFA Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations N/A
2023 ICASSP DisCoHead DisCoHead: Audio-and-Video-Driven Talking Head Generation by Disentangled Control of Head Pose and Facial Expressions N/A
2023 ICASSP OPT OPT: One-shot Pose-Controllable Talking Head Generation N/A
2023 ICASSP Zhu et al. Audio-Driven Talking Head Video Generation with Diffusion Model N/A
2023 CVPR Wang et al. Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis N/A
2023 ICPADS Zhang et al. Talking Head Generation for Media Interaction System with Feature Disentanglement N/A
2023 CVPR SadTalker SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation Code
2023 CVPR DiffTalk DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation Code
2023 CoRR Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator N/A

Challenges

Year Task Language Name Link
2021 Sign Language Recognition English ChaLearn Looking at People link
2022 Sign Language Recognition, Translation & Production English SLRTP link
2023 Sign Language Recognition English Google - Isolated Sign Language Recognition link
2023 Sign Language Recognition Multiple WMT-SLT 23 link
2018 Lip Reading Recognition Japanese SSSD link
2022 Talking Head Generation English ViCo2022 link
2023 Talking Head Generation English ViCo2023 link
2020 Co-speech Generation English GENEA Challenge 2020 link
2022 Co-speech Generation English GENEA Challenge 2022 link
2023 Co-speech Generation English GENEA Challenge 2023 link

Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{liu2023blsurvey,
  title={A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation},
  author={Liu, Li and Gao, Lufei and Lei, Wentao and Ma, Fengji and Lin, Xiaotian and Wang, Jinting},
  journal={arXiv:2308.08849},
  year={2023}
}

Contact

avrillliu@hkust-gz.edu.cn
wlei117@connect.hkust-gz.edu.cn