Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert (INTERSPEECH 2024)

This repository contains the official implementation of the INTERSPEECH 2024 paper, "Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert".

Getting Started

Installation

This code was developed on Ubuntu 18.04 with Python 3.8, CUDA 11.3, and PyTorch 1.10.0.

Clone this repo:

git clone https://github.com/postech-ami/3d-talking-head-av-guidance
cd 3d-talking-head-av-guidance

Create a virtual environment:

conda create --name av_guidance python=3.8 -y
conda activate av_guidance
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install pytorch-lightning==1.5.10
pip install hydra-core --upgrade
conda install -c conda-forge ffmpeg
pip install -r requirements.txt 

Compile and install the psbody-mesh package (MPI-IS/mesh):

BOOST_INCLUDE_DIRS=/usr/lib/x86_64-linux-gnu make all
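
If the installation succeeded, a quick sanity check such as the following should run without errors (this is just a convenience sketch, not part of the original setup instructions):

# Quick environment sanity check; verifies the core dependencies installed above.
import torch
import pytorch_lightning as pl
from psbody.mesh import Mesh  # confirms the MPI-IS/mesh build is importable

print("PyTorch:", torch.__version__)            # expected 1.10.0
print("CUDA available:", torch.cuda.is_available())
print("PyTorch Lightning:", pl.__version__)     # expected 1.5.10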

Lip reading expert

For your convenience, download the model weight here and fill in the lipreader_path entry of the configuration with the path to the model.
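
As a minimal sanity check, the weight should load with torch.load before you point lipreader_path at it. The path below is a placeholder, and the checkpoint layout (a raw state_dict vs. a wrapping dict) is an assumption, so inspect the keys if it differs.

# Minimal sketch: confirm the lip reading expert checkpoint loads.
# The path below is a placeholder for wherever you saved the weight.
import torch

ckpt = torch.load("path/to/lipreader_checkpoint.pth", map_location="cpu")
# Depending on how the weight was exported, it may be a raw state_dict or a
# dict that wraps one; print the top-level keys to see which.
keys = ckpt.keys() if isinstance(ckpt, dict) else []
print(list(keys)[:10])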

Clone the Auto-AVSR repository into this directory and update the import lines in the files listed below by prefixing them with auto_avsr, i.e., auto_avsr.[existing_imports]; a helper script for this rewrite is sketched after the file list. For instance,

# BEFORE
from espnet.nets.pytorch_backend.e2e_asr_conformer_av import E2E
# AFTER
from auto_avsr.espnet.nets.pytorch_backend.e2e_asr_conformer_av import E2E
List of files to update:
espnet/nets/pytorch_backend/backbones/modules/resnet.py
espnet/nets/pytorch_backend/backbones/modules/resnet1d.py

espnet/nets/pytorch_backend/backbones/conv1d_extractor.py
espnet/nets/pytorch_backend/backbones/conv3d_extractor.py

espnet/nets/pytorch_backend/transformer/add_sos_eos.py
espnet/nets/pytorch_backend/transformer/decoder.py
espnet/nets/pytorch_backend/transformer/decoder_layer.py
espnet/nets/pytorch_backend/transformer/encoder_layer.py
espnet/nets/pytorch_backend/transformer/encoder.py

espnet/nets/pytorch_backend/ctc.py
espnet/nets/pytorch_backend/e2e_asr_conformer_av.py
espnet/nets/pytorch_backend/e2e_asr_conformer.py
espnet/nets/pytorch_backend/nets_utils.py

espnet/nets/scorers/ctc.py
espnet/nets/scorers/length_bonus.py

espnet/nets/batch_beam_search.py
espnet/nets/beam_search.py

lightning_av.py
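
The rewrite above can also be scripted. The sketch below is a rough helper that prefixes simple "from espnet..." / "import espnet..." lines, assuming the files keep their original relative paths under auto_avsr/; review the changes it makes before relying on them.

# Sketch of a helper that rewrites `espnet.` imports to `auto_avsr.espnet.`
# in the files listed above. It only touches simple `from espnet...` /
# `import espnet...` lines; inspect the resulting diff before committing.
import re
from pathlib import Path

FILES = [
    "espnet/nets/pytorch_backend/backbones/modules/resnet.py",
    # ... add the remaining files from the list above ...
    "lightning_av.py",
]

def rewrite(path: Path) -> None:
    text = path.read_text()
    text = re.sub(r"^(\s*)from espnet\.", r"\1from auto_avsr.espnet.", text, flags=re.M)
    text = re.sub(r"^(\s*)import espnet\.", r"\1import auto_avsr.espnet.", text, flags=re.M)
    path.write_text(text)

for rel in FILES:
    rewrite(Path("auto_avsr") / rel)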

Datasets

VOCASET

Request the VOCASET data from https://voca.is.tue.mpg.de/. Place the downloaded files data_verts.npy, raw_audio_fixed.pkl, templates.pkl, and subj_seq_to_idx.pkl in the vocaset/ folder. Download "FLAME_sample.ply" from VOCA and put it in vocaset/. Read the vertices/audio data and convert them to .npy/.wav files stored in the vocaset/vertices_npy and vocaset/wav folders; a rough sketch of the audio conversion is shown below.
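
As an illustration of the audio half of that conversion, the sketch below writes one .wav file per sequence. The pickle nesting (subject -> sentence -> {'audio', 'sample_rate'}) and the output file naming are assumptions; inspect raw_audio_fixed.pkl and adapt the names to match the data loader.

# Rough sketch of converting raw_audio_fixed.pkl into per-sequence .wav files.
# The nesting (subject -> sentence -> {'audio', 'sample_rate'}) and the output
# file naming are assumptions; check the pickle contents before running.
import os
import pickle
import numpy as np
from scipy.io import wavfile

os.makedirs("vocaset/wav", exist_ok=True)
with open("vocaset/raw_audio_fixed.pkl", "rb") as f:
    audio_data = pickle.load(f, encoding="latin1")

for subject, sentences in audio_data.items():
    for sentence, entry in sentences.items():
        samples = np.asarray(entry["audio"], dtype=np.float32)
        wavfile.write(f"vocaset/wav/{subject}_{sentence}.wav",
                      int(entry["sample_rate"]), samples)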

Download the FLAME model and fill in the obj_filename entry in config/vocaset.yaml with the path to head_template.obj.

BIWI

Follow the instructions of CodeTalker to preprocess the BIWI dataset: put the resulting .npy/.wav files into BIWI/vertices_npy and BIWI/wav, and templates.pkl into the BIWI/ folder.

To get the vertex indices of the lip region, download the indices list and place it at BIWI/lve.txt; a sketch of how these indices are typically used is shown below.
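
For reference, these indices are commonly used to compute a lip vertex error (LVE) over the lip region. The sketch below assumes whitespace-separated integer indices in lve.txt and (T, V, 3) vertex arrays; adjust the reshaping if your vertices are stored flattened.

# Sketch of how the lip-region indices in BIWI/lve.txt are commonly used to
# compute a lip vertex error (LVE). Assumes integer indices in lve.txt and
# (T, V, 3) vertex sequences for prediction and ground truth.
import numpy as np

lip_idx = np.loadtxt("BIWI/lve.txt", dtype=int)

def lip_vertex_error(pred, gt):
    # pred, gt: (T, V, 3) vertex sequences aligned frame by frame.
    diff = pred[:, lip_idx, :] - gt[:, lip_idx, :]       # lip-region differences
    sq_dist = np.sum(diff ** 2, axis=-1)                 # squared L2 per lip vertex
    return float(np.mean(np.max(sq_dist, axis=1)))       # max over lip vertices, mean over frames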

2024.08.24 | Unfortunately, the BIWI dataset is no longer available.

Training and Testing on VOCASET

  • To train the model on VOCASET, run:
python main.py --dataset vocaset

The trained models will be saved to outputs/model.

  • To test the model on VOCASET, run:
python test.py --dataset vocaset --test_model_path [path_of_model_weight]

The results will be saved to outputs/pred. You can download the pretrained model from faceformer_avguidance_vocaset.pth.

  • To visualize the results, run:
python render.py --dataset vocaset

The results will be saved to outputs/video.

Training and Testing on BIWI

  • To train the model on BIWI, run:
python main.py --dataset BIWI

The trained models will be saved to outputs/model.

  • To test the model on BIWI, run:
python test.py --dataset BIWI --test_model_path [path_of_model_weight]

The results will be saved to outputs/pred. You can download the pretrained model from faceformer_avguidance_biwi.pth.

  • To visualize the results, run:
python render.py --dataset BIWI

The results will be saved to outputs/video.

Citation

If you find this code useful for your work, please consider citing:

@inproceedings{eungi24_interspeech,
  title     = {Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert},
  author    = {Han EunGi and Oh Hyun-Bin and Kim Sung-Bin and Corentin {Nivelet Etcheberry} and Suekyeong Nam and Janghoon Ju and Tae-Hyun Oh},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {2940--2944},
  doi       = {10.21437/Interspeech.2024-1595},
  issn      = {2958-1796},
}

Acknowledgement

Our code heavily borrows from FaceFormer. We sincerely appreciate its authors.