Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert (INTERSPEECH 24)
This repository contains the official implementation of the INTERSPEECH 2024 paper, "Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert".
This code was developed on Ubuntu 18.04 with Python 3.8, CUDA 11.3, and PyTorch 1.10.0.
Clone this repo:
git clone https://github.com/postech-ami/3d-talking-head-av-guidance
cd 3d-talking-head-av-guidance
Make a virtual environment:
conda create --name av_guidance python=3.8 -y
conda activate av_guidance
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install pytorch-lightning==1.5.10
pip install hydra-core --upgrade
conda install -c conda-forge ffmpeg
pip install -r requirements.txt
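As a quick sanity check (not part of the original instructions), the following snippet confirms that the pinned versions and the CUDA build are picked up correctly:

```python
# Quick environment check for the versions pinned above.
import torch
import pytorch_lightning as pl

print("PyTorch:", torch.__version__)           # expect 1.10.0
print("PyTorch Lightning:", pl.__version__)    # expect 1.5.10
print("CUDA build:", torch.version.cuda)       # expect 11.3
print("CUDA available:", torch.cuda.is_available())
```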
Compile and install the psbody-mesh package (MPI-IS/mesh):
BOOST_INCLUDE_DIRS=/usr/lib/x86_64-linux-gnu make all
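If the build succeeded, the package should import cleanly; a minimal check (the .ply path below is only an example, any mesh file works):

```python
# Minimal sanity check for the psbody-mesh installation.
# "vocaset/FLAME_sample.ply" is only an example path; any .ply file works.
from psbody.mesh import Mesh

mesh = Mesh(filename="vocaset/FLAME_sample.ply")
print("vertices:", mesh.v.shape, "faces:", mesh.f.shape)
```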
For your convenience, download the model weight here, and fill in the configuration `lipreader_path` with the path to the model.
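A quick way to confirm the downloaded weight loads correctly (a sketch; the file path below is a placeholder for wherever you saved the checkpoint):

```python
# Sketch: inspect the downloaded lip reading checkpoint.
import torch

ckpt = torch.load("checkpoints/lipreader.pth", map_location="cpu")  # placeholder path
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys())[:5])
```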
Clone the Auto-AVSR repository into this directory and update the import lines of the files below so that they are prefixed with `auto_avsr` (i.e., `auto_avsr.[existing_imports]`). For instance,
# BEFORE
from espnet.nets.pytorch_backend.e2e_asr_conformer_av import E2E
# AFTER
from auto_avsr.espnet.nets.pytorch_backend.e2e_asr_conformer_av import E2E
List of files to update (a sketch that automates this rewrite follows the list):
espnet/nets/pytorch_backend/backbones/modules/resnet.py
espnet/nets/pytorch_backend/backbones/modules/resnet1d.py
espnet/nets/pytorch_backend/backbones/conv1d_extractor.py
espnet/nets/pytorch_backend/backbones/conv3d_extractor.py
espnet/nets/pytorch_backend/transformer/add_sos_eos.py
espnet/nets/pytorch_backend/transformer/decoder.py
espnet/nets/pytorch_backend/transformer/decoder_layer.py
espnet/nets/pytorch_backend/transformer/encoder_layer.py
espnet/nets/pytorch_backend/transformer/encoder.py
espnet/nets/pytorch_backend/ctc.py
espnet/nets/pytorch_backend/e2e_asr_conformer_av.py
espnet/nets/pytorch_backend/e2e_asr_conformer.py
espnet/nets/pytorch_backend/nets_utils.py
espnet/nets/scorers/ctc.py
espnet/nets/scorers/length_bonus.py
espnet/nets/batch_beam_search.py
espnet/nets/beam_search.py
lightning_av.py
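Rather than editing each file by hand, the rewrite can be scripted; below is a rough sketch (not part of the original release) that prefixes `espnet` imports across the cloned repository, assuming it was cloned into `auto_avsr/`. It edits files in place, so keep a backup or a clean git state:

```python
# Sketch: prefix "espnet" imports with "auto_avsr." across the cloned repo.
# This rewrites files in place; keep a backup or a clean git state.
import re
from pathlib import Path

root = Path("auto_avsr")  # assumed clone location
pattern = re.compile(r"^(\s*)(from|import)\s+espnet(\.|\s)", re.MULTILINE)

for path in root.rglob("*.py"):
    text = path.read_text()
    new_text = pattern.sub(
        lambda m: f"{m.group(1)}{m.group(2)} auto_avsr.espnet{m.group(3)}", text
    )
    if new_text != text:
        path.write_text(new_text)
        print("updated", path)
```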
Request the VOCASET data from https://voca.is.tue.mpg.de/. Place the downloaded files `data_verts.npy`, `raw_audio_fixed.pkl`, `templates.pkl`, and `subj_seq_to_idx.pkl` in the `vocaset/` folder. Download "FLAME_sample.ply" from VOCA and put it in `vocaset/`. Read the vertices/audio data and convert them to .npy/.wav files stored in the `vocaset/vertices_npy` and `vocaset/wav` folders using a script.
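The conversion script is not reproduced here, but a rough sketch of what it does is shown below. The pickle field names (`audio`, `sample_rate`), the frame-index layout of `subj_seq_to_idx.pkl`, and the `{subject}_{sentence}` file naming are assumptions about the standard VOCASET release; adapt them if your copy differs.

```python
# Rough sketch of the VOCASET conversion step. Field names such as
# "audio" and "sample_rate" are assumptions about the VOCASET pickles.
import os
import pickle
import numpy as np
from scipy.io import wavfile

data_verts = np.load("vocaset/data_verts.npy", mmap_mode="r")
with open("vocaset/subj_seq_to_idx.pkl", "rb") as f:
    subj_seq_to_idx = pickle.load(f)
with open("vocaset/raw_audio_fixed.pkl", "rb") as f:
    raw_audio = pickle.load(f, encoding="latin1")

os.makedirs("vocaset/vertices_npy", exist_ok=True)
os.makedirs("vocaset/wav", exist_ok=True)

for subject, sentences in subj_seq_to_idx.items():
    for sentence, frame_map in sentences.items():
        name = f"{subject}_{sentence}"  # assumed naming convention
        # Gather the per-frame vertices for this sequence and flatten each frame.
        idx = [frame_map[k] for k in sorted(frame_map.keys())]
        verts = data_verts[idx].reshape(len(idx), -1)
        np.save(f"vocaset/vertices_npy/{name}.npy", verts)
        # Write the matching raw audio; dtype is passed through unchanged.
        clip = raw_audio[subject][sentence]
        wavfile.write(f"vocaset/wav/{name}.wav", clip["sample_rate"], clip["audio"])
```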
Download the FLAME model and fill in the configuration `obj_filename` in `config/vocaset.yaml` with the path to `head_template.obj`.
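Since both `lipreader_path` and `obj_filename` are read from the Hydra config, a quick existence check can catch path typos early (a sketch; it assumes the keys sit at the top level of `config/vocaset.yaml`):

```python
# Sketch: verify the configured paths exist before training.
import os
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/vocaset.yaml")
for key in ("lipreader_path", "obj_filename"):   # assumed to be top-level keys
    path = cfg.get(key)
    print(f"{key}: {path} (exists: {path is not None and os.path.exists(str(path))})")
```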
Follow the instructions of CodeTalker to preprocess the BIWI dataset, put the .npy/.wav files into `BIWI/vertices_npy` and `BIWI/wav`, and put `templates.pkl` into the `BIWI/` folder.
To get the vertex indices of the lip region, download the indices list and place it at `BIWI/lve.txt`.
2024.08.24 | Unfortunately, the BIWI dataset is no longer available.
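For reference, the indices in `BIWI/lve.txt` select the lip-region vertices; a sketch of how they would typically be used, assuming the file is a plain whitespace-separated list of vertex indices (the prediction file name is a placeholder):

```python
# Sketch: load lip-region vertex indices and slice them from one frame of vertices.
import numpy as np

lip_idx = np.loadtxt("BIWI/lve.txt", dtype=int)                 # assumed: whitespace-separated indices
verts = np.load("outputs/pred/example.npy")[0].reshape(-1, 3)   # "example.npy" is a placeholder
print("lip vertices:", verts[lip_idx].shape)
```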
- To train the model on VOCASET, run:
python main.py --dataset vocaset
The trained models will be saved to `outputs/model`.
- To test the model on VOCASET, run:
python test.py --dataset vocaset --test_model_path [path_of_model_weight]
The results will be saved to `outputs/pred`. You can download the pretrained model from faceformer_avguidance_vocaset.pth.
- To visualize the results, run:
python render.py --dataset vocaset
The results will be saved to `outputs/video`.
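To sanity-check a finished run, you can inspect a saved prediction directly; a minimal sketch, assuming each file in `outputs/pred` stores a per-sequence array of flattened vertices of shape (frames, 5023 * 3) as in FaceFormer (the file name is a placeholder):

```python
# Sketch: inspect one predicted VOCASET sequence.
import numpy as np

pred = np.load("outputs/pred/some_sequence.npy")   # placeholder file name
print("shape:", pred.shape)                        # expected roughly (num_frames, 5023 * 3)
print("frame 0 vertex range:", pred[0].min(), pred[0].max())
```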
- To train the model on BIWI, run:
python main.py --dataset BIWI
The trained models will be saved to `outputs/model`.
- To test the model on BIWI, run:
python test.py --dataset BIWI --test_model_path [path_of_model_weight]
The results will be saved to `outputs/pred`. You can download the pretrained model from faceformer_avguidance_biwi.pth.
- To visualize the results, run:
python render.py --dataset BIWI
The results will be saved to `outputs/video`.
If you find this code useful for your work, please consider citing:
@inproceedings{eungi24_interspeech,
title = {Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert},
author = {Han EunGi and Oh Hyun-Bin and Kim Sung-Bin and Corentin {Nivelet Etcheberry} and Suekyeong Nam and Janghoon Ju and Tae-Hyun Oh},
year = {2024},
booktitle = {Interspeech 2024},
pages = {2940--2944},
doi = {10.21437/Interspeech.2024-1595},
issn = {2958-1796},
}
Our code heavily borrows from FaceFormer. We sincerely appreciate the authors.