
SAiD: Blendshape-based Audio-Driven Speech Animation with Diffusion

This is the code for SAiD: Blendshape-based Audio-Driven Speech Animation with Diffusion.

Installation

Run the following command to install it as a pip module:

pip install .

If you are developing this repo or want to run the scripts, run instead:

pip install -e .[dev]

If there is an error related to pyrender, install additional packages as follows:

apt-get install libboost-dev libglfw3-dev libgles2-mesa-dev freeglut3-dev libosmesa6-dev libgl1-mesa-glx

Directories

  • data: Data used for preprocessing and training.
  • model: The VAE weights used for evaluation.
  • blender-addon: A Blender add-on that visualizes the blendshape coefficients.
  • script: Python scripts for preprocessing, training, inference, and evaluation.
  • static: Resources for the project page.

Inference

You can download the pretrained SAiD weights from the Hugging Face repo.

python script/inference.py \
        --weights_path "<SAiD_weights>.pth" \
        --audio_path "<input_audio>.wav" \
        --output_path "<output_coeffs>.csv" \
        [--init_sample_path "<input_init_sample>.csv"] \  # Required for editing
        [--mask_path "<input_mask>.csv"]  # Required for editing
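
The resulting CSV stores one row of blendshape coefficients per animation frame, with one column per blendshape name. A minimal way to inspect it, assuming pandas (the file name below is a placeholder):

    import pandas as pd

    # Load the generated coefficients (placeholder file name).
    coeffs = pd.read_csv("output_coeffs.csv")

    print(coeffs.shape)          # (num_frames, num_blendshapes)
    print(list(coeffs.columns))  # blendshape names, e.g. 'jawForward', ...
    print(coeffs.iloc[0])        # coefficients of the first frame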

BlendVOCA

Construct Blendshape Facial Model

Due to the VOCASET license, we cannot distribute BlendVOCA directly. Instead, you can preprocess data/blendshape_residuals.pickle after constructing the BlendVOCA directory as follows for simple execution of the script.

├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   └─ templates
      ├─ ...
      └─ FaceTalk_170915_00223_TA.ply
  • templates: Download the template meshes from VOCASET.
python script/preprocess_blendvoca.py \
        --blendshapes_out_dir "<output_blendshapes_dir>"
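
Conceptually, each blendshape mesh produced by the script is the subject's neutral template offset by the corresponding residual in data/blendshape_residuals.pickle. The sketch below only illustrates that relation; it assumes trimesh, uses placeholder paths, and glosses over the head cropping and vertex bookkeeping that the script performs:

    import pickle

    import numpy as np
    import trimesh

    with open("data/blendshape_residuals.pickle", "rb") as f:
        residuals = pickle.load(f)  # {subject: {blendshape_name: (V, 3) residual}}

    subject = "FaceTalk_170915_00223_TA"
    # Placeholder: a neutral head mesh whose vertices line up with the residuals.
    neutral = trimesh.load("neutral_head.obj", process=False)

    for name, delta in residuals[subject].items():
        # blendshape mesh = neutral template + per-vertex residual
        mesh = trimesh.Trimesh(
            vertices=np.asarray(neutral.vertices) + delta,
            faces=neutral.faces,
            process=False,
        )
        mesh.export(f"{name}.obj")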

If you want to generate the blendshapes yourself, follow the instructions below.

  1. Unzip data/ARKit_reference_blendshapes.zip.

  2. Download the template meshes from VOCASET.

  3. Crop the template meshes using data/FLAME_head_idx.txt. You may crop additional vertices and restore them after finishing the construction process.

  4. Use Deformation-Transfer-for-Triangle-Meshes to construct the blendshape meshes.

    • Use data/ARKit_landmarks.txt and data/FLAME_head_landmarks.txt as marker vertices.
    • Find the correspondence map between neutral meshes, and use it to transfer the deformation of arbitrary meshes.
  5. Create blendshape_residuals.pickle, which contains the blendshape residuals in the following Python dictionary format. Refer to data/blendshape_residuals.pickle.

    {
        'FaceTalk_170731_00024_TA': {
            'jawForward': <np.ndarray object with shape (V, 3)>,
            ...
        },
        ...
    }
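
A minimal sketch of building this pickle, assuming trimesh and a hypothetical layout in which each subject has a folder of constructed blendshape .obj files next to its cropped neutral template; each residual is simply the blendshape vertices minus the neutral vertices:

    import pickle
    from pathlib import Path

    import numpy as np
    import trimesh

    # Hypothetical locations of the constructed meshes.
    blendshapes_dir = Path("constructed_blendshapes")  # <subject>/<blendshape>.obj
    templates_dir = Path("cropped_templates")          # <subject>.obj

    residuals = {}
    for subject_dir in sorted(p for p in blendshapes_dir.iterdir() if p.is_dir()):
        subject = subject_dir.name
        neutral = trimesh.load(str(templates_dir / f"{subject}.obj"), process=False)
        neutral_v = np.asarray(neutral.vertices)
        residuals[subject] = {
            path.stem: np.asarray(trimesh.load(str(path), process=False).vertices)
            - neutral_v
            for path in sorted(subject_dir.glob("*.obj"))
        }

    with open("blendshape_residuals.pickle", "wb") as f:
        pickle.dump(residuals, f)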
    

Generate Blendshape Coefficients

You can simply unzip data/blendshape_coeffcients.zip.

If you want to generate the coefficients yourself, we recommend constructing the BlendVOCA directory as follows for simple execution of the script.

├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   ├─ blendshapes_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ noseSneerRight.obj
   ├─ templates_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA.obj
   └─ unposedcleaneddata
      ├─ ...
      └─ FaceTalk_170915_00223_TA
         ├─ ...
         └─ sentence40
  • blendshapes_head: Place the constructed blendshape meshes (head).
  • templates_head: Place the template meshes (head).
  • unposedcleaneddata: Download the mesh sequences (unposed cleaned data) from VOCASET.

Then, run the following command:

python script/optimize_blendshape_coeffs.py \
        --blendshapes_coeffs_out_dir "<output_coeffs_dir>"
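
Each resulting CSV holds one row of coefficients per frame with one column per blendshape. A single frame's head mesh can then be reconstructed with the standard linear blendshape model, neutral + sum_i c_i * (blendshape_i - neutral); the sketch below assumes trimesh and pandas, and the paths and file names are placeholders:

    import numpy as np
    import pandas as pd
    import trimesh

    subject = "FaceTalk_170915_00223_TA"
    neutral = trimesh.load(f"BlendVOCA/templates_head/{subject}.obj", process=False)
    neutral_v = np.asarray(neutral.vertices)

    # One row per frame; assumes every column is a blendshape name.
    coeffs = pd.read_csv("sentence01.csv")
    frame = coeffs.iloc[0]

    # Linear blendshape model: neutral + sum_i c_i * (blendshape_i - neutral)
    vertices = neutral_v.copy()
    for name, value in frame.items():
        bs = trimesh.load(
            f"BlendVOCA/blendshapes_head/{subject}/{name}.obj", process=False
        )
        vertices += value * (np.asarray(bs.vertices) - neutral_v)

    trimesh.Trimesh(vertices=vertices, faces=neutral.faces, process=False).export(
        "frame_000.obj"
    )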

After generating the blendshape coefficients, create coeffs_std.csv, which contains the standard deviation of each coefficient. Refer to data/coeffs_std.csv.

jawForward,...
<std_jawForward>,...
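
One way to produce this file is to concatenate every generated coefficient CSV and take the per-column standard deviation. A minimal sketch with pandas (the input directory is a placeholder, and pandas' default sample standard deviation is used, which may differ from the exact convention of the provided data/coeffs_std.csv):

    from pathlib import Path

    import pandas as pd

    # Placeholder for the directory written by optimize_blendshape_coeffs.py.
    coeffs_dir = Path("blendshape_coeffs")

    frames = pd.concat(
        (pd.read_csv(p) for p in sorted(coeffs_dir.rglob("*.csv"))),
        ignore_index=True,
    )

    # One header row of blendshape names, one row of standard deviations.
    frames.std().to_frame().T.to_csv("coeffs_std.csv", index=False)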

Training / Evaluation on BlendVOCA

Dataset Directory Setting

We recommend constructing the BlendVOCA directory as follows for simple execution of the scripts.

├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   ├─ audio
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ sentence40.wav
   ├─ blendshape_coeffs
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ sentence40.csv
   ├─ blendshapes_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ noseSneerRight.obj
   └─ templates_head
      ├─ ...
      └─ FaceTalk_170915_00223_TA.obj
  • audio: Download the audio from VOCASET.
  • blendshape_coeffs: Place the constructed blendshape coefficients.
  • blendshapes_head: Place the constructed blendshape meshes (head).
  • templates_head: Place the template meshes (head).

Training VAE, SAiD

  • Train VAE

    python script/train_vae.py \
            --output_dir "<output_logs_dir>" \
            [--coeffs_std_path "<coeffs_std>.csv"]
  • Train SAiD

    python script/train.py \
            --output_dir "<output_logs_dir>"

Evaluation

  1. Generate SAiD outputs on the test speech data

    python script/test_inference.py \
            --weights_path "<SAiD_weights>.pth" \
            --output_dir "<output_coeffs_dir>"
  2. Remove the FaceTalk_170809_00138_TA/sentence32-xx.csv files from the output directory, since the ground-truth data does not contain motion data for FaceTalk_170809_00138_TA/sentence32.

  3. Evaluate SAiD outputs: FD, WInD, and Multimodality.

    python script/test_evaluate.py \
            --coeffs_dir "<input_coeffs_dir>" \
            [--vae_weights_path "<VAE_weights>.pth"] \
            [--blendshape_residuals_path "<blendshape_residuals>.pickle"]
  4. Generate videos to compute the AV offset/confidence. To avoid a memory leak in the pyrender module, we use a shell script. After updating COEFFS_DIR and OUTPUT_DIR in the script, run it:

    # Fix 1 (in script/test_render.sh): COEFFS_DIR="<input_coeffs_dir>"
    # Fix 2 (in script/test_render.sh): OUTPUT_DIR="<output_video_dir>"
    bash script/test_render.sh
  5. Use SyncNet to compute the AV offset/confidence.

Reference

If you use this code as part of any research, please cite the following paper.

@misc{park2023said,
      title={SAiD: Speech-driven Blendshape Facial Animation with Diffusion},
      author={Inkyu Park and Jaewoong Cho},
      year={2023},
      eprint={2401.08655},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}