Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

Homepage | arXiv | Demo Video

1. Experimental setting

The code was tested on an NVIDIA GeForce RTX 4090 with CUDA 12.2. Set up the environment as follows:

conda env create -f environment_priormdm.yml
conda activate PriorMDM
pip install bvh librosa essentia pydub praat-parselmouth torchgeometry moviepy matplotlib==3.1.3
pip install smplx[all]
pip install git+https://github.com/openai/CLIP.git
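
After installation, a quick import check can catch a missing package before a long run. The sketch below is not part of the repository. The import names are the ones these packages are normally imported under (for example, parselmouth for praat-parselmouth and clip for the OpenAI CLIP package), and torch is assumed to be provided by environment_priormdm.yml.

```python
import importlib.util

# Import names for the packages installed above. `torch` is assumed to be
# provided by environment_priormdm.yml.
REQUIRED_MODULES = [
    "torch", "bvh", "librosa", "essentia", "pydub", "parselmouth",
    "torchgeometry", "moviepy", "matplotlib", "smplx", "clip",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

This only checks that each module can be located; it does not verify versions or that CUDA is usable.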

2. Quick start

Download the pre-trained model from Google Drive or Baidu Netdisk and place it in the ./save folder.

python -m sample.double_take --save_dir '' --guidacnce_param 1 --model_name model001000000 --BEAT_wav_feat ./datasets/BEAT/my_wav_feat/ --HUMANML3D_text_feat ./datasets/SMPLX/HumanML3D/v3_HUMANML3D_txt_feat/ --clip_model_path ./data/clip --vis_mode customized_controls

Then you can find the generated video in the ./save/my_v3_0/model001000000 folder.

Video result

positions_vis1_1.0_customized_controls.mp4

Use the following commands to generate the video with audio:

python -m sample.double_take --save_dir '' --guidacnce_param 1 --model_name model001000000 --BEAT_wav_feat ./datasets/BEAT/my_wav_feat/ --HUMANML3D_text_feat ./datasets/SMPLX/HumanML3D/v3_HUMANML3D_txt_feat/ --clip_model_path ./data/clip --vis_mode vis_controls
python -m process.merge_mp4_audio --video_file ./save/my_v3_0/model001000000/positions_vis1_1.0_vis_controls.mp4
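
The two sampling commands in this section differ only in --vis_mode, so they can be scripted from one shared argument list. The sketch below is a hypothetical wrapper, not part of the repository: the helper names sample_command and merge_command are made up for illustration, and all flags and paths are copied verbatim from the commands above.

```python
import shlex

# Arguments copied from the commands above; only --vis_mode differs between them.
COMMON_ARGS = [
    "--save_dir", "",
    "--guidacnce_param", "1",
    "--model_name", "model001000000",
    "--BEAT_wav_feat", "./datasets/BEAT/my_wav_feat/",
    "--HUMANML3D_text_feat", "./datasets/SMPLX/HumanML3D/v3_HUMANML3D_txt_feat/",
    "--clip_model_path", "./data/clip",
]


def sample_command(vis_mode):
    """Build the sample.double_take command for a given visualization mode."""
    return ["python", "-m", "sample.double_take", *COMMON_ARGS, "--vis_mode", vis_mode]


def merge_command(video_file):
    """Build the process.merge_mp4_audio command for a generated video."""
    return ["python", "-m", "process.merge_mp4_audio", "--video_file", video_file]


if __name__ == "__main__":
    video = "./save/my_v3_0/model001000000/positions_vis1_1.0_vis_controls.mp4"
    for cmd in (sample_command("vis_controls"), merge_command(video)):
        # Printing only; pass the list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting problems, such as the empty --save_dir value.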

Video result with audio

positions_vis1_1.0_vis_controls-with-audio.mp4

(Optional) You can use human_body_prior and mdm_motion2smpl.py to generate SMPLX motion (without hands/fingers) from the generated file. Note that you need to modify mdm_motion2smpl.py, and that the human_body_prior environment was tested on an NVIDIA GeForce RTX 2080 Ti with CUDA 12.2:

python ../human_body_prior/tutorials/mdm_motion2smpl.py --input ./save/my_v3_0/model001000000/result_rec_1.0.npy --output ./save/my_v3_0/model001000000/result_rec_1.0_smplx.npz

You can then use Blender to view the SMPLX motion.

Video result converted to SMPLX

0001-0690.mp4

3. Retraining the model

3.1 Data preparation

Text2Motion

We provide the Text2Motion texts and index mapping in the ./prepare folder.

Download Text2Motion motion files in SMPLX format from AMASS and place them in the ./datasets/SMPLX/ folder:

python -m prepare.prepare --smplx_folder ./datasets/SMPLX/
cd prepare
unzip texts.zip
python map_index.py --smplx_folder ./datasets/SMPLX/ --processed_motion_path ./datasets/SMPLX/HumanML3D/motion_data/processed/ --processed_text_path ./datasets/SMPLX/HumanML3D/text_data/processed

The total number of text-motion (SMPLX) pairs after processing is 13248.

Audio2Gesture

Download the updated BEAT dataset from here.

cd ../process
python BEAT2smplx.py --source_BEAT_path ../datasets/BEAT/beat_english_v0.2.1/ --save_BEAT_smplx_path ../datasets/BEAT/my_smplx

Prepare features

Download WavLM Large and place it in the ./data/wavlm_cache/ folder.

Download the SMPL-X model from here, or from the download links in 2. Quick start.

# Adjust the orientation of the motion and downsample AMASS dataset
python process_amass.py --source_HumanML3D_motion ../datasets/SMPLX/HumanML3D/motion_data/processed --processed_motion ../datasets/SMPLX/HumanML3D/processed_motion/ --index_path ../prepare/index.csv
# Extract audio/text features and downsample BEAT dataset, split the dataset into train/val/test
bash process_dataset.sh "prepare" "../datasets/BEAT" "../datasets/SMPLX/HumanML3D" "../data/wavlm_cache/WavLM-Large.pt" "../data/clip" "../data/prcocessed_data"
# Convert the motion format of the SMPLX to position, and extract the motion features
bash process_SMPLX.sh "../human_body_prior/support_data/dowloads/models/" '../datasets/BEAT/my_downsample' "../datasets/SMPLX/HumanML3D/"
# Generate h5 file and calculate the statistics of the motion
bash process_dataset.sh "generate_h5_file" "../datasets/BEAT" "../datasets/SMPLX/HumanML3D" "../data/wavlm_cache/WavLM-Large.pt" "../data/clip" "../data/prcocessed_data"

After this step, you should have v3_train.h5, v3_mean.npy and v3_std.npy in the ./data/prcocessed_data folder.
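
Before starting a long training run, it can help to confirm that these three files were actually written. A minimal sketch, not part of the repository: the helper name missing_outputs is hypothetical, while the file names and folder are taken from the sentence above.

```python
from pathlib import Path

# File names taken from the sentence above.
EXPECTED_FILES = ("v3_train.h5", "v3_mean.npy", "v3_std.npy")


def missing_outputs(data_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty in data_dir."""
    root = Path(data_dir)
    return [
        name for name in expected
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    # Run from the repository root; from ./process use ../data/prcocessed_data.
    missing = missing_outputs("./data/prcocessed_data")
    print("Missing or empty:", ", ".join(missing) if missing else "none")
```

The check covers only existence and non-zero size; it does not validate the contents of the h5 file or the statistics.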

3.2 Training

cd ..
python -m train.train_mdm --save_dir save/my_v3_0 --overwrite --batch_size 256 --n_frames 180 --n_seed 0 --h5file_path ./data/prcocessed_data/v3_train.h5 --statistics_path ./data/prcocessed_data

The trained model is then saved in the ./save/my_v3_0 folder.
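
To choose a value for --model_name in 2. Quick start, one option is to look for the checkpoint with the highest step in the save folder. A minimal sketch, assuming checkpoints are written as model<step>.pt: the .pt extension is an assumption, since only the name model001000000 appears in this README, and the helper latest_model_name is hypothetical.

```python
import re
from pathlib import Path

# Assumption: checkpoints are saved as model<digits>.pt in the save folder.
CHECKPOINT_PATTERN = re.compile(r"^model(\d+)\.pt$")


def latest_model_name(save_dir):
    """Return the stem of the checkpoint with the highest step, or None."""
    root = Path(save_dir)
    if not root.is_dir():
        return None
    best_step, best_name = -1, None
    for path in root.glob("model*.pt"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), path.stem
    return best_name


if __name__ == "__main__":
    print(latest_model_name("./save/my_v3_0"))
```

If the checkpoints use a different extension or naming scheme, adjust CHECKPOINT_PATTERN and the glob accordingly.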

Issues

We noticed that the generated results sometimes show sudden changes in orientation. This may be related to the diversity of character motions in HUMANML3D, and might be mitigated by better data preprocessing or a better motion representation.

Bibtex

If you find this code useful in your research, please cite:

@inproceedings{yang2024Freetalker,
  title={Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness},
  author={Sicheng Yang and Zunnan Xu and Haiwei Xue and Yongkang Cheng and Shaoli Huang and Mingming Gong and Zhiyong Wu},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024}
}

If you have any problems, please raise an issue or contact me at youngseng@qq.com.

YoungSeng/FreeTalker