/S2G-MDDiffusion

Primary language: Python · License: MIT

[CVPR'24] Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Xu He1 · Qiaochu Huang1 · Zhensong Zhang2 · Zhiwei Lin1 · Zhiyong Wu1,4 ·
Sicheng Yang1 · Minglei Li3 · Zhiyi Chen3 · Songcen Xu2 · Xiaofei Wu2
1Shenzhen International Graduate School, Tsinghua University     2Huawei Noah’s Ark Lab
3Huawei Cloud Computing Technologies Co., Ltd     4The Chinese University of Hong Kong

Paper Arxiv     Project Page    Youtube


📣 News

  • [2024.05.06] Release training and inference code with instructions to preprocess the PATS dataset.

  • [2024.03.25] Release paper.

🗒 TODOs

  • Release data preprocessing code.
  • Release inference code.
  • Release pretrained weights.
  • Release training code.
  • Release code about evaluation metrics.
  • Release the presentation video.

⚒️ Environment

We recommend Python >= 3.7 and CUDA 11.7. Other compatible versions may also work.

conda create -n MDD python=3.7
conda activate MDD
pip install -r requirements.txt

We tested our code on NVIDIA A10, NVIDIA A100, and NVIDIA GeForce RTX 4090 GPUs.
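To confirm that the environment matches the recommendation above, a small check like the following may help. It is only a sketch: `check_environment` is a hypothetical helper that is not part of this repository, and it assumes PyTorch is among the packages installed by requirements.txt (if it is not, the PyTorch fields simply stay empty).

```python
import sys


def check_environment(min_python=(3, 7)):
    """Report the Python version and, if PyTorch is installed, its CUDA setup."""
    report = {
        "python": "%d.%d" % sys.version_info[:2],
        "python_ok": sys.version_info[:2] >= min_python,
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    try:
        import torch  # assumption: provided by requirements.txt
    except ImportError:
        return report
    report["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds)
    report["cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_build` differs from 11.7, the code may still run, since other compatible versions are possible.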

⭕ Quick Start

Download our trained weights, including motion_decoupling.pth.tar and motion_diffusion.pt, from Baidu Netdisk and put them in the inference/ckpt folder.

Download the WavLM Large model and put it in the inference/data/wavlm folder.

Now, get started with the following code:

cd inference
CUDA_VISIBLE_DEVICES=0 python inference.py --wav_file ./assets/001.wav --init_frame ./assets/001.png --use_motion_selection
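If inference fails at startup, a missing file is a likely cause. The sketch below checks the paths named in this section, relative to the inference folder. `missing_quickstart_files` is a hypothetical helper, not part of the repository. Since the file name of the WavLM checkpoint is not stated here, the check only verifies that data/wavlm exists and is not empty.

```python
from pathlib import Path

# Paths taken from the Quick Start text, relative to the inference/ folder.
REQUIRED = [
    "ckpt/motion_decoupling.pth.tar",
    "ckpt/motion_diffusion.pt",
    "data/wavlm",
    "assets/001.wav",
    "assets/001.png",
]


def missing_quickstart_files(inference_dir="."):
    """Return the required paths that are absent (or are empty folders)."""
    root = Path(inference_dir)
    missing = []
    for rel in REQUIRED:
        path = root / rel
        if not path.exists():
            missing.append(rel)
        elif path.is_dir() and not any(path.iterdir()):
            missing.append(rel)  # folder exists but nothing was downloaded into it
    return missing


if __name__ == "__main__":
    problems = missing_quickstart_files(".")
    print("All files found." if not problems else f"Missing: {problems}")
```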

📊 Data Preparation

Due to copyright considerations, we are unable to directly provide the preprocessed data subset mentioned in our paper. Instead, we provide the filtered interval ids and preparation instructions.

To get started, please download the meta file cmu_intervals_df.csv provided by PATS (you can find it in any of the zip files) and put it in the data-preparation folder. Then run the following commands to prepare the data:

cd data-preparation
bash prepare_data.sh

After running the above code, you will get the following folder structure containing the preprocessed data:

|--- data-preparation
|    |--- data
|    |    |--- img
|    |    |    |--- train
|    |    |    |    |--- chemistry#99999.mp4
|    |    |    |    |--- oliver#88888.mp4
|    |    |    |--- test
|    |    |    |    |--- jon#77777.mp4
|    |    |    |    |--- seth#66666.mp4
|    |    |--- audio
|    |    |    |--- chemistry#99999.wav
|    |    |    |--- oliver#88888.wav
|    |    |    |--- jon#77777.wav
|    |    |    |--- seth#66666.wav
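The layout above suggests that every video under img/train or img/test has a same-named .wav file in audio. Assuming that pairing is intended, the following sketch lists clips that lack a counterpart. `find_unpaired_clips` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_unpaired_clips(data_dir):
    """Return (videos_without_audio, audio_without_video) as sorted lists of clip ids."""
    data = Path(data_dir)
    # A clip id is the file name without its extension, e.g. "oliver#88888".
    audio_ids = {p.stem for p in (data / "audio").glob("*.wav")}
    video_ids = set()
    for split in ("train", "test"):
        video_ids |= {p.stem for p in (data / "img" / split).glob("*.mp4")}
    return sorted(video_ids - audio_ids), sorted(audio_ids - video_ids)


if __name__ == "__main__":
    no_audio, no_video = find_unpaired_clips("data-preparation/data")
    print(f"videos without audio: {no_audio}")
    print(f"audio without video: {no_video}")
```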

🔥 Train Your Own Model

Here we use accelerate for distributed training.

Train the Motion Decoupling Module

Change into the stage1 folder:

cd stage1

Then run the following code to train the motion decoupling module:

accelerate launch run.py --config config/stage1.yaml --mode train

Checkpoints will be saved in the log folder as stage1.pth.tar, which is then used to extract the keypoint features:

CUDA_VISIBLE_DEVICES=0 python run_extraction.py --config config/stage1.yaml --mode extraction --checkpoint log/stage1.pth.tar --device_ids 0 --train
CUDA_VISIBLE_DEVICES=0 python run_extraction.py --config config/stage1.yaml --mode extraction --checkpoint log/stage1.pth.tar --device_ids 0 --test

The extracted motion features will be saved in the feature folder.

Train the Latent Motion Diffusion Module

Change into the stage2 folder:

cd ../stage2

Download the WavLM Large model and put it in the data/wavlm folder. Then slice and preprocess the data:

cd data 
python create_dataset_gesture.py --stride 0.4 --length 3.2 --keypoint_folder ../stage1/feature --wav_folder ../data-preparation/data/audio --extract-baseline --extract-wavlm
cd ..
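To get a feel for how many training clips the slicing step produces, here is a rough sketch. It assumes that --length and --stride are a window length and a hop size in seconds, applied as a plain sliding window. The README does not state the units or the exact behavior, so check create_dataset_gesture.py before relying on it. `window_starts` is a hypothetical helper.

```python
def window_starts(duration_s, length_s=3.2, stride_s=0.4):
    """Start times (in seconds) of all full windows that fit in a recording."""
    # Work in integer milliseconds to avoid floating-point accumulation error.
    duration = round(duration_s * 1000)
    length = round(length_s * 1000)
    stride = round(stride_s * 1000)
    if duration < length:
        return []  # recording is shorter than one window
    count = (duration - length) // stride + 1
    return [i * stride / 1000 for i in range(count)]


if __name__ == "__main__":
    starts = window_starts(10.0)
    print(len(starts), starts[0], starts[-1])  # 18 windows, from 0.0 s to 6.8 s
```

Under these assumptions, consecutive windows overlap by 2.8 s, so a 10-second recording yields 18 clips.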

Run the following code to train the latent motion diffusion module:

accelerate launch train.py

Train the Refinement Network

Change into the stage3 folder:

cd ../stage3

Download mobile_sam.pt provided by MobileSAM and put it in the pretrained_weights folder. Then extract the hand bounding boxes used for the weighted loss (only the training set is needed):

python get_bbox.py --img_dir ../data-preparation/data/img/train
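The README does not spell out how the bounding boxes enter the loss. As an illustration only, one common approach is to up-weight the reconstruction error inside the hand regions. The sketch below shows this with plain Python lists. The function names, the box format (x_min, y_min, x_max, y_max) with exclusive maximum edges, and the weight value are all assumptions, not the repository's implementation.

```python
def bbox_weight_mask(height, width, boxes, hand_weight=10.0):
    """Per-pixel weights: hand_weight inside any box, 1.0 elsewhere."""
    mask = [[1.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image so out-of-range boxes are harmless.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = hand_weight
    return mask


def weighted_l1(pred, target, mask):
    """Weighted mean absolute error, normalized by the sum of the weights."""
    total = 0.0
    weight_sum = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += m * abs(p - t)
            weight_sum += m
    return total / weight_sum


if __name__ == "__main__":
    mask = bbox_weight_mask(2, 2, [(0, 0, 1, 1)], hand_weight=3.0)
    pred = [[0.0, 0.0], [0.0, 0.0]]
    target = [[1.0, 0.0], [0.0, 0.0]]
    print(weighted_l1(pred, target, mask))  # 0.5, versus 0.25 unweighted
```

With such a weighting, an error on a hand pixel contributes more to the loss than the same error on the background.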

Now you can train the refinement network:

accelerate launch run.py --config config/stage3.yaml --mode train --tps_checkpoint ../stage1/log/stage1.pth.tar

✏️ Citing

If you find our work useful, please consider citing:

@inproceedings{he2024co,
  title={Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model},
  author={He, Xu and Huang, Qiaochu and Zhang, Zhensong and Lin, Zhiwei and Wu, Zhiyong and Yang, Sicheng and Li, Minglei and Chen, Zhiyi and Xu, Songcen and Wu, Xiaofei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2263--2273},
  year={2024}
}

🙏 Acknowledgments

Our code builds on several excellent repositories. We thank their authors for making their code publicly available.