DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion

Yukun Huang, Jianan Wang, Ailing Zeng, Zheng-Jun Zha, Lei Zhang, Xihui Liu

🪄 Introduction

DreamWaltz-G utilizes skeleton-guided 2D diffusion for text-to-3D avatar generation and expressive whole-body animation, which supports diverse applications like shape control & editing, 2D human video reenactment, and 3D Gaussian scene composition.

📢 News

[2024-10-15] 🔥Release the training and inference code.
[2024-10-15] 🔥Release the pre-trained models of 12 full-body 3D Gaussian avatars ready for inference.
[2024-10-15] 🔥Release a dataset for 2D human video reenactment. It comprises 19 human motion scenes with original videos, inpainted videos (where humans are removed), SMPL-X motions, and camera parameters.
[2024-09-26] 📢Publish the arXiv preprint and update the project page.

⚙️ Setup

Please follow the instructions below to get the code and install dependencies.

Clone this repository and navigate to the DreamWaltz-G folder:

git clone https://github.com/Yukun-Huang/DreamWaltz-G.git
cd DreamWaltz-G

Install packages. Note that requirements.txt is automatically generated and may not be accurate. We recommend using the provided script for installation:

bash scripts/install.sh

Activate the installed conda environment:

conda activate dreamwaltz

[Optional] As in DreamWaltz, the CUDA extensions for Instant-NGP (heavily borrowed from stable-dreamfusion and latent-nerf) are required and are built at runtime. If you prefer to build them manually, the following commands may be useful:

python -m core.nerf.freqencoder.backend
python -m core.nerf.gridencoder.backend
python -m core.nerf.raymarching.rgb.backend
# python -m core.nerf.raymarching.latent.backend  # uncomment this if you want to use Latent-NeRF

🤖 Models

Human Templates (Required for Training and Inference)

Before running the code, you need to prepare the human template models: SMPL-X, FLAME, and VPoser. Please download them from the official project pages: https://smpl-x.is.tue.mpg.de/ and https://flame.is.tue.mpg.de/, then organize them following the structure below:

external
└── human_templates
    ├── smplx
    │   ├── SMPLX_NEUTRAL_2020.npz
    │   ├── FLAME_vertex_ids.npy
    │   ├── MANO_vertex_ids.pkl
    │   └── smplx_vert_segmentation.json
    ├── flame
    │   └── FLAME_masks.pkl
    └── vposer
        └── v2.0
            ├── snapshots
            │   ├── V02_05_epoch=08_val_loss=0.03.ckpt
            │   └── V02_05_epoch=13_val_loss=0.03.ckpt
            ├── V02_05.yaml
            └── V02_05.log

If you already have these models on your machine, you can simply modify the path in configs/path.py to link to them.
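Before training or inference, it can help to confirm that the template files are where the tree above expects them. The following is a minimal sketch, not part of the repository: the helper name `find_missing_templates` is hypothetical, and the file list is simply copied from the tree above. If you changed the paths in configs/path.py, pass your own root instead of the default.

```python
from pathlib import Path

# Relative paths copied from the directory tree above (not read from the repo).
EXPECTED_TEMPLATE_FILES = [
    "smplx/SMPLX_NEUTRAL_2020.npz",
    "smplx/FLAME_vertex_ids.npy",
    "smplx/MANO_vertex_ids.pkl",
    "smplx/smplx_vert_segmentation.json",
    "flame/FLAME_masks.pkl",
    "vposer/v2.0/snapshots/V02_05_epoch=08_val_loss=0.03.ckpt",
    "vposer/v2.0/snapshots/V02_05_epoch=13_val_loss=0.03.ckpt",
    "vposer/v2.0/V02_05.yaml",
    "vposer/v2.0/V02_05.log",
]


def find_missing_templates(root="external/human_templates"):
    """Return the expected template files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_TEMPLATE_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_templates()
    if missing:
        print("Missing human template files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All human template files found.")
```

Run it from the repository root; an empty result means the layout matches the tree.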

Pre-trained Instant-NGP (Required for Training)

DreamWaltz-G adopts a two-stage NeRF→3DGS training pipeline, where the NeRF is initialized with SMPL-X before training. We provide these pre-trained NeRFs (Instant-NGP, specifically) on HuggingFace. You may download and organize them following the structure below:

external
└── human_templates
    ├── instant-ngp
    │   ├── adult_neutral
    │   │   ├── step_005000.pth
    │   │   └── 005000_image.mp4
    ...

Alternatively, if you want to train them yourself, you can simply run the script:

bash scripts/pretrain_nerf.sh

Pre-trained 3D Avatars (Ready for Inference)

We provide the pre-trained weights of 12 full-body 3D Gaussian avatars, ready for 3D animation and 2D video reenactment without training. You may download them from HuggingFace and organize them following the structure below:

outputs
├── w_expr
│   ├── a_chef_dressed_in_white
│   ├── a_gardener_in_overalls_and_a_wide-brimmed_hat
│   └── ...
└── wo_expr
    ├── a_clown
    ├── black_widow
    └── ...

Unfortunately, due to limitations of DreamWaltz-G and SMPL-X, not all of these avatars support expression control. Specifically, the avatars in w_expr support expression control (e.g., realistic humans), while the avatars in wo_expr do not (e.g., fictional characters).
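To see which downloaded avatars fall into each group, the directory layout above can be read directly. This is a small sketch under the assumption that each avatar is one sub-directory of outputs/w_expr or outputs/wo_expr, as in the tree above; `list_avatars` and `supports_expression` are hypothetical helper names, not functions from the repository.

```python
from pathlib import Path

GROUPS = ("w_expr", "wo_expr")


def list_avatars(outputs_root="outputs"):
    """Map each group name to the sorted avatar directory names found in it."""
    root = Path(outputs_root)
    avatars = {}
    for group in GROUPS:
        group_dir = root / group
        if group_dir.is_dir():
            avatars[group] = sorted(p.name for p in group_dir.iterdir() if p.is_dir())
        else:
            avatars[group] = []  # group not downloaded yet
    return avatars


def supports_expression(avatar_name, outputs_root="outputs"):
    """True if the avatar directory lives under w_expr."""
    return (Path(outputs_root) / "w_expr" / avatar_name).is_dir()


if __name__ == "__main__":
    for group, names in list_avatars().items():
        print(group + ": " + (", ".join(names) if names else "(none found)"))
```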

💼 Datasets

As a score distillation-based method, DreamWaltz-G is supervised by a pre-trained 2D diffusion model and requires no training data. The data introduced below is only used for inference.

Expressive 3D Animation

We provide data loaders to read SMPL-X motion sequences from four publicly available human motion datasets: Motion-X, TalkSHOW, AIST++, and 3DPW. This motion data can be used to animate our 3D avatars for various demos.

To use these datasets, you may download them from their official websites and organize them according to the following structure (no need to unzip):

datasets
├── 3DPW
│   ├── readme_and_demo.zip
│   ├── sequenceFiles.zip
│   └── SMPL-X.zip
├── AIST++
│   ├── 20210308_cameras.zip
│   └── 20210308_motions.zip
├── Motion-X
│   └── motionx_smplx.zip
└── TalkShow
    ├── chemistry_pkl_tar.tar.gz
    ├── conan_pkl_tar.tar.gz
    └── ...

For more details, please refer to our code in data/human/.
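Since the archives stay compressed, it can be handy to peek inside one without extracting it. The sketch below uses only the Python standard library (`zipfile` and `tarfile`); it illustrates reading archives in place and is not the loader implemented in data/human/. The helper name `list_archive_members` is hypothetical.

```python
import tarfile
import zipfile


def list_archive_members(archive_path, limit=5):
    """Return up to `limit` member names from a .zip or .tar(.gz) archive."""
    archive_path = str(archive_path)
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            return zf.namelist()[:limit]
    if tarfile.is_tarfile(archive_path):
        # Compression (e.g. gzip) is detected automatically by tarfile.open.
        with tarfile.open(archive_path) as tf:
            return [member.name for member in tf.getmembers()[:limit]]
    raise ValueError("not a zip or tar archive: " + archive_path)


if __name__ == "__main__":
    import sys

    # Usage: python peek_archive.py datasets/AIST++/20210308_motions.zip
    for name in list_archive_members(sys.argv[1]):
        print(name)
```

Note that listing a large .tar.gz requires scanning the whole archive, so it may take a while.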

2D Human Video Reenactment

We build a new dataset from Motion-X for 2D human video reenactment. It comprises 19 human motion scenes with original videos, inpainted videos (where humans are removed), SMPL-X motions, and camera parameters. You may download this dataset from HuggingFace and place it according to the structure below (no need to unzip):

datasets
├── Motion-X-ReEnact
│   └── Motion-X-ReEnact.zip
...

Based on this dataset, the generated 3D avatars can be projected onto 2D inpainted videos to achieve motion reenactment. We hope that this dataset can assist future work in evaluating the human video reenactment task.

💃 Training

To create a full-body 3D avatar from text with expression control (applicable to realistic humans), you may run the command:

bash scripts/train_w_expr.sh "a chef dressed in white"

To create a full-body 3D avatar from text without expression control (applicable to most cases), you may run the command:

bash scripts/train_wo_expr.sh "Rapunzel in Tangled"
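To queue several prompts, the two commands above can be wrapped in a small launcher. This is only a sketch: it assumes it is run from the repository root and that each script takes the prompt as its single argument, as in the commands above. `build_train_command` and `train_all` are hypothetical names, not part of the repository.

```python
import shlex
import subprocess

# Script paths taken from the two commands above.
TRAIN_SCRIPTS = {
    True: "scripts/train_w_expr.sh",    # with expression control
    False: "scripts/train_wo_expr.sh",  # without expression control
}


def build_train_command(prompt, with_expression=False):
    """Build the argument list for one training run."""
    if not prompt.strip():
        raise ValueError("prompt must be non-empty")
    return ["bash", TRAIN_SCRIPTS[with_expression], prompt]


def train_all(jobs, dry_run=True):
    """Run (or just print) one training command per (prompt, with_expression) pair."""
    commands = [build_train_command(prompt, expr) for prompt, expr in jobs]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))  # show the shell-quoted command only
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing run
    return commands
```

For example, `train_all([("a chef dressed in white", True), ("Rapunzel in Tangled", False)])` prints the two commands shown above; pass `dry_run=False` to execute them one after another.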

From our training script, you may notice that we split the two-stage training pipeline into 5 sub-stages, which helps with debugging and ablation analysis.

The whole training takes several hours on a single NVIDIA L40S GPU.

🕺 Inference

Avatars in Canonical Pose

Assuming you have downloaded the pre-trained 3D avatars and placed them correctly, you can run the following script to visualize the 3D avatars in their canonical poses:

bash scripts/inference_canonical.sh

The results are saved as images and videos in the respective model directories.

Expressive 3D Animation

Assuming you have downloaded the pre-trained 3D avatars and placed them correctly, you can run the following scripts to animate them using the SMPL-X motion sequences stored in assets/motions/.

For 3D animation using motions from TalkSHOW (w/ expression control), you may run:

bash scripts/inference_talkshow.sh

The results are saved as images and videos in the respective model directories.

For 3D animation using motions from AIST++ (w/o expression control), you may run:

bash scripts/inference_aist.sh

The results are saved as images and videos in the respective model directories.

2D Human Video Reenactment

We also provide an inference script for 2D human video reenactment. Please download our dataset first and place the zip file in datasets/Motion-X-ReEnact/. Once the pre-trained models and data are ready, you may run:

bash scripts/inference_reenact.sh

The results are saved as images and videos in the respective model directories.

🗣️ Discussions

The generation results are not satisfactory and suffer from problems such as over-saturation, missing parts, and blurring.

DreamWaltz-G utilizes stable-diffusion-v1-5 and vanilla SDS for learning 3D representations, and thus inherits the defects of these methods. We recommend adopting more advanced diffusion models and score distillation techniques, such as ControlNeXt and ISM.
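For reference, the vanilla SDS gradient as introduced in DreamFusion can be written as follows. Here θ denotes the parameters of the 3D representation, x = g(θ) a rendered view (with Stable Diffusion this is typically its latent encoding), y the text prompt, ε̂_φ the noise predicted by the frozen diffusion model, and w(t) a timestep weighting. The extra condition c, standing for the skeleton guidance, is a notational assumption here; the exact objective used by DreamWaltz-G is described in the paper, not in this README.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y, c, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x = g(\theta),\quad
x_t = \sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad
\epsilon \sim \mathcal{N}(0, I).
```

Because the gradient uses the raw residual ε̂_φ − ε and skips the diffusion model's Jacobian, it is commonly run with a high guidance scale, which is one frequently cited source of the over-saturation mentioned above.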

Expression control is not accurate, especially for fictional characters.

Even with a 2D diffusion model that supports face landmark control, learning accurate 3D expression control via score distillation remains challenging. The expression control of DreamWaltz-G relies largely on SMPL-X. Therefore, when the face of the generated 3D avatar deviates significantly from the SMPL-X template, the expression control becomes inaccurate.

👏 Acknowledgement

This repository is based on many amazing research works and open-source projects: gaussian-splatting, diffusers, stable-dreamfusion, latent-nerf, threestudio, Deformable-3D-Gaussians, diff-gaussian-rasterization, gaussian-mesh-splatting, SuGaR, smplx, etc. Thanks to all the authors for their selfless contributions to the community!

😉 Citation

If you find this repository helpful for your work, please consider citing it as follows:

@article{huang2024dreamwaltz-g,
  title={{DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion}},
  author={Huang, Yukun and Wang, Jianan and Zeng, Ailing and Zha, Zheng-Jun and Zhang, Lei and Liu, Xihui},
  year={2024},
  eprint={2409.17145},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}

@inproceedings{huang2024dreamwaltz,
  title={{DreamWaltz: Make a Scene with Complex 3D Animatable Avatars}},
  author={Huang, Yukun and Wang, Jianan and Zeng, Ailing and Cao, He and Qi, Xianbiao and Shi, Yukai and Zha, Zheng-Jun and Zhang, Lei},
  booktitle={Advances in Neural Information Processing Systems},
  pages={4566--4584},
  year={2023}
}


Related topics and future explorations.

Please feel free to contact me if you have any questions, thoughts or opportunities for academic collaboration.
