FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing

Mingyuan Zhang¹ Huirong Li¹ Zhongang Cai^1,2 Jiawei Ren¹ Lei Yang² Ziwei Liu¹⁺

¹S-Lab, Nanyang Technological University ²SenseTime Research

⁺corresponding author

[Project Page] • [PDF]

Accepted to NeurIPS 2023

Abstract: Text-driven motion generation has achieved substantial progress with the emergence of diffusion models. However, existing methods still struggle to generate complex motion sequences that correspond to fine-grained descriptions, depicting detailed and accurate spatio-temporal actions. This lack of fine controllability limits the usage of motion generation to a larger audience. To tackle these challenges, we present FineMoGen, a diffusion-based motion generation and editing framework that can synthesize fine-grained motions, with spatial-temporal composition to the user instructions. To facilitate a large-scale study on this new fine-grained motion generation task, we also contribute the HuMMan-MoGen dataset, which contains fine-grained description for each body part and each action stage.

Pipeline Overview: FineMoGen builds upon diffusion model with a novel transformer architecture dubbed Spatio-Temporal Mixture Attention (SAMI). SAMI optimizes the generation of the global attention template from three perspectives: 1) Temporal Independence: we regard each global template as a time-varied signal, which allows us to extrapolate the feature refinement between different time intervals. 2) Spatial Independence: we manually divide the raw motion data into several body parts, process them independently in FFN modules and apply sptial refinement in SAMI modules. 3) Sparsely-Activated Mixture-of-Expert: we broaden the overall network structure to enhance learning capability and overcome the training difficulties brought by spatial-temporal independence modelling.

Updates

[12/2023] Release code for FineMoGen, MoMat-MoGen, ReMoDiffuse and MotionDiffuse

Benchmark and Model Zoo

Supported methods

Citation

If you find our work useful for your research, please consider citing the paper:

@article{zhang2023finemogen,
  title={FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing},
  author={Zhang, Mingyuan and Li, Huirong and Cai, Zhongang and Ren, Jiawei and Yang, Lei and Liu, Ziwei},
  journal={NeurIPS},
  year={2023}
}
@article{zhang2023remodiffuse,
  title={ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model},
  author={Zhang, Mingyuan and Guo, Xinying and Pan, Liang and Cai, Zhongang and Hong, Fangzhou and and Yang, Lei and Liu, Ziwei},
  journal={arXiv preprint arXiv:2304.01116},
  year={2023}
}
@article{zhang2022motiondiffuse,
  title={MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model},
  author={Zhang, Mingyuan and Cai, Zhongang and Pan, Liang and Hong, Fangzhou and Guo, Xinying and Yang, Lei and Liu, Ziwei},
  journal={arXiv preprint arXiv:2208.15001},
  year={2022}
}

Installation

# Create Conda Environment
conda create -n mogen python=3.9 -y
conda activate mogen

# C++ Environment
export PATH=/mnt/lustre/share/gcc/gcc-8.5.0/bin:$PATH
export LD_LIBRARY_PATH=/mnt/lustre/share/gcc/gcc-8.5.0/lib:/mnt/lustre/share/gcc/gcc-8.5.0/lib64:/mnt/lustre/share/gcc/gmp-4.3.2/lib:/mnt/lustre/share/gcc/mpc-0.8.1/lib:/mnt/lustre/share/gcc/mpfr-2.4.2/lib:$LD_LIBRARY_PATH

# Install Pytorch
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch -y

# Install MMCV
pip install "mmcv-full>=1.4.2,<=1.9.0" -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12.1/index.html

# Install Pytorch3d
conda install -c bottler nvidiacub -y
conda install -c fvcore -c iopath -c conda-forge fvcore iopath -y
conda install pytorch3d -c pytorch3d -y

# Install tutel
python3 -m pip install --verbose --upgrade git+https://github.com/microsoft/tutel@main

# Install other requirements
pip install -r requirements.txt

Data Preparation

Download data files from google drive link. Unzipped all files and arrange them in the following file structure:

FineMoGen
├── mogen
├── tools
├── configs
├── logs
│   ├── finemogen
│   ├── motiondiffuse
│   ├── remodiffuse
│   └── mdm
└── data
    ├── database
    ├── datasets
    ├── evaluators
    └── glove

Training

Training with a single / multiple GPUs

PYTHONPATH=".":$PYTHONPATH python tools/train.py ${CONFIG_FILE} ${WORK_DIR} --no-validate

Note: The provided config files are designed for training with 8 gpus. If you want to train on a single gpu, you can reduce the number of epochs to one-fourth of the original.

Training with Slurm

./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} --no-validate

Common optional arguments include:

--resume-from ${CHECKPOINT_FILE}: Resume from a previous checkpoint file.
--no-validate: Whether not to evaluate the checkpoint during training.

Example: using 8 GPUs to train ReMoDiffuse on a slurm cluster.

./tools/slurm_train.sh my_partition my_job configs/finemogen/finemogen_kit.py logs/finemogen_kit 8 --no-validate

Evaluation

Evaluate with a single GPU / multiple GPUs

PYTHONPATH=".":$PYTHONPATH python tools/test.py ${CONFIG} --work-dir=${WORK_DIR} ${CHECKPOINT}

Evaluate with slurm

./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG} ${WORK_DIR} ${CHECKPOINT}

Example:

./tools/slurm_test.sh my_partition test_finemogen configs/finemogen/finemogen_kit.py logs/finemogen/finemogen_kit logs/finemogen/finemogen_kit/latest.pth

Note: Run full evaluation for HumanML3D dataset is very slow. You can change replication_times in human_ml3d_bs128.py to $1$ for a quick evaluation.

Visualization

Visualization for a single motion

PYTHONPATH=".":$PYTHONPATH python tools/visualize.py ${CONFIG} ${CHECKPOINT} \
    --text ${TEXT} \
    --motion_length ${MOTION_LENGTH} \
    --out ${OUTPUT_ANIMATION_PATH} \
    --device cpu

Example:

PYTHONPATH=".":$PYTHONPATH python tools/visualize.py \
    configs/remodiffuse/remodiffuse_t2m.py \
    logs/finemogen/finemogen_t2m/latest.pth \
    --text "a person is running quickly" \
    --motion_length 120 \
    --out "test.gif" \
    --device cpu

Visualization for temporal composition

PYTHONPATH=".":$PYTHONPATH python tools/visualize.py ${CONFIG} ${CHECKPOINT} \
    --text ${TEXT1} ${TEXT2} ${TEXT3} ... \
    --motion_length ${MOTION_LENGTH1} ${MOTION_LENGTH2} ${MOTION_LENGTH3} ... \
    --out ${OUTPUT_ANIMATION_PATH} \
    --device cpu

Example:

PYTHONPATH=".":$PYTHONPATH python tools/visualize.py \
    configs/finemogen/finemogen_t2m.py \
    logs/finemogen/finemogen_t2m/latest.pth \
    --text "a person walks 4 steps forward" "a person stops and looks around" "a perons sits down" "a person cries" "a person jumps backward"  \
    --motion_length 60 60 60 60 60 \
    --out "test.gif"

Acknowledgement

This study is supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOE-T2EP20221-0012), NTU NAP, and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

The visualization tool is developed on top of Generating Diverse and Natural 3D Human Motions from Text

mingyuan-zhang/FineMoGen

FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing

[Project Page] • [PDF] Accepted to NeurIPS 2023

Updates

Benchmark and Model Zoo

Supported methods

Citation

Installation

Data Preparation

Training

Training with a single / multiple GPUs

Training with Slurm

Evaluation

Evaluate with a single GPU / multiple GPUs

Evaluate with slurm

Visualization

Visualization for a single motion

Visualization for temporal composition

Acknowledgement

[Project Page] • [PDF]

Accepted to NeurIPS 2023