Han Yang, Kun Su, Yutong Zhang, Jiaben Chen, Kaizhi Qian, Gaowen Liu, Chuang Gan
This is the official repository of UniMuMo, a unified music, motion and text generation model. In this repository, we present model and data processing code, as well as the model weights.
# clone project
git clone https://github.com/hanyangclarence/UniMuMo
# create conda environment
cd UniMuMo
conda create -n unimumo python=3.9
conda activate unimumo
# install dependencies
pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
pip install madmom==0.16.1
The weight of UniMuMo consists of three parts: a music VQ-VAE, a motion VQ-VAE and a music-motion LM. For inference, please download the unified weight that includes all three parts from here. For data preprocessing or training, only one or two parts of them are required for each stage. So please download the separate weights from here.
After downloaded, please put the weights into folder pretrained
For testing the generation results, run the following command:
python generate.py --help
--ckpt The path to the trained model
# about conditioning
-mu_p --music_path The path to the music to be conditioned on
-mo_p --motion_path The path to the motion to be conditioned on
-mu_d, --music_description
The conditional description for music
-mo_d, --motion_description
The conditional description for motion
-t, --generation_target {mu,mo,mumo,text}
The output format to generate, choosing from music (mu), motion (mo), joint music motion (mumo)
and text description (text)
# about generation settings
-gs, --guidance_scale
Guidance scale (Large => better quality and relavancy to text; Small => better diversity)
--temperature Temperature for generation
-d, --duration Generated music/motion time, default is 10.0
--seed Change this value (any integer number) will lead to a different generation result
-bs, --batch_size Number of samples to generate for each prompt each time
--music_meta_dir The path to music metadata, for loading optional text prompts, default is ./data/music
-s --save_path The folder path to save model output
Conditions and generation target and be set arbitrarily, for example:
# generate music and motion without specific conditions
python generate.py --ckpt path_to_weight -t mumo
# generate music and motion with music text description
python generate.py --ckpt path_to_weight -t mumo -mu_d descriptions_for_music
# generate music conditioned on motion and text
python generate.py --ckpt path_to_weight -t mu -mu_d descriptions_for_music -mo_p path_to_motion_condition
# generate music and motion captions
python generate.py --ckpt path_to_weight -t text -mu_p path_to_music_condition -mo_p path_to_motion_condition
For loading the model, here is an example:
from unimumo.models import UniMuMo
from unimumo.motion.utils import visualize_music_motion
model = UniMuMo.from_checkpoint('path_to_checkpoint', device='cuda')
waveform_gen, motion_gen = model.generate_music_motion()
visualize_music_motion(waveform_gen, motion_gen['joint'], 'gen_results', model.motion_fps)
The default training and inference code organizes the data and files into the following structure.
Show Full Tree Structure
UniMuMo_Project
| generate.py
| README.md
| requirements.txt
| train.py
|
+---assets
|
+---configs # all configurations and hyperparameters for the three training stage
| train_caption.yaml
| train_motion_vqvae.yaml
| train_music_motion.yaml
|
+---data # store the training data and metadata
| +---motion
| | | aist_test.txt # dataset split for all three motion datasets
| | | aist_train.txt
| | | aist_val.txt
| | | dancedb_test.txt
| | | dancedb_train.txt
| | | dancedb_val.txt
| | | humanml3d_test.txt
| | | humanml3d_train.txt
| | | humanml3d_val.txt
| | | ignore_list.txt
| | | Mean.npy # mean and std calculated on the three datasets
| | | Std.npy
| | | test_length.pickle # motion sequence length
| | | train_length.pickle
| | | val_length.pickle
| | |
| | +---aligned_motion_code # the folder for all extracted motion codes that are aligned with music, generated by preprocessing/get_aligned_motion_code.py
| | +---humanml3d_text_description # the folder for all HumanML3D text description txt files
| | +---test # all motion features of shape (T, 263). Train, test and val folder have the same structure
| | | \---joint_vecs
| | +---train
| | \---val
| |
| \---music
| | music4all_captions_gpt.json # music captions generated by ChatGPT and Mu-LLaMa
| | music4all_captions_mullama.json
| | music4all_captions_mullama_val_test.json
| | music4all_ignore.txt
| | music4all_metadata.csv # the metadata modified from music4all dataset
| | music4all_test.txt # our split for music4all
| | music4all_train.txt
| | music4all_val.txt
| | musiccaps-public.csv # the downloaded musiccaps test data
| |
| +---audios # the folder for all music4all .mp3 or .wav files
| +---music_beat # the folder for detected music beat, generated by preprocessing/extract_music_code_beat.py
| \---music_codes # the folder for extracted music code, generated by preprocessing/extract_music_code_beat.py
|
+---preprocessing
| extract_music_code_beat.py # extract music code with Encodec and detect music beat
| get_aligned_motion_code.py # align each music track with several motion sequences
| get_text_prompt.py # get music captions from ChatGPT
|
+---test_model
| demo_motiontext2music.py
| demo_musictext2motion.py
| demo_music_motion_alignment_60hz.py
| demo_t2mm.py
| test_motion_vqvae.py
| test_motion2music_aist.py
| test_motion2text_pad.py
| test_music2motion_aist.py
| test_music2text_mullama.py
| test_musiccaps.py
|
\---unimumo # the main code for UniMuMo
Please refer to the website of Music4All to download the dataset.
After downloaded, put the audio files in folder data/music/audios
.
Please download HumanML3D, AIST++
and DanceDB according to their instructions. After downloaded, please put the data and metadata into folder data/motion
. Note that we have provided Mean.npy
and Std.npy
for motion features, which is calculated across all three datasets. Don't overwrite it with the mean and std from HumanML3D dataset.
We use Demucs for splitting music and vocal.
To speed up training, we use Encodec to
extract all the music codes and use drum-aware4beat to track
the music beat before training. Please set the correct data path in preprocessing/extract_music_code_beat.py
and run:
python preprocessing/extract_music_code_beat.py --start 0.0 --end 1.0
Since this process takes a long time, if you have multiple machines, you can split the work by setting --start
and
--end
to specify the start and end point of each job.
Please first check the settings in configs/train_motion_vqvae.yaml
, e.g., the paths of datasets, number of device and node.
Then run:
python train.py --stage train_vqvae --base configs/train_motion_vqvae.yaml
Resuming training can be achieved by appending -r path_to_previous_checkpoint
to above command.
After training the motion VQ-VAE, we use Dynamic Time Warping to pair each music track with several motions and extract the motion codes from the augmented motion sequences prior to training the music-motion LM. Please first set the correct data paths and run:
python preprocessing/get_aligned_motion_code.py --start 0.0 --end 1.0
You can also set --start
and --end
to manually distribute the work.
Please first check the settings in configs/train_lm.yaml
, and run:
python train.py --stage train_music_motion --base configs/train_lm.yaml
Similarly, training can be resumed by appending -r path_to_previous_checkpoint
.
Please run:
python train.py --stage train_caption --mm_ckpt path_to_last_stage_model --base configs/train_lm.yaml
Note that it is required to provide the checkpoint of previous stage in --mm_ckpt
, since the captioning model is built on
the trained music-motion LM.
Finally, we have three separate model checkpoints: an Encodec, a motion VQ-VAE and a music-motion LM. We combine them into
a single checkpoint that can be directly loaded by class UniMuMo
by running:
python unimumo/merge_model_checkpoints.py (provide the paths for all the checkpoints, configs and metadata...)
All the scripts for testing the model in large-scale are in test_model
folder, start with "test_". The name of each script signifies the task it works on.
For example, to test the reconstruction loss of the motion VQ-VAE, run
python test_model/test_motion_vqvae.py --save_dir path_to_save_destination --ckpt path_to_model_ckpt
To test the model on MusicCaps, run
python test_model/test_musiccaps.py --save_dir path_to_save_destination --ckpt path_to_model_ckpt
and you can also set start
and end
config to split the job.
As described in the paper in detail, we directly adopt the evaluation metrics from various repos. Please refer to the paper for further guide on running each metrics.
Our code is partially built on the following repositories: Audiocraft, Stable Diffusion, drum-aware4beat and T2M-GPT. Thanks to their great work!