
# Make-An-Audio 3: Transforming Text into Audio via Flow-based Large Diffusion Transformers

PyTorch implementation of Lumina-t2x and Lumina-Next.

We provide our implementation and pre-trained models as open source in this repository.


## Install dependencies

Note: You may want to adjust the CUDA version according to your driver version.

```bash
conda create -n Make_An_Audio_3 -y
conda activate Make_An_Audio_3
conda install python=3.11 pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

Install [nvidia apex](https://github.com/nvidia/apex) (optional).
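
As a quick sanity check of the environment (a minimal sketch; the expected values simply mirror the pinned versions above):

```python
# Sanity-check the freshly created environment.
import torch

print(torch.__version__)          # expect 2.1.0
print(torch.version.cuda)         # expect 12.1
print(torch.cuda.is_available())  # expect True on a CUDA-capable machine
```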

## Quick Start

### Pretrained Models

Simply download the weights from Hugging Face:

| Model | Config | Pretraining Data | Path |
| --- | --- | --- | --- |
| M (160M) | txt2audio-cfm-cfg | AudioCaption | TBD |
| L (520M) | / | AudioCaption | TBD |
| XL (750M) | txt2audio-cfm-cfg-XL | AudioCaption | Here |
| XXL | txt2audio-cfm-cfg-XXL | AudioCaption | Here |
| M (160M) | txt2music-cfm-cfg | Music | Here |
| L (520M) | / | Music | TBD |
| XL (750M) | / | Music | TBD |
| 3B | / | Music | TBD |
| M (160M) | video2audio-cfm-cfg-moe | VGGSound | Here |
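
If you prefer scripting the download, here is a minimal sketch using `huggingface_hub`; the repo id below is a placeholder, so substitute the repository actually linked in the table:

```python
# Sketch: download checkpoints programmatically with huggingface_hub.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="ORG/Make-An-Audio-3",  # placeholder -- use the repo id from the table links
    local_dir="useful_ckpts",       # where the commands below expect checkpoints
)
print("checkpoints downloaded to", ckpt_dir)
```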

## Generate audio/music from text

```bash
python3 scripts/txt2audio_for_2cap_flow.py --prompt {TEXT} \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm-cfg.yaml \
  --scale 3.0 --vocoder-ckpt useful_ckpts/bigvnat
```
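
To generate several clips in one go, a small wrapper can invoke the CLI above once per prompt (a sketch; the prompts are only illustrative):

```python
# Sketch: batch text-to-audio generation by calling the CLI per prompt.
import subprocess

prompts = [
    "a dog barking in the distance",
    "gentle rain falling on a tin roof",
]
for prompt in prompts:
    subprocess.run(
        [
            "python3", "scripts/txt2audio_for_2cap_flow.py",
            "--prompt", prompt,
            "--outdir", "output_dir",
            "-r", "checkpoints_last.ckpt",
            "-b", "configs/txt2audio-cfm-cfg.yaml",
            "--scale", "3.0",
            "--vocoder-ckpt", "useful_ckpts/bigvnat",
        ],
        check=True,  # stop on the first failed generation
    )
```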

Add `--test-dataset structure` for text-to-audio generation.

## Generate audio/music from the AudioCaps or MusicCaps test set

- Remember to alter `config["test_dataset"]` (see the sketch after this command).

```bash
python3 scripts/txt2audio_for_2cap_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm-cfg.yaml \
  --scale 3.0 --vocoder-ckpt useful_ckpts/bigvnat --test-dataset testset
```
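
A minimal sketch for checking what `config["test_dataset"]` currently points at before running the command; the exact nesting of that key inside the YAML is an assumption, so consult the config file for the real layout:

```python
# Sketch: inspect the test_dataset entry of the generation config.
import yaml

with open("configs/txt2audio-cfm-cfg.yaml") as f:
    cfg = yaml.safe_load(f)

# The key location is an assumption -- adjust the lookup to match the file.
print(cfg.get("test_dataset"))
```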

## Generate audio from video

```bash
python3 scripts/video2audio_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/video2audio-cfm-cfg-moe.yaml \
  --scale 3.0 --vocoder-ckpt useful_ckpts/bigvnat --test-dataset vggsound
```

## Train flow-matching DiT

After training the VAE, replace `model.params.first_stage_config.params.ckpt_path` in the config file with the path to your trained VAE checkpoint (a sketch for doing this programmatically follows the command below). Then run the following command to train the diffusion model:

```bash
python main.py --base configs/txt2audio-cfm-cfg.yaml -t --gpus 0,1,2,3,4,5,6,7
```
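
A minimal sketch for pointing the config at your VAE checkpoint programmatically, assuming the key path named above and an `omegaconf` install (the checkpoint path is an example):

```python
# Sketch: set the first-stage (VAE) checkpoint path in the diffusion config.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/txt2audio-cfm-cfg.yaml")
# Example path -- replace with your trained VAE checkpoint.
cfg.model.params.first_stage_config.params.ckpt_path = "logs/vae_run/checkpoints/last.ckpt"
OmegaConf.save(cfg, "configs/txt2audio-cfm-cfg.yaml")
```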

## Others

For data preparation, variational autoencoder training, and evaluation, please refer to Make-An-Audio.

## Acknowledgements

This implementation uses parts of the code from the following GitHub repos: Make-An-Audio, AudioLCM, and CLAP, as described in our code.

## Citations

If you find this code useful in your research, please consider citing:

## Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this item may constitute a violation of copyright laws.