PyTorch Implementation of Lumina-t2x, Lumina-Next
We will release our implementation and pre-trained models as open source in this repository soon.
- June 2024: Make-An-Audio-3 (Lumina-Next) released on GitHub.
Note: You may want to adjust the CUDA version according to your driver version.
conda create -n Make_An_Audio_3 -y
conda activate Make_An_Audio_3
conda install python=3.11 pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Install [NVIDIA Apex](https://github.com/nvidia/apex) (optional)
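To confirm that the environment matches the versions pinned in the conda command above, a quick check along these lines may help. This is a minimal sketch: the helper name `check_pins` is hypothetical, and it assumes the conda packages register their metadata under the distribution names `torch`, `torchvision`, and `torchaudio`.

```python
from importlib import metadata

# Versions pinned by the conda install command above.
PINS = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}


def check_pins(pins, lookup=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed. Local build suffixes
    such as "+cu121" are ignored in the comparison.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Prints nothing when every pinned package matches.
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means the pinned versions are in place. It does not verify that the CUDA build matches your driver.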
Simply download the 500M weights from
Model | Config | Pretraining Data | Path |
---|---|---|---|
M (160M) | txt2audio-cfm-cfg | AudioCaption | TBD |
L (520M) | / | AudioCaption | [TBD] |
XL (750M) | txt2audio-cfm-cfg-XL | AudioCaption | Here |
XXL | txt2audio-cfm-cfg-XXL | AudioCaption | Here |
M (160M) | txt2music-cfm-cfg | Music | Here |
L (520M) | / | Music | [TBD] |
XL (750M) | / | Music | [TBD] |
3B | / | Music | [TBD] |
M (160M) | video2audio-cfm-cfg-moe | VGGSound | Here |
python3 scripts/txt2audio_for_2cap_flow.py --prompt {TEXT} \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm-cfg.yaml --scale 3.0 \
  --vocoder-ckpt useful_ckpts/bigvnat
Add --test-dataset structure for text-to-audio generation, and remember to alter config["test_dataset"].
python3 scripts/txt2audio_for_2cap_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm-cfg.yaml --scale 3.0 \
  --vocoder-ckpt useful_ckpts/bigvnat --test-dataset testset
python3 scripts/video2audio_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/video2audio-cfm-cfg-moe.yaml --scale 3.0 \
  --vocoder-ckpt useful_ckpts/bigvnat --test-dataset vggsound
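The inference commands above share most of their flags, so when scripting several runs it can be convenient to assemble them programmatically. The following is only a sketch: `build_inference_cmd` is a hypothetical helper, not part of the repository, and the example prompt is made up. The flags and default values are the ones shown in the commands above.

```python
import shlex


def build_inference_cmd(script, config, ckpt="checkpoints_last.ckpt",
                        outdir="output_dir", scale=3.0,
                        vocoder="useful_ckpts/bigvnat",
                        prompt=None, test_dataset=None):
    """Assemble the argument list for one of the inference scripts above."""
    cmd = ["python3", script]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    cmd += ["--outdir", outdir, "-r", ckpt, "-b", config,
            "--scale", str(scale), "--vocoder-ckpt", vocoder]
    if test_dataset is not None:
        cmd += ["--test-dataset", test_dataset]
    return cmd


cmd = build_inference_cmd(
    "scripts/txt2audio_for_2cap_flow.py",
    "configs/txt2audio-cfm-cfg.yaml",
    prompt="a dog barking in the rain",  # made-up example prompt
)
# shlex.join quotes the prompt so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the script without going through a shell, which avoids quoting problems with prompts that contain spaces or punctuation.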
After training the VAE, replace model.params.first_stage_config.params.ckpt_path in the config file with your trained VAE checkpoint path. Then run the following command to train the diffusion model:
python main.py --base configs/txt2audio-cfm-cfg.yaml -t --gpus 0,1,2,3,4,5,6,7
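The dotted key model.params.first_stage_config.params.ckpt_path mentioned above names a nested entry in the YAML config. If you prefer to patch it in code rather than by hand, the lookup can be sketched as follows. This is an illustration under assumptions: `set_dotted` is a hypothetical helper, the dictionary is a stand-in for the parsed config (loading and saving the YAML file is not shown), and the checkpoint path is a placeholder.

```python
def set_dotted(config, dotted_key, value):
    """Set config[a][b]...[z] = value for dotted_key "a.b...z".

    Intermediate dictionaries are created when they are missing.
    """
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Stand-in for the parsed YAML config; only the relevant branch is shown.
config = {
    "model": {"params": {"first_stage_config": {"params": {"ckpt_path": None}}}}
}

set_dotted(
    config,
    "model.params.first_stage_config.params.ckpt_path",
    "path/to/your_vae.ckpt",  # placeholder, replace with your checkpoint
)
```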
For data preparation, training the variational autoencoder, and evaluation, please refer to Make-An-Audio.
This implementation uses parts of the code from the following GitHub repos: Make-An-Audio, AudioLCM, CLAP, as described in our code.
If you find this code useful in your research, please consider citing:
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.