We added some extra descriptions on top of the official AudioLDM repository.
The code itself is not modified.
You can search for "2024" to find the added sections in this README.
Please also refer to the code in our baseline ipynb on Google Colab for the submission.
For other inquiries, please contact the authors of the code and paper (AudioLDM).
We sincerely thank the authors for sharing the official code and advancing academic research.
# Create conda environment
conda create -n audioldm_train python=3.10
conda activate audioldm_train
# Clone the repo
git clone https://github.com/haoheliu/AudioLDM-training-finetuning.git; cd AudioLDM-training-finetuning
# Install running environment
pip install poetry
poetry install
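As a quick sanity check (not part of the official setup), you can confirm that the installed environment is usable, e.g. that PyTorch imports and sees a GPU, by running a snippet like the following inside the environment (e.g. with poetry run python):

```python
# Quick sanity check (not part of the official setup): confirm that PyTorch
# is importable in the poetry environment and that a GPU is visible.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```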
- Download the checkpoints from Google Drive: link. The checkpoints include the pretrained VAE, AudioMAE, CLAP, 16kHz HiFiGAN, and 48kHz HiFiGAN.
- Uncompress the checkpoint tar file and place the content into data/checkpoints/
- Download the preprocessed AudioCaps from Google Drive: link
- Similarly, uncompress the dataset tar file and place the content into data/dataset
To double-check that the dataset and checkpoints are ready, run the following command:
python3 tests/validate_dataset_checkpoint.py
If the structure is incorrect or partially missing, you will see an error message.
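If you prefer a quick manual check before (or in addition to) the validation script, a minimal sketch like the following only tests that the default paths mentioned in this README exist; the official script above remains the authoritative check.

```python
# Minimal manual check of the default layout; tests/validate_dataset_checkpoint.py
# remains the authoritative validation.
from pathlib import Path

required = [
    "data/checkpoints",                          # uncompressed checkpoint files
    "data/dataset/metadata/dataset_root.json",   # dataset registry
    "data/dataset/metadata/audiocaps",           # AudioCaps metadata
]
for path in required:
    print(f"{path}: {'found' if Path(path).exists() else 'MISSING'}")
```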
Make your own config YAML file before training or finetuning.
You can change the model structure and training details there.
Refer to the provided example configs for the variants of AudioLDM.
The AudioLDM1 and AudioLDM2 repositories may also help.
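In practice, it is easiest to copy one of the provided configs and edit only the fields you need. The sketch below assumes PyYAML is available in the environment and only touches keys that appear in the dataset-configuration example later in this README; treat it as a starting point rather than the official workflow (the output filename is illustrative).

```python
# Sketch: derive your own config from a provided example, changing only the
# dataset-related keys shown later in this README. Assumes PyYAML is installed.
import yaml

src = "audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml"
with open(src) as f:
    cfg = yaml.safe_load(f)

cfg["data"]["train"] = ["audiocaps"]   # dataset(s) used for training
cfg["data"]["val"] = "audiocaps"       # dataset used for validation

with open("audioldm_train/config/my_experiment.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```

You can then pass the new YAML file to the -c flag of the training commands below.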
# Train the AudioLDM (latent diffusion part)
python3 audioldm_train/train/latent_diffusion.py -c audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml
# Train the VAE (Optional)
# python3 audioldm_train/train/autoencoder.py -c audioldm_train/config/2023_11_13_vae_autoencoder/16k_64.yaml
The program will perform generation on the evaluation set every 5 epochs of training. After obtaining the audio generation folders (named val_), you can proceed to the next step for model evaluation.
You can finetune from either of two pretrained checkpoints. First, download the one you want (e.g., using wget):
- Medium size AudioLDM: https://zenodo.org/records/7884686/files/audioldm-m-full.ckpt
- Small size AudioLDM: https://zenodo.org/records/7884686/files/audioldm-s-full
Place the downloaded checkpoint in the data/checkpoints folder.
Then perform finetuning with one of the following commands:
# Medium size AudioLDM
python3 audioldm_train/train/latent_diffusion.py -c audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original_medium.yaml --reload_from_ckpt data/checkpoints/audioldm-m-full.ckpt
# Small size AudioLDM
python3 audioldm_train/train/latent_diffusion.py -c audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml --reload_from_ckpt data/checkpoints/audioldm-s-full
You can specify your own dataset following the same format as the provided AudioCaps dataset.
Note that the pretrained AudioLDM checkpoints are released under the CC BY-NC 4.0 license, which does not allow commercial use.
Evaluation is performed automatically on each folder of generated audio.
# Evaluate all existing generated folder
python3 audioldm_train/eval.py --log_path all
# Evaluate only a specific experiment folder
python3 audioldm_train/eval.py --log_path <path-to-the-experiment-folder>
The evaluation result will be saved in a JSON file at the same level as the audio folder.
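If you want to collect the numbers programmatically, the sketch below simply locates and prints any JSON files under an experiment folder; the exact result filenames and schema are produced by audioldm_train/eval.py and are not assumed here.

```python
# Sketch: print every JSON file found under an experiment folder. The actual
# evaluation result filenames/schema come from audioldm_train/eval.py.
import json
from pathlib import Path

experiment_dir = Path("<path-to-the-experiment-folder>")  # same folder as passed to --log_path
for result_file in sorted(experiment_dir.rglob("*.json")):
    print(result_file)
    print(json.dumps(json.loads(result_file.read_text()), indent=2))
```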
Use the following syntax:
python3 audioldm_train/infer.py --config_yaml <The-path-to-the-same-config-file-you-use-for-training> --list_inference <the-filelist-you-want-to-generate>
For example:
# Please make sure you have trained the model using audioldm_crossattn_flant5.yaml
# The generated audio will be saved in the same log folder as the pretrained model.
python3 audioldm_train/infer.py --config_yaml audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_crossattn_flant5.yaml --list_inference tests/captionlist/inference_test.lst
The generated audio files will be named after the captions by default. If you would like to specify the filenames, please check the format of tests/captionlist/inference_test_with_filename.lst.
As the provided pretrained checkpoints (audioldm-m-full, audioldm-s-full) do not include some weights (e.g. cond_stage_model, first_stage_model, clap, etc.), loading them may cause errors.
(From the official repo) This repo only supports inference with models you have trained yourself. If you want to use the pretrained models directly, please use these two repos: AudioLDM and AudioLDM2.
Use the strict=False option when calling load_state_dict
to run inference with a pretrained model. Other than this, do not change the code, to prevent unexpected errors.
latent_diffusion.load_state_dict(checkpoint["state_dict"], strict=False)
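For context, a minimal sketch of what this non-strict load looks like in practice (assuming latent_diffusion is the already-instantiated model and the medium checkpoint from above is used):

```python
# Sketch of the non-strict load; `latent_diffusion` is assumed to be the
# already-instantiated model object used in the inference code.
import torch

checkpoint = torch.load("data/checkpoints/audioldm-m-full.ckpt", map_location="cpu")

# strict=False skips parameters that are absent from the checkpoint
# (e.g. cond_stage_model) instead of raising an error.
result = latent_diffusion.load_state_dict(checkpoint["state_dict"], strict=False)
print("missing keys:", len(result.missing_keys))
print("unexpected keys:", len(result.unexpected_keys))
```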
Then also specify the checkpoint path, together with the config YAML and the inference list. Make sure you have downloaded one of the pretrained checkpoints mentioned in the finetuning section above. For instance:
python3 audioldm_train/infer.py --config_yaml audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original_medium.yaml --list_inference tests/captionlist/inference_test.lst --reload_from_ckpt data/checkpoints/audioldm-m-full.ckpt
Super easy, simply follow these steps:
- Prepare the metadata with the same format as the provided AudioCaps dataset.
- Register the metadata of your dataset in data/dataset/metadata/dataset_root.json
- Use your dataset in the YAML file.
You do not need to resample or pre-segment the audio files. The dataloader will do most of the work.
Here we explain how to configure your custom dataset. The explanation uses AudioCaps, which is provided by default in this repo, as the example.
- In the config YAML, change the corresponding configurations. (Example from audioldm_original_medium.yaml; see audioldm_train/dataset_plugin.py for the available add_ons.)
metadata_root: "./data/dataset/metadata/dataset_root.json"
data: # data split and corresponding dataset names. should be mentioned as key in metadata_root json
  train: ["audiocaps"] # either string of name or list of names
  val: "audiocaps"
  test: "audiocaps"
  class_label_indices: "audioset_eval_subset"
  dataloader_add_ons: [] # can attach add_ons for preprocessing while data loading
variables:
  sampling_rate: &sampling_rate 16000
  mel_bins: &mel_bins 64
  # ... other variables
preprocessing:
  audio: # config for audio waveform
    sampling_rate: *sampling_rate
    max_wav_value: 32768.0
    duration: 10.24
  stft: # config for Short Time Fourier Transform
    filter_length: 1024
    hop_length: 160
    win_length: 1024
  mel: # config for mel spectrogram
    n_mel_channels: *mel_bins
    mel_fmin: 0
    mel_fmax: 8000
  # ... other configs
For the default (basic) config, see the get_basic_config function in AudioLDM2/audioldm2/utils.py.
- Example directory structure: after downloading the AudioCaps dataset to data/dataset, you will have the following folder structure. The explanation below is written based on this structure.
.
└── data/
    └── dataset/
        ├── audioset/   # dir for audio files
        │   └── ...
        └── metadata    # dir for metadata
            ├── audiocaps   # dataset name
            │   ├── datafiles/
            │   │   ├── audiocaps_train_label.json
            │   │   └── ...
            │   ├── testset_subset/
            │   │   ├── audiocaps_test_nonrepeat_subset_0.json
            │   │   └── ...
            │   └── class_labels_indices.csv
            └── dataset_root.json
- Create a JSON file for config["metadata_root"] (e.g. dataset_root.json). (A combined sketch of this step and the next one appears after this list.)
{ "audiocaps": "./data/dataset/audioset", # directory to data files "comments":{ }, "metadata":{ # path to metadata for each data split "path": { "audiocaps":{ "train": "./data/dataset/metadata/audiocaps/datafiles/audiocaps_train_label.json", "test": "./data/dataset/metadata/audiocaps/testset_subset/audiocaps_test_nonrepeat_subset_0.json", "val": "./data/dataset/metadata/audiocaps/testset_subset/audiocaps_test_nonrepeat_subset_0.json", "class_label_indices": "./data/dataset/metadata/audiocaps/class_labels_indices.csv" } } } }
- Create a JSON file for the metadata of each data split mentioned above (e.g. audiocaps_train_label.json).
{ "data": [ { "wav": "zip_audios/unbalanced_train_segments/unbalanced_train_segments_part36/Yr1nicOVtvkQ.wav", # path to audio (wav) "seg_label": "/mnt/fast/nobackup/scratch4weeks/hl01486/datasets/audiocaps_segment_labels/averaged/Yr1nicOVtvkQ.npy", # path to segment label (seems deprecated. not necessary) "labels": "/m/02jz0l,/m/09x0r", # label mID "caption": "A woman talks nearby as water pours" # caption }, ... ] }
- Create a CSV for the mapping between class index, mid (Machine IDentifier), and display name (e.g. class_labels_indices.csv).
index,mid,display_name
0,/m/09x0r,"Speech"
1,/m/05zppz,"Male speech, man speaking"
...
- Preprocessing is done according to the preprocessing configuration. For example, the Dataset object reads the waveform from a random start offset if the audio is longer than the predefined duration (refer to the __getitem__ method of the AudioDataset class in dataset.py). A rough sketch of this cropping behaviour is shown after this list.
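Putting the JSON steps above together, here is a hedged sketch of registering a hypothetical custom dataset (called my_dataset below). All names and paths are illustrative; only the "wav" and "caption" fields are filled in (add "labels" if your setup needs class labels), and the sketch assumes the shipped dataset_root.json is plain JSON without the inline comments shown in the annotated example above.

```python
# Hedged sketch: register a hypothetical dataset "my_dataset". All names and
# paths here are illustrative, not part of the official repo.
import json
from pathlib import Path

audio_root = "./data/dataset/my_dataset_audio"              # where your wav files live
meta_dir = Path("./data/dataset/metadata/my_dataset/datafiles")
meta_dir.mkdir(parents=True, exist_ok=True)

# 1) Per-split metadata: one entry per wav file, following the AudioCaps
#    example above ("wav" appears to be relative to the registered audio root).
captions = {"example_0001.wav": "A dog barks twice in the distance"}   # illustrative
datafile = {"data": [{"wav": name, "caption": text} for name, text in captions.items()]}
(meta_dir / "train.json").write_text(json.dumps(datafile, indent=4))

# 2) Register the dataset and its split metadata in dataset_root.json.
root_path = Path("./data/dataset/metadata/dataset_root.json")
dataset_root = json.loads(root_path.read_text())
dataset_root["my_dataset"] = audio_root
dataset_root["metadata"]["path"]["my_dataset"] = {
    "train": str(meta_dir / "train.json"),
    "val": str(meta_dir / "train.json"),    # the same file is reused here purely for illustration
    "test": str(meta_dir / "train.json"),
}
root_path.write_text(json.dumps(dataset_root, indent=4))
```

After this, point data.train / data.val / data.test in your config YAML at "my_dataset".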
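The exact preprocessing lives in the repository's dataset.py; the following is only an illustrative sketch of the random-start cropping described in the last step, using the 16 kHz sampling rate and 10.24 s duration from the example config (torchaudio is assumed to be available):

```python
# Illustrative sketch only (not the repository's actual dataset.py code):
# read a waveform and, if it is longer than the configured duration,
# take a random window of that duration.
import random
import torchaudio

def load_random_segment(path, target_sr=16000, duration=10.24):
    waveform, sr = torchaudio.load(path)   # shape: (channels, samples)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    target_len = int(target_sr * duration)
    if waveform.shape[-1] > target_len:
        start = random.randint(0, waveform.shape[-1] - target_len)
        waveform = waveform[..., start:start + target_len]
    return waveform
```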
If you find this tool useful, please consider citing:
@article{liu2023audioldm,
title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={Proceedings of the International Conference on Machine Learning},
year={2023}
}
@article{liu2023audioldm2,
title={{AudioLDM 2}: Learning Holistic Audio Generation with Self-supervised Pretraining},
author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
journal={arXiv preprint arXiv:2308.05734},
year={2023}
}
We greatly appreciate the open-sourcing of the following code bases. Open-source code bases are the real-world infinity stones 💎!
- https://github.com/CompVis/stable-diffusion
- https://github.com/LAION-AI/CLAP
- https://github.com/jik876/hifi-gan
This research was partly supported by the British Broadcasting Corporation Research and Development, Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), Faculty of Engineering and Physical Science (FEPS), University of Surrey. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising. We would like to thank Tang Li, Ke Chen, Yusong Wu, Zehua Chen and Jinhua Liang for their support and discussions.