Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion [ICML 2024]

Project page | Arxiv | Text-Based Space

This repository contains the official code release for Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion.

Change Log
Requirements
Usage Example
Evaluation
MedleyMDPrompts
Citation
Acknowledgements

Change Log

2024-10-12: Added support for text-based editing using Stable Audio Open 1.0!

Stable Audio Open was not trained only on music, so results might vary. You can change the loaded checkpoint to a model fine-tuned on music in models.py:StableAudWrapper.
You need to accept the model's license, then insert your Hugging Face token in main_run.py:HF_TOKEN.
Needs Diffusers >= 0.30.
We recommend using it with a cfg_src of 1.

2024-09-09: Added a wrapper for an unconditional LDM trained on face images (CelebAHQ), relevant for unsupervised editing. Additionally, moved to PyTorch >= 2.2 and Diffusers >= 0.26 to address security concerns. The version tested in the paper is still available in the paper_code branch.

Requirements

python -m pip install -r requirements.txt
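After installing, the version floors mentioned in the Change Log (PyTorch >= 2.2, Diffusers >= 0.26, and Diffusers >= 0.30 for Stable Audio Open) can be checked from Python. The following is a minimal sketch and not part of this repository: the helper names `version_tuple` and `check_installed` are hypothetical, and the comparison looks only at the major and minor version numbers.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Version floors as stated in the Change Log.
MINIMUMS = {"torch": (2, 2), "diffusers": (0, 26)}
STABLE_AUDIO_MIN_DIFFUSERS = (0, 30)  # only needed for Stable Audio Open


def version_tuple(text):
    """Leading (major, minor) of a version string such as '2.2.0+cu121'."""
    nums = [int(n) for n in re.findall(r"\d+", text)[:2]]
    nums += [0] * (2 - len(nums))  # pad strings like '3' to (3, 0)
    return tuple(nums)


def check_installed(minimums=MINIMUMS):
    """Return {package: problem description} for every unmet floor."""
    problems = {}
    for name, floor in minimums.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if version_tuple(found) < floor:
            problems[name] = f"found {found}, need >= {floor[0]}.{floor[1]}"
    return problems


if __name__ == "__main__":
    # An empty dict means both floors are met.
    print(check_installed())
```

To check the Stable Audio Open requirement instead, pass `{"diffusers": STABLE_AUDIO_MIN_DIFFUSERS}` to `check_installed`.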

Usage Example

Supported models are AudioLDM, TANGO, and AudioLDM2, plus Stable Audio Open 1.0 for text-based editing (see the Change Log). For unsupervised editing, Stable Diffusion is also supported.

Text-Based Editing

CUDA_VISIBLE_DEVICES=<gpu_num> python main_run.py --cfg_tar <target_cfg_strength> --cfg_src <source_cfg_strength> --init_aud <input_audio_path> --target_prompt <description of the wanted edited signal> --tstart <edit from timestep> --model_id <model_name> --results_path <path to dump results>

You can supply a source prompt that describes the original audio by using --source_prompt.
tstart is set to 100 by default, which is the configuration used in the user study. The quantitative results in the paper include values ranging from 40 to 100.

Use python main_run.py --help for all options.

Use --mode ddim to run DDIM inversion and editing. Note that for plain DDIM inversion, --tstart must equal num_diffusion_steps (200 by default).
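When scripting several edits, the invocation above can be assembled programmatically. The following is a minimal sketch and not part of this repository: the helper `build_edit_command` and its parameter names are hypothetical, and it uses only the flags shown in this section. It builds the argument list without running anything. In ddim mode it defaults tstart to the number of diffusion steps, so that the run is plain DDIM inversion; the step count is used here only to pick that default and is not passed on to the script.

```python
import sys


def build_edit_command(init_aud, target_prompt, model_id, results_path,
                       cfg_tar, cfg_src, tstart=None, source_prompt=None,
                       mode=None, num_diffusion_steps=200):
    """Return an argv list for main_run.py (hypothetical helper)."""
    if tstart is None:
        # README default is 100; plain DDIM inversion needs tstart equal
        # to the number of diffusion steps (200 by default).
        tstart = num_diffusion_steps if mode == "ddim" else 100
    cmd = [
        sys.executable, "main_run.py",
        "--cfg_tar", str(cfg_tar),
        "--cfg_src", str(cfg_src),
        "--init_aud", str(init_aud),
        "--target_prompt", target_prompt,
        "--tstart", str(tstart),
        "--model_id", model_id,
        "--results_path", str(results_path),
    ]
    if source_prompt is not None:
        cmd += ["--source_prompt", source_prompt]
    if mode is not None:
        cmd += ["--mode", mode]
    return cmd


# The cfg values and file names below are arbitrary illustrations,
# not recommendations; "<model_name>" is left as a placeholder.
cmd = build_edit_command("input.wav", "a recording of a solo piano",
                         "<model_name>", "results", cfg_tar=12, cfg_src=3)
ddim_cmd = build_edit_command("input.wav", "a recording of a solo piano",
                              "<model_name>", "results", cfg_tar=12,
                              cfg_src=3, mode="ddim")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, with `CUDA_VISIBLE_DEVICES` set in the environment to select the GPU.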

Unsupervised Editing

First, extract the principal components (PCs) for the desired timesteps:

CUDA_VISIBLE_DEVICES=<gpu_num> python main_pc_extract_inv.py --init_aud <input_audio_path> --model_id <model_name> --results_path <path to dump results> --drift_start <start extraction timestep> --drift_end <end extraction timestep> --n_evs <amount of evs to extract>

You can supply a source prompt that describes the original audio by using --source_prompt.

Then apply the PCs:

CUDA_VISIBLE_DEVICES=<gpu_num> python main_pc_apply_drift.py --extraction_path <path to extracted .pt file> --drift_start <timestep to start apply> --drift_end <timestep to end apply> --amount <edit strength> --evs <ev nums to apply>

Use --use_specific_ts_pc <timestep num> to choose a $t$ that differs from $t'$.
Add --combine_evs to apply all the given PCs together.
Setting --evals_pt to an empty value makes the script try to load the eigenvalues from the extraction path; this works only if the applied timesteps were also run during extraction.

Use python main_pc_extract_inv.py --help and python main_pc_apply_drift.py --help for all options.

To recreate the random vectors baseline, use --rand_v. Image samples can be recreated using images_pc_extract_inv.py and images_pc_apply_drift.py.
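The two stages above (extract, then apply) can also be scripted. The following is a minimal sketch and not part of this repository: the helpers `build_extract_command` and `build_apply_command` are hypothetical, they use only the flags shown in this section, and they assume that --evs accepts several space-separated values. The path of the extracted .pt file is decided by the extraction script, so it is passed in by the caller rather than guessed.

```python
import sys


def build_extract_command(init_aud, model_id, results_path,
                          drift_start, drift_end, n_evs,
                          source_prompt=None):
    """Return an argv list for main_pc_extract_inv.py (hypothetical helper)."""
    cmd = [
        sys.executable, "main_pc_extract_inv.py",
        "--init_aud", str(init_aud),
        "--model_id", model_id,
        "--results_path", str(results_path),
        "--drift_start", str(drift_start),
        "--drift_end", str(drift_end),
        "--n_evs", str(n_evs),
    ]
    if source_prompt is not None:
        cmd += ["--source_prompt", source_prompt]
    return cmd


def build_apply_command(extraction_path, drift_start, drift_end, amount,
                        evs, combine_evs=False, use_specific_ts_pc=None):
    """Return an argv list for main_pc_apply_drift.py (hypothetical helper)."""
    cmd = [
        sys.executable, "main_pc_apply_drift.py",
        "--extraction_path", str(extraction_path),
        "--drift_start", str(drift_start),
        "--drift_end", str(drift_end),
        "--amount", str(amount),
        # Assumption: several PC numbers are given as separate arguments.
        "--evs", *[str(ev) for ev in evs],
    ]
    if use_specific_ts_pc is not None:
        cmd += ["--use_specific_ts_pc", str(use_specific_ts_pc)]
    if combine_evs:
        cmd.append("--combine_evs")
    return cmd


# Timesteps, amount and PC numbers are arbitrary illustrations.
extract_cmd = build_extract_command("input.wav", "<model_name>", "results",
                                    drift_start=80, drift_end=50, n_evs=3)
apply_cmd = build_apply_command("<path to extracted .pt file>", 80, 50,
                                amount=40, evs=[1, 2], combine_evs=True)
```

Run the extraction command first, locate the .pt file it writes, and only then build and run the apply command with that path.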

SDEdit

SDEdit can be run similarly with:

CUDA_VISIBLE_DEVICES=<gpu_num> python main_run_sdedit.py --cfg_tar <target_cfg_strength> --init_aud <input_audio_path> --target_prompt <description of the wanted edited signal> --tstart <edit from timestep> --model_id <model_name> --results_path <path to dump results>

tstart is set to 100 by default, which is the configuration used in the user study. The quantitative results in the paper include values ranging from 40 to 100.
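To compare several tstart values, one SDEdit run per value can be generated in a loop. The following is a minimal sketch and not part of this repository: the helper `sdedit_commands` is hypothetical, and the step of 20 between values is an arbitrary choice for illustration, since only the 40 to 100 range is stated here. Every run receives the same results path; whether outputs from different runs are kept apart depends on the script and is not checked by this sketch.

```python
import sys


def sdedit_commands(init_aud, target_prompt, model_id, results_path,
                    cfg_tar, tstarts=range(40, 101, 20)):
    """Return one argv list for main_run_sdedit.py per tstart value."""
    commands = []
    for tstart in tstarts:
        commands.append([
            sys.executable, "main_run_sdedit.py",
            "--cfg_tar", str(cfg_tar),
            "--init_aud", str(init_aud),
            "--target_prompt", target_prompt,
            "--tstart", str(tstart),
            "--model_id", model_id,
            "--results_path", str(results_path),
        ])
    return commands


# The cfg value and file names are arbitrary illustrations.
sweep = sdedit_commands("input.wav", "a recording of a solo piano",
                        "<model_name>", "results", cfg_tar=12)
swept_tstarts = [c[c.index("--tstart") + 1] for c in sweep]
```

Each entry of `sweep` can be passed to `subprocess.run(..., check=True)` in turn.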

Use python main_run_sdedit.py --help for all options.

Image samples can be recreated using images_run_sdedit.py.

Evaluation

We provide the code we used to run the LPAPS-, CLAP-, and FAD-based evaluations. The code is adapted from multiple repos:

FAD is from microsoft/fadtk.
LPAPS is adapted from richzhang/PerceptualSimilarity.
CLAP is adapted from facebookresearch/audiocraft.

The full code is written for our directory structure and is provided as a usage example.

MedleyMDPrompts

The MedleyMDPrompts dataset contains manually labeled prompts for the MusicDelta subset of the MedleyDB dataset (Bittner et al., 2014). The MusicDelta subset comprises 34 musical excerpts in varying styles, with lengths ranging from 20 seconds to 5 minutes.
This prompts dataset includes 3-4 source prompts for each signal, and 3-12 editing target prompts for each of the source prompts, totalling 107 source prompts and 696 target prompts.
In captions_targets.csv, the column can_be_used_without_source indicates whether a target prompt was designed to stand alone rather than to complement a source prompt, and should therefore provide enough information to edit a signal on its own. This is only a guideline; you might find that for your application all target prompts are sufficient on their own.
The source_caption_index column gives the index (in order, starting from 1) of the source prompt, for the same signal, that this target prompt relates to. It can be used together with can_be_used_without_source.
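The two columns described above can be read with Python's csv module. The following is a minimal sketch under stated assumptions: only the column names can_be_used_without_source and source_caption_index come from this README. The other column names in the sample (`audio`, `caption`), the sample rows themselves, and the encoding of the flag as True/False text are made up for illustration and may not match the actual file.

```python
import csv
import io

# Assumption: the flag is stored as text such as "True" or "False".
TRUE_STRINGS = {"true", "1", "yes"}


def load_targets(csv_file):
    """Parse target-prompt rows from an open captions_targets.csv file."""
    rows = []
    for row in csv.DictReader(csv_file):
        row = dict(row)
        flag = row["can_be_used_without_source"].strip().lower()
        row["can_be_used_without_source"] = flag in TRUE_STRINGS
        # source_caption_index starts from 1; keep a 0-based copy for
        # indexing into a Python list of source prompts.
        row["source_idx0"] = int(row["source_caption_index"]) - 1
        rows.append(row)
    return rows


def standalone_targets(rows):
    """Keep only target prompts marked as usable without a source prompt."""
    return [r for r in rows if r["can_be_used_without_source"]]


# Made-up sample standing in for the real file.
SAMPLE = """audio,caption,source_caption_index,can_be_used_without_source
track_a,A jazz trio with a saxophone solo,1,True
track_a,Played on a violin instead,2,False
"""

rows = load_targets(io.StringIO(SAMPLE))
standalone = standalone_targets(rows)
```

For the real file, replace `io.StringIO(SAMPLE)` with `open("captions_targets.csv", newline="")` and adjust the assumed column names and flag encoding to what the file contains.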

Citation

If you use this code or the MedleyMDPrompts dataset for your research, please cite our paper:

@inproceedings{manor2024zeroshot,
  title =     {Zero-Shot Unsupervised and Text-Based Audio Editing Using {DDPM} Inversion},
  author =    {Manor, Hila and Michaeli, Tomer},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages =     {34603--34629},
  year =      {2024},
  editor =    {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume =    {235},
  series =    {Proceedings of Machine Learning Research},
  month =     {21--27 Jul},
  publisher = {PMLR},
  url =       {https://proceedings.mlr.press/v235/manor24a.html},
}

Acknowledgements

Parts of this code are heavily based on DDPM Inversion and on Gaussian Denoising Posterior.

AudioLDM2 is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Therefore, using the weights of AudioLDM2 (the default) and code originating in the code/audioldm folder is under the same license.
The weights of StableAudioOpen are licensed under Stability AI's Community License.
The rest of the code (inversion, PCs computation) is licensed under an MIT license.

The evaluation code adapts code from differently licensed repos:

FAD is from microsoft/fadtk, under MIT License.
LPAPS is adapted from richzhang/PerceptualSimilarity, under BSD-2-Clause License.
CLAP's weights are from LAION-AI/CLAP, under CC0-1.0 License.
CLAP's processing code is adapted from facebookresearch/audiocraft, under MIT License.

Our MedleyMDPrompts dataset is licensed under CC-BY-4.0 License.
