Text-to-Audio/AudioLCM

Follow up on "similarities with and distinctions from the fellow work ConsistencyTTA"


We thank the authors for clarifying the similarities to and distinctions from the closely related work ConsistencyTTA in Discussion #2.

Originally posted by Bai-YT June 6, 2024
Thank you for the awesome work! Accelerating text-to-audio generation is an important goal, and AudioLCM's contributions to this area are greatly appreciated.

We would like to bring to your attention our paper from September 2023, titled ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation, which explored a similar idea. ConsistencyTTA's code and model checkpoints are available here and here.

After a discussion with @liuhuadai, we agree that while ConsistencyTTA and AudioLCM share numerous similarities, they also have distinct differences.

The main similarities include:

  • Latent-space consistency model and its general single-stage distillation and inference procedures (Section 3.2 of ConsistencyTTA and Section 3.5 of AudioLCM).
  • Guided Distillation (Section 3.3 of ConsistencyTTA and Section 3.3 of AudioLCM).
  • The use of AudioCaps as the evaluation benchmark for text-to-audio generation, and the capability of fast, high-quality synthesis: both methods achieve hundreds-fold acceleration over their diffusion baselines.
  • A much coarser discretization of the diffusion trajectory during consistency distillation than during the training of the teacher diffusion model (Section 3.2 of ConsistencyTTA and Section 3.4 of AudioLCM); a sketch of this shared idea follows the list.
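To make this shared coarse-discretization idea concrete, here is a minimal sketch (not taken from either codebase; the step counts and variable names are illustrative assumptions):

```python
import numpy as np

# The teacher diffusion model is trained on a fine timestep grid (commonly
# 1000 steps); consistency distillation operates on a much coarser subset.
teacher_train_steps = 1000   # illustrative value
num_distill_steps = 50       # illustrative coarse-grid size

# Evenly spaced coarse timesteps, ordered from high noise to low noise.
coarse_grid = np.linspace(
    teacher_train_steps - 1, 0, num_distill_steps
).round().astype(int)

# Each distillation iteration picks adjacent coarse points (t_n, t_m) and
# enforces consistency between the student's predictions at the two points,
# after the teacher/ODE solver moves the latent from t_n toward t_m.
```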

The main differences include:

  • ConsistencyTTA additionally proposes to further fine-tune the consistency model by directly optimizing the CLAP score.
  • AudioLCM additionally considers text-to-music generation.
  • ConsistencyTTA emphasizes single-step generation, whereas AudioLCM emphasizes the few-step regime. In particular, ConsistencyTTA’s single-step performance ($\text{FAD} = 2.4$, Table 1) seems stronger than AudioLCM’s single-step ($\text{FAD} \approx 4$, Table 2b), but weaker than AudioLCM’s two-step generation ($\text{FAD} = 1.67$, Table 1).
  • ConsistencyTTA uses TANGO as the diffusion teacher model, whereas AudioLCM uses Make-An-Audio 2. As a result, the model architecture is also different -- ConsistencyTTA uses a UNet whereas AudioLCM uses an improved diffusion transformer.
  • ConsistencyTTA uses a single solver step to "jump" between the coarse discretization steps, whereas AudioLCM further divides these coarse intervals and performs multi-step ODE solving to "walk" between them. Intuitively, AudioLCM’s approach incurs a smaller solver error (assuming the same solver), but requires more teacher queries per training iteration; the sketch below contrasts the two strategies.
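To illustrate the "jump" versus "walk" distinction, here is a minimal sketch; the `ddim_step` helper, its signature, and all variable names are assumptions for illustration, not code from either repository:

```python
def jump(ddim_step, teacher, x_t, t_n, t_m, cond):
    """One solver step spanning the whole coarse interval t_n -> t_m
    (the single-step "jump" described above): one teacher query."""
    return ddim_step(teacher, x_t, t_n, t_m, cond)


def walk(ddim_step, teacher, x_t, t_n, t_m, cond, num_substeps):
    """Subdivide the coarse interval and solve it in several smaller steps
    (the multi-step "walk" described above): smaller per-interval solver
    error, but num_substeps teacher queries per training iteration."""
    x = x_t
    for i in range(num_substeps):
        t_from = t_n + (t_m - t_n) * i / num_substeps
        t_to = t_n + (t_m - t_n) * (i + 1) / num_substeps
        x = ddim_step(teacher, x, t_from, t_to, cond)
    return x
```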

We therefore believe that AudioLCM is a valuable complement to ConsistencyTTA, providing important insights into consistency-model-powered text-to-audio generation. Shout out to @liuhuadai for the constructive discussion. The AudioLCM paper will be revised shortly to include this comparison.

In this discussion, the authors of AudioLCM clarified that one of the major contributions of AudioLCM is to use a multi-step ODE solver to accelerate the distillation process. When I read the core distillation code of AudioLCM, it seems to me that there is only one query to the teacher model per distillation iteration. Could you please clarify how the multi-step ODE solving is implemented? Thank you!

Hello, our paper has more details about the multi-step ODE solver that should address your concerns. You can change the num_ddim_steps parameter in the config file to set the skip step you want.

@liuhuadai Could you please link to the num_ddim_steps parameter? I did not find it in the config files. Thank you.

You can find it in audiolcm.yaml at line 20.

So the parameter is num_ddim_timesteps, not num_ddim_steps.

Doesn't this parameter adjust the total number of steps in the DDIM solver?
The DDIM solver itself only solves a single step with the teacher model at each distillation iteration, no?

Our DDIM solver performs an estimate from $t_n \to t_m$, where $m = n - k$ and $k$ is the multi-step (skip) value. Please refer to the repo https://github.com/luosiallen/latent-consistency-model for more details.
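For concreteness, a skipping-step DDIM estimate from $t_n$ to $t_m = t_n - k$ could look like the following sketch, assuming a standard deterministic (eta = 0) DDIM update and an epsilon-prediction teacher; `teacher`, `cond`, and `alphas_cumprod` are assumed names, not taken from the AudioLCM or LCM code:

```python
import torch

def ddim_skip_estimate(teacher, x_t, t_n, k, alphas_cumprod, cond):
    """One DDIM (eta = 0) update jumping from t_n to t_m = t_n - k.
    The teacher noise predictor is queried once per update; k controls
    how far along the trajectory the single update jumps."""
    t_m = max(t_n - k, 0)
    eps = teacher(x_t, t_n, cond)          # single teacher query
    a_n = alphas_cumprod[t_n]
    a_m = alphas_cumprod[t_m]
    x0_pred = (x_t - torch.sqrt(1 - a_n) * eps) / torch.sqrt(a_n)
    return torch.sqrt(a_m) * x0_pred + torch.sqrt(1 - a_m) * eps
```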

It looks like the linked repository also only queries the teacher model a single time during distillation, i.e., $k=1$, which makes it identical to ConsistencyTTA in this aspect.
Source: Lines 1172-1209 of this file.