Similarities and distinctions from fellow work "ConsistencyTTA"
Bai-YT opened this issue · 0 comments
Thank you for the awesome work! Accelerating text-to-audio generation is an important goal, and AudioLCM's contributions to this area are greatly appreciated.
We would like to bring to your attention our paper from September 2023, titled ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation, that explored a similar idea. ConsistencyTTA's code and model checkpoints are available here and here.
After a discussion with @liuhuadai, we agree that while ConsistencyTTA and AudioLCM share numerous similarities, they also differ in several respects.
The main similarities include:
- Latent-space consistency model and its general single-stage distillation and inference procedures (Section 3.2 of ConsistencyTTA and Section 3.5 of AudioLCM).
- Guided Distillation (Section 3.3 of ConsistencyTTA and Section 3.3 of AudioLCM).
- The use of AudioCaps as an evaluation benchmark for the text-to-audio application and the capability of fast, high-quality generation. Both methods achieve hundreds-fold acceleration over diffusion baselines.
- A much coarser discretization scheme for the diffusion trajectory during consistency distillation than the one used to train the teacher diffusion model (Section 3.2 of ConsistencyTTA and Section 3.4 of AudioLCM).
The main differences include:
- ConsistencyTTA additionally proposes to further fine-tune the consistency model by directly optimizing the CLAP score.
- AudioLCM additionally considers text-to-music generation.
- ConsistencyTTA emphasizes single-step generation, whereas AudioLCM emphasizes the few-step regime. In particular, ConsistencyTTA’s single-step performance ($FAD=2.4$, Table 1) seems stronger than AudioLCM’s single-step ($FAD\approx 4$, Table 2b), but weaker than AudioLCM’s two-step generation ($FAD=1.67$, Table 1).
- ConsistencyTTA uses TANGO as the diffusion teacher model, whereas AudioLCM uses Make-An-Audio 2. As a result, the model architecture is also different: ConsistencyTTA uses a UNet whereas AudioLCM uses an improved diffusion transformer.
- ConsistencyTTA uses a single solver step to "jump" between the coarse discretization steps, whereas AudioLCM further divides these coarse intervals and performs multi-step ODE solving to "walk" between them. Intuitively, AudioLCM’s approach incurs a smaller solver error (assuming the same solver is used), but requires more teacher queries per training iteration.
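
To make the "jump" versus "walk" trade-off in the last point concrete, here is a minimal toy sketch. It is not taken from either codebase: `teacher_drift` is a hypothetical stand-in for the teacher model (a simple linear ODE, dx/dt = -x, chosen because its exact solution is known), the solver is plain Euler, and the interval and the sub-step count of 4 are arbitrary illustrative choices.

```python
import math

def teacher_drift(x, t):
    # Hypothetical stand-in for one teacher query; toy ODE dx/dt = -x.
    return -x

def solve_interval(x, t_start, t_end, num_substeps, counter):
    """Euler-integrate across one coarse interval, counting teacher queries."""
    h = (t_end - t_start) / num_substeps
    t = t_start
    for _ in range(num_substeps):
        x = x + h * teacher_drift(x, t)  # one Euler step = one teacher query
        t += h
        counter["queries"] += 1
    return x

x0, t_start, t_end = 1.0, 0.0, 0.5
exact = x0 * math.exp(-(t_end - t_start))  # closed-form solution of the toy ODE

# "Jump": a single solver step across the whole coarse interval.
jump_counter = {"queries": 0}
jump = solve_interval(x0, t_start, t_end, 1, jump_counter)

# "Walk": the same interval split into 4 sub-steps.
walk_counter = {"queries": 0}
walk = solve_interval(x0, t_start, t_end, 4, walk_counter)

jump_error = abs(jump - exact)
walk_error = abs(walk - exact)
print(f"jump: error={jump_error:.4f}, queries={jump_counter['queries']}")
print(f"walk: error={walk_error:.4f}, queries={walk_counter['queries']}")
```

In this toy setting the "walk" variant lands closer to the exact solution while issuing four times as many queries, which mirrors the intuition above. How the gap plays out for the actual teacher models and solvers is not something this sketch can show.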
We therefore believe that AudioLCM is a valuable complement to ConsistencyTTA, providing important insights into consistency-model-based text-to-audio generation. Shout out to @liuhuadai for the constructive discussion. The AudioLCM paper will be revised shortly to include this comparison.