janhq/ichigo

research: Flow Matching for synthetic data generation

tikikun opened this issue · 6 comments

Overall

We can significantly improve the quality of synthetic multi-modal datasets by using Flow Matching with Optimal Transport.

Context

Currently we use the autoregressive model from WhisperSpeech, specifically the T2S model, to generate the synthetic dataset.

Theoretical Details

The T2S model (Text-to-Semantics) predicts sound tokens from text tokens to generate synthetic data. This problem can be framed as:

"Transforming a distribution of text embeddings into synthetic sound token embeddings."

Alternatively, it can be stated as:

We address the task of sequence-to-sequence embedding generation. Given a source sequence:

$$ w_x = \{w_x^1, \dots, w_x^M\}, \quad \text{of length } M $$

we aim to develop a generative model that produces a target sequence:

$$ w_y = \{w_y^1, \dots, w_y^N\}, \quad \text{of length } N $$

conditioned on the source sequence.

Empirical results, such as those in the F5-TTS paper, demonstrate that flow matching models solve this problem efficiently, with high accuracy and low resource requirements. This approach also avoids the inherent issues of autoregressive models when generating synthetic data, such as error accumulation over long sequences and slow token-by-token sampling.
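As a rough illustration of how simple the training objective is, here is a minimal sketch of one conditional flow matching step with the linear (optimal-transport) probability path, in PyTorch. The `velocity_net` module, the embedding shapes, and the `sigma_min` value are assumptions for illustration, not code from our stack:

```python
import torch
import torch.nn.functional as F

def cfm_training_step(velocity_net, text_emb, target_emb, sigma_min=1e-4):
    """One conditional flow matching step with the linear (OT) path.

    text_emb:   (batch, M, d) condition, i.e. source text embeddings (assumed shape)
    target_emb: (batch, N, d) data sample x_1, i.e. sound-token embeddings (assumed shape)
    velocity_net(x_t, t, cond) -> predicted velocity of shape (batch, N, d) (hypothetical module)
    """
    batch = target_emb.shape[0]
    x1 = target_emb
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(batch, 1, 1, device=x1.device)   # time in [0, 1]

    # linear interpolation between noise and data (OT displacement path)
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # target velocity field for this path
    u_t = x1 - (1 - sigma_min) * x0

    pred = velocity_net(x_t, t.view(batch), text_emb)
    return F.mse_loss(pred, u_t)
```

Sampling then amounts to integrating the learned velocity field from noise to data over a handful of ODE steps, which is where the low inference cost comes from.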

There is a possibility that we can produce novel results with this approach and significantly improve Ichigo's performance.

Next Steps

  • Adapt our dataset to a flow matching framework.
  • Develop a flow matching framework for T2S tasks.

You can use a continuous flow matching model to train what is essentially a text-based autoencoder.
The specific architecture should probably be conditional flow matching, with the text as the condition.
The length of the generation can be set with something as simple as a words-per-second heuristic.
The decoder will be the frozen Whisper decoder.
The goal is self-supervised text-to-text roundtrip through the CFM model and the decoder.
No guarantee that this will work at all but it would be damn interesting if it works.
I think it has a chance of working because we're distilling the information from the whisper decoder, which is a strong model.
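A minimal sketch of the roundtrip described above, assuming a hypothetical CFM model with a `sample(cond, length)` method (ODE integration from noise), a hypothetical `text_encoder` for the conditioning, and a `whisper_decoder_loss` wrapper that returns the frozen Whisper decoder's cross-entropy for reconstructing the original text; none of these names come from our codebase:

```python
import torch

WORDS_PER_SECOND = 2.5    # rough speaking-rate heuristic (assumption)
TOKENS_PER_SECOND = 25    # semantic-token rate; depends on the tokenizer (assumption)

def target_length(text: str) -> int:
    """Words-per-second heuristic for how many semantic embeddings to generate."""
    seconds = len(text.split()) / WORDS_PER_SECOND
    return max(1, int(seconds * TOKENS_PER_SECOND))

def roundtrip_loss(cfm, text_encoder, whisper_decoder_loss, text: str, text_ids: torch.Tensor):
    """Self-supervised text -> semantic embeddings -> text roundtrip.

    cfm.sample(cond, length):        hypothetical CFM sampler
    text_encoder(text):              hypothetical conditioning encoder
    whisper_decoder_loss(emb, ids):  cross-entropy of the frozen Whisper decoder
                                     reconstructing `ids` from the generated embeddings
    """
    cond = text_encoder(text)                        # condition on the source text
    n = target_length(text)                          # length from the heuristic above
    semantic_emb = cfm.sample(cond=cond, length=n)   # (1, n, d) generated sound-token embeddings
    # only the CFM is trainable; the Whisper decoder stays frozen
    return whisper_decoder_loss(semantic_emb, text_ids)
```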


If it works, it means we will be able to train the T2S model on all the languages supported by Whisper without needing any audio data; all we need is some multilingual text data.

Also, check out this repo

https://github.com/lucidrains/voicebox-pytorch

It has a good implementation of the CFM model that is relatively easy to read. I've used it before in my work.

It also has links to SPEAR-TTS, the precursor to WhisperSpeech; E2-TTS and later F5-TTS may have built on top of that line of work.

I think we might be able to frame this as a generative adversarial network: the CFM is the generator, and the discriminator is the Whisper decoder.
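Since the Whisper decoder is frozen, one way to make this framing concrete (my formulation, not from the thread) is a distillation-style objective: train the CFM generator $G_\theta$ to maximize the frozen decoder's likelihood of reconstructing the source text,

$$ \min_\theta \; \mathbb{E}_{w_x} \left[ -\log p_{\text{Whisper}}\big(w_x \mid G_\theta(w_x)\big) \right] $$

where $G_\theta(w_x)$ are the generated semantic embeddings and $p_{\text{Whisper}}$ is the fixed decoder's likelihood.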