research: Flow Matching for synthetic data generation
tikikun opened this issue · 6 comments
Overall
We can significantly improve the quality of synthetic multi-modal datasets by using Flow Matching with Optimal Transport.
Context
Currently we use the autoregressive model from WhisperSpeech, specifically the T2S model, to generate our synthetic dataset.
Theoretical Details
The T2S model (Text-to-Semantics) predicts sound tokens from text tokens to generate synthetic data. This problem can be framed as:
"Transforming a distribution of text embeddings into synthetic sound token embeddings."
Alternatively, it can be stated as:
We address sequence-to-sequence embedding generation tasks. Given a source sequence, we aim to develop a generative model that produces a target sequence conditioned on the source sequence.
Empirical results, such as those in the F5-TTS paper, demonstrate that flow matching models efficiently solve this problem with high accuracy and low resource requirements. This approach also avoids the inherent issues of autoregressive models when generating synthetic data.
This approach might produce novel results and significantly improve Ichigo's performance.
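To make the flow matching idea concrete, here is a minimal sketch of the optimal-transport conditional path and its regression target, plus a plain Euler sampler. It uses NumPy only to stay self-contained. The function names, the `velocity_fn` callable (a stand-in for whatever T2S network we end up training), and the `cond` argument (the text conditioning) are all assumptions for illustration, not existing code in our repo.

```python
import numpy as np

def ot_cfm_training_pair(x0, x1, t, sigma_min=1e-4):
    """Build (x_t, u_t) for the optimal-transport conditional path.

    x0: noise sample, x1: data sample (same shape as x0),
    t: time in [0, 1], scalar or broadcastable against x0.
    """
    # Straight-line interpolation from noise (t=0) to data (t=1).
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # The target velocity is constant along that line.
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t

def cfm_loss(velocity_fn, x0, x1, t, cond, sigma_min=1e-4):
    """Mean squared error between predicted and target velocity."""
    x_t, u_t = ot_cfm_training_pair(x0, x1, t, sigma_min)
    v = velocity_fn(x_t, t, cond)  # hypothetical model call
    return float(np.mean((v - u_t) ** 2))

def euler_sample(velocity_fn, x0, cond, num_steps=32):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

In a real setup `x1` would be the sound token embeddings for an utterance, `cond` the text embeddings, and `velocity_fn` a trained network. A higher-order ODE solver could replace the Euler loop.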
Next Steps
- Adapt our dataset to a flow matching framework.
- Develop a flow matching framework for T2S tasks.
This could be related: https://github.com/dongzhuoyao/flowseq/tree/main
You can use a continuous flow matching model to train what is essentially a text-based autoencoder.
The specific architecture should probably be conditional flow matching, with the text as the condition.
The length of the generation should be set using something as simple as a words-per-second heuristic.
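A rough sketch of what such a heuristic could look like. The function name and both rates are assumptions: 2.5 words per second is a guess at conversational speech, and 25 tokens per second should be replaced with whatever rate the semantic tokenizer actually produces.

```python
import math

def estimate_num_semantic_tokens(text, words_per_second=2.5,
                                 tokens_per_second=25.0, min_tokens=1):
    """Guess how many semantic tokens to generate for a piece of text."""
    # Whitespace word count; a crude proxy for spoken length.
    num_words = len(text.split())
    # Estimated speech duration in seconds (assumed speaking rate).
    duration_s = num_words / words_per_second
    # Convert to a token count, never returning fewer than min_tokens.
    return max(min_tokens, math.ceil(duration_s * tokens_per_second))
```

Whitespace splitting will undercount for languages written without spaces, so a characters-per-second variant is probably needed for the multilingual case.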
The decoder will be the frozen Whisper decoder.
The goal is self-supervised text-to-text roundtrip through the CFM model and the decoder.
No guarantee that this will work at all but it would be damn interesting if it works.
I think it has a chance of working because we're distilling the information from the whisper decoder, which is a strong model.
If it works, it means we will be able to train the T2S model on all the languages supported by Whisper, without the need for any audio data. All we need is some multilingual text data.
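One possible way to write the roundtrip objective down, as a sketch only. The notation is my own assumption: $y$ is a text sequence from a text-only dataset $\mathcal{D}_{\text{text}}$, $G_\theta(x_0, y)$ is the embedding sequence obtained by integrating the CFM velocity field from noise $x_0$ with $y$ as the condition, and $p_{\text{dec}}$ is the frozen Whisper decoder attending to that sequence.

```latex
\mathcal{L}_{\text{roundtrip}}(\theta)
  = \mathbb{E}_{y \sim \mathcal{D}_{\text{text}},\; x_0 \sim \mathcal{N}(0, I)}
    \left[ -\sum_{i=1}^{|y|}
      \log p_{\text{dec}}\!\left( y_i \mid y_{<i},\, G_\theta(x_0, y) \right)
    \right]
```

Only $\theta$ is updated; the decoder weights stay fixed. Note that this form needs gradients to flow back through the sampler, which may be expensive with many integration steps, so a few-step sampler is likely the practical starting point.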
Also, check out this repo
https://github.com/lucidrains/voicebox-pytorch
It has a good implementation for the CFM model, relatively easy to read. I've used it before in my work.
It also has some links to Spear-TTS, the precursor to WhisperSpeech. Also E2-TTS and later F5-TTS might have built on top of it.
Oh he also has: https://github.com/lucidrains/e2-tts-pytorch
thanks lucidrains
Here is more inspiration about how to achieve some of this
I think we might be able to frame this as a generative adversarial network: the CFM is the generator, and the Whisper decoder is the discriminator.