collabora/WhisperSpeech

Semantic -> acoustic modeling

jpc opened this issue · 7 comments

jpc commented

We got #3 working so now it's time to try to convert from Whisper-based semantic tokens (#3) to EnCodec-based acoustic tokens (#2).

We found out that better semantic tokens (from Whisper medium) make this task a lot easier and even tiny models sound great. Multilingual semantic token training helps and cross-language voice cloning works great.
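
For context, the acoustic tokens come from Meta's EnCodec codec. Below is a minimal sketch (not necessarily the project's exact pipeline) of extracting EnCodec tokens at the 1.5 kbps setting mentioned later in this thread; the input file name is a placeholder:

```python
# Sketch: extract EnCodec acoustic tokens at 1.5 kbps.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)  # 1.5 kbps -> 2 RVQ codebooks per frame

wav, sr = torchaudio.load("sample.wav")  # hypothetical input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))     # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)  # (batch, n_q, T) discrete tokens
print(codes.shape)
```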

There are a couple of hypotheses to test:

  • Can we train a non-autoregressive (forward) model, or does it have to be autoregressive to get anywhere? (no, but see SoundStorm)
  • To start simple, could we get away with single speaker training only? This would allow us to ignore the prompting for now and just let the model memorize the speaker. (seems to work on 1000hrs of one speaker)
  • How much data is needed to get minimally usable performance (low-quality but intelligible speech)? (1000 hours seems enough; training takes about a day on an A100)
  • And last but not least: do the Whisper encoder embeddings retain enough phonetic information to do this at all? (from initial tests in #5 they seem to be closer to speech than to text)

We also still have a couple of engineering challenges:

  • fix the issue where the model starts generating noise after exactly 10s (this may be related to cross-attention and the 3x length difference between the encoder and decoder contexts)
  • investigate sigmaReparam from Apple (supposed to make training more stable)
  • use the optimized scaled dot product attention kernels from the newest PyTorch (should speed up the training a lot)
  • add prompting and multi-speaker support (we currently condition on SpeechBrain speaker embeddings; see the sketch after this list)
  • switch to AdaFactor (should use less memory than Adam so we can train on smaller GPUs)
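
For the speaker-conditioning item above, here is a rough sketch of pulling a SpeechBrain speaker embedding. The specific pretrained model (ECAPA-TDNN trained on VoxCeleb) and the file names are assumptions, not necessarily what WhisperSpeech uses:

```python
# Sketch: extract a SpeechBrain speaker embedding to condition the S2A model on.
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",   # assumed pretrained model
    savedir="tmp/spkrec-ecapa-voxceleb",
)

wav, sr = torchaudio.load("speaker_prompt.wav")   # hypothetical prompt audio
wav = torchaudio.functional.resample(wav, sr, 16000)  # this model expects 16 kHz

spk_emb = classifier.encode_batch(wav)  # (batch, 1, emb_dim), e.g. emb_dim = 192
```
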
jpc commented

I pushed the first version of the semantic-to-acoustic modeling code based on the Whisper transformer model, but it does not train, so I probably still have some bugs somewhere. I'm going to create a synthetic dataset and debug it the same way I did the quantization bottleneck.

jpc commented

I found some bugs in the code and now it trains successfully:

  1. Overfits quickly on 2 hrs of speech
  2. Trains without overfitting on my 160hr single-speaker dataset

The performance is still not great but it's a step in the right direction. :) It's still based on the old VQ/RQ tokens, so switching to the improved ones should help a bit (see #3).

I also experimented with using Whisper embeddings directly (without quantization) and it works. It allowed me to easily experiment with extracting the embeddings from other layers of the encoder. This seems promising for balancing the difficulty of the two translation tasks: text to semantic tokens vs. semantic tokens to acoustic tokens. For reference, in SPEAR-TTS the semantic-to-acoustic task was a lot easier (they used a 12-layer decoder-only model, about the size of Whisper Base) than the text-to-semantic task (T5-Large – a 24-layer encoder + a 24-layer decoder, the exact same size as Whisper Medium).

So right now we will focus on trying to understand the balance between these two tasks.
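
As a rough illustration of the "embeddings from other layers" experiment, a forward hook on the Whisper encoder is enough to tap an intermediate block; the model size, layer index, and file name below are placeholders:

```python
# Sketch: grab activations from an intermediate Whisper encoder layer with a
# forward hook, to compare semantic embeddings taken from different depths.
import torch
import whisper

model = whisper.load_model("medium")
layer_idx = 18            # hypothetical: which encoder block to tap
captured = {}

def hook(module, inputs, output):
    captured["emb"] = output.detach()

handle = model.encoder.blocks[layer_idx].register_forward_hook(hook)

audio = whisper.load_audio("sample.wav")    # hypothetical input file
audio = whisper.pad_or_trim(audio)          # 30 s window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    model.encoder(mel.unsqueeze(0))

handle.remove()
emb = captured["emb"]     # (1, 1500, d_model) at 50 frames/s
```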

jpc commented

I've trained a new S->A model and fixed the autoregressive sampling, and it has started generating some recognizable speech.

There is still a serious bug (it generates only the first 10 seconds; everything afterwards is noise), but the common phrases ("This is a LibriVox recording", "Gentleman") already sound quite good (modulo the quality of the EnCodec speech codec at 1.5 kbps). Once I figure out this bug the model should train a lot more easily, so I expect a big jump in quality in my next update. :)

jpc commented

I fixed the 10 second generation bug (it was a bug in the sampling code). I also found out that lowering the multinomial sampling temperature to 0.8 improves the quality quite a lot.
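
For reference, temperature sampling is just a rescaling of the logits before the multinomial draw; a minimal sketch (not the actual sampling code):

```python
# Sketch: multinomial sampling with a temperature knob; T = 0.8 sharpens the
# distribution and noticeably improved audio quality in these experiments.
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    # logits: (batch, vocab_size) for the next acoustic token
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1)
```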

I also trained another model, replacing cross-attention with adding the rescaled encoder features to the input of the middle layer of the decoder (both are sampled at fixed rates, so the model does not have to learn to map one to the other), and got pretty good quality:

saar-1300hr-2l-20e-T0.8.mov
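
A rough sketch of the architectural change described above, under the assumption that the encoder features are repeated 3x to match the decoder frame rate (the length ratio mentioned earlier) and rescaled with a learned linear projection; the exact shapes and scaling in WhisperSpeech may differ:

```python
# Sketch (not the exact WhisperSpeech code): instead of cross-attention, upsample
# the encoder features to the decoder's frame rate and add them to the hidden
# state entering the middle decoder layer.
import torch
import torch.nn as nn

class AddEncoderFeatures(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, upsample: int = 3):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)  # learned rescale/projection
        self.upsample = upsample                 # assumed fixed rate ratio

    def forward(self, dec_hidden: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        # dec_hidden: (B, T_dec, dec_dim), enc_out: (B, T_enc, enc_dim)
        enc = self.proj(enc_out).repeat_interleave(self.upsample, dim=1)
        return dec_hidden + enc[:, : dec_hidden.shape[1]]
```
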
jpc commented

Oh, I forgot to mention that the new PyTorch 2.0 optimized attention implementation is amazing. With a very simple replacement I got 4x speedup on an A100.
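
The replacement amounts to swapping a hand-written attention computation for the fused PyTorch 2.0 kernel; a minimal sketch:

```python
# Sketch: use torch.nn.functional.scaled_dot_product_attention (PyTorch >= 2.0),
# which dispatches to fused FlashAttention/memory-efficient kernels where available.
import torch
import torch.nn.functional as F

def attention(q, k, v, causal: bool = True):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    # Before: manual softmax(q @ k.transpose(-2, -1) / sqrt(d)) @ v
    # After: one fused call
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```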

Hi @jpc, thanks for this excellent work! I have a small question about the semantic to acoustic model. I noticed that you set unique to False in your data loader, which is different from the paper. Will the semantic tokens contain prosodic information of speech?

By the way, does the above audio result come from "3. Semantic to acoustic token modeling.ipynb" or from the "3B *.ipynb" notebook? Could you provide some pre-trained models?

Thanks

jpc commented

Yup, our semantic tokens also carry prosody information. This makes the S2A model's job easier and the overall solution faster. It also means that prosody cannot be changed during voice cloning.

The newest samples (in the README) sound a lot better.