collabora/WhisperSpeech

4. Text -> semantic tokens modeling

jpc opened this issue · 2 comments

jpc commented

This will be a model that converts text tokens into Whisper-encoder-derived semantic tokens. With that we will have a complete TTS pipeline.
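For concreteness, the rough shape of it is a standard seq2seq transformer over token ids. The sketch below is purely illustrative (the vocabulary sizes, dimensions and names are placeholders, not our actual code):

```python
from torch import nn

class TextToSemantic(nn.Module):
    """Toy sketch: text token ids -> Whisper-encoder-derived semantic token ids."""
    def __init__(self, text_vocab=51865, semantic_vocab=1024, dim=384, layers=6):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)   # positional encodings omitted for brevity
        self.sem_emb = nn.Embedding(semantic_vocab, dim)
        self.transformer = nn.Transformer(
            d_model=dim, nhead=6,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.head = nn.Linear(dim, semantic_vocab)

    def forward(self, text_ids, semantic_ids):
        # Teacher forcing: predict semantic token t from tokens < t plus the full text.
        tgt = self.sem_emb(semantic_ids)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        out = self.transformer(self.text_emb(text_ids), tgt, tgt_mask=causal)
        return self.head(out)  # (batch, semantic_len, semantic_vocab) logits
```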

To train it we can reuse Whisper as our back-translation model (the paper had to train one from scratch). We can use the existing distillation setup as a starting point, but we will have to make sure we get all the text tokens, since Whisper tends to cut decodings short and signals (via timestamp tokens) that it should be rerun with more audio.
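As an illustration of the kind of check I have in mind (just a sketch, the model size and threshold are arbitrary), we could keep a (text, audio) training pair only when Whisper's decoding recovers essentially the whole reference transcript:

```python
import difflib

import whisper

model = whisper.load_model("base.en")

def covers_full_text(audio_path: str, reference: str, threshold: float = 0.9) -> bool:
    """Return True if the Whisper decoding matches (almost) all of the reference text."""
    decoded = model.transcribe(audio_path)["text"]
    matcher = difflib.SequenceMatcher(None, reference.lower().split(), decoded.lower().split())
    return matcher.ratio() >= threshold  # drop pairs where the decoding was cut short
```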

This was a pretty difficult task in the original SPEAR TTS implementation (they had to use a 24-layer model).

@jpc I have a suggestion: there is already a text-to-semantic-tokens pipeline in the Bark project.

I wonder whether we can use it directly, or at least take it as a reference.
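From memory, using it looks roughly like the following (the import paths and signature should be double-checked against the current Bark repo):

```python
# Sketch from memory -- verify against the current Bark repo before relying on it.
from bark import preload_models
from bark.api import text_to_semantic

preload_models()
semantic_tokens = text_to_semantic("Hello, this is a test.")
# Note: Bark's semantic tokens come from its own tokenizer, not from the Whisper
# encoder, so they may not be drop-in compatible with this pipeline.
```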

jpc commented

OK, we now have the complete pipeline, and T2S turned out to be the more difficult part, just as in SPEAR TTS. We get great performance with the small model (this is what the current README samples are based on); switching to medium does not seem to improve the results.

The biggest missing piece right now is control over emotion and prosody. We don't have anything like speaker embeddings to condition on, so we are looking at alternative approaches.
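For reference, the kind of conditioning we are missing is something like the generic sketch below (purely illustrative, not code we have), where a per-utterance speaker/style vector is mixed into the decoder inputs:

```python
from torch import nn

class ConditionedTokenEmbedding(nn.Module):
    """Toy illustration: add a per-utterance speaker/style vector to the token embeddings."""
    def __init__(self, vocab=1024, dim=384, style_dim=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.style_proj = nn.Linear(style_dim, dim)

    def forward(self, token_ids, style_vec):
        # token_ids: (batch, seq); style_vec: (batch, style_dim) speaker/emotion/prosody embedding
        return self.tok(token_ids) + self.style_proj(style_vec).unsqueeze(1)
```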