theodorblackbird/lina-speech

Congratulations! Model checkpoint on the Hugging Face Hub?

Vaibhavs10 opened this issue · 2 comments

Hi there,

I'm VB, I lead the advocacy efforts for open-source audio at Hugging Face.
Congratulations on such a brilliant checkpoint. Even at 60M, the model performs remarkably well.

It'd be great if you were to release the data preparation steps along with model checkpoints (the ones used in the demo page).

I'd personally love to scale this to more data and perhaps test it on multilingual datasets like CML-TTS and so on.

More than happy to help in any way needed!

Cheers,
VB

Hi VB,

Thanks for your support!
I intend to release dataset instructions + checkpoints in the next iteration, likely in a couple of weeks. I might have rushed things a bit, but I wanted to share this with the RWKV community.

To clarify a bit:

  • The repo is a mess and I suggest you not waste too much time on it.
  • The demo I shared has severe limitations since I trained it on long audio samples only, as a proof of concept. As a consequence, it behaves poorly on short text conditioning. The demo hides this fact by pairing prompt + target with a targeted minimum size.
  • The next iteration will hopefully provide well-trained checkpoints up to 60M on various "linear attention" architectures, likely some versions of RWKV and Mamba.

It would be a pleasure to collaborate with you on open-sourcing some multilingual models. I already know and appreciate your work.

Théodor

Yay! Thanks for responding to me so quickly @theodorblackbird!

Looking forward to it. I'm in the process of thinking through how to generate high-quality synthetic audio datasets (it's at the early, idea-gathering phase). I think a Mamba-type architecture is a good fit for it. Happy to collaborate more closely on that if you think it helps.

Also happy to invite you to our Slack to discuss this in more detail. 🚀

Cheers!