UTAUTAI: Unrestricted Tune Automated Technology Artificial Intelligence

📖 Quick Index

🚀Model Architecture
🤔What is UTAUTAI?
🐍Method
🧠TODO
🙏Appreciation
⭐️Show Your Support
🙆Welcome Contributions

🚀Model Architecture

🙇 Sorry for the hand-drawn diagram.

🤔What is UTAUTAI?

UTAUTAI is an open-source project that aims to generate matching vocal and instrumental tracks from lyrics, similar to Suno AI's Chirp and Riffusion.

🐍Method

UTAUTAI's method is mainly inspired by SPEAR-TTS.

During training, the model takes two kinds of input: semantic tokens produced by 'lyrics2semantic AR', which predicts semantic tokens from lyrics, and acoustic tokens. In addition, MERT representations extracted from the music are quantized with k-means to obtain a further set of semantic tokens.
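The k-means quantization step can be illustrated with a minimal sketch. It is not the repository's implementation: the function names (`fit_kmeans`, `quantize`) are hypothetical, the frames are toy 2-D vectors standing in for MERT features (which are much higher-dimensional), and plain Python is used only to keep the example dependency-free. A real pipeline would more likely use a library k-means on tensors.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to `frame`."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def fit_kmeans(frames, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's k-means; returns the codebook (list of centroids).

    Hypothetical helper: initialisation is a random sample of the frames,
    so the result can be a local optimum.
    """
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    for _ in range(num_iters):
        buckets = [[] for _ in range(num_clusters)]
        for frame in frames:
            buckets[nearest(frame, centroids)].append(frame)
        for k, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                dim = len(bucket[0])
                centroids[k] = [sum(f[d] for f in bucket) / len(bucket)
                                for d in range(dim)]
    return centroids


def quantize(frames, centroids):
    """Map each feature frame to a discrete token: its nearest centroid ID."""
    return [nearest(frame, centroids) for frame in frames]


# Usage with a fixed toy codebook (assumed values, for illustration only).
codebook = [[0.0, 0.0], [5.0, 5.0]]
frames = [[0.1, -0.2], [4.8, 5.3], [0.3, 0.1]]
print(quantize(frames, codebook))  # [0, 1, 0]
```

Each frame of continuous features becomes one integer ID, so a sequence of feature frames turns into a token sequence that an autoregressive model can consume alongside the other tokens.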

At inference time, however, there is no music from which to extract MERT representations. We therefore train a Style Module, following the methodology of PromptTTS 2, that predicts the target MERT representations from the prompt at inference time. The Style Module is a transformer-based diffusion model.
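To make the diffusion part more concrete, here is a minimal sketch of the forward (noising) process that a diffusion model of this kind is typically trained against. The README does not state the noise schedule or the training objective, so everything here is an assumption: a DDPM-style linear beta schedule, a noise-prediction setup, and hypothetical names (`linear_beta_schedule`, `cumulative_alphas`, `add_noise`). The vector `x0` stands in for a target MERT representation; the transformer denoiser itself is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed DDPM-style defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta[s]) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * n for x, n in zip(x0, noise)]
    return xt, noise


# Usage: noise a toy "MERT representation" at a random timestep.
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [1.0, -2.0, 0.5]          # toy stand-in for a target representation
t = rng.randrange(len(alpha_bars))
xt, noise = add_noise(x0, t, alpha_bars, rng)
# A denoiser conditioned on the prompt would be trained to recover `noise`
# (or x0) from (xt, t, prompt); that network is outside this sketch.
```

At inference, a model trained this way would start from pure noise and iteratively denoise, conditioned on the prompt, to produce a representation that plays the role the music-derived MERT features played during training.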

I think this approach can accomplish the target task. What do you think?

🧠TODO

How can we obtain lyrics that match the cropped audio? Or should we even crop the audio in the first place?
Examine the handling of phonemization and special tokens, and make the necessary code modifications.
Correct the collator in the dataset.
Complete the StyleModule inference code.
Other minor code fixes, such as masking strategies.
Drop the diffusion model and adopt a consistency model instead.

🙏Appreciation

⭐️Show Your Support

If you find UTAUTAI interesting and useful, give us a star on GitHub! ⭐️ It encourages us to keep improving the model and adding exciting features.

🙆Welcome Contributions

Contributions are always welcome.

0417keito/UTAUTAI