This project allows you to build a simple Web application that, when combined with my other repository UmaChat, enables you to have conversations with Uma Musume characters.
The VITS inference part is based on https://github.com/Plachtaa/VITS-fast-fine-tuning. Many thanks!
1. Clone this repository
git clone https://github.com/kagari-bi/UmaChat_WebApi.git
2. Create a directory called 'models' in the root of the project, download the model you want from my HuggingFace repository, and unzip it into the 'models' directory
3. Open config_backup.ini and fill in your OpenAI account's api_key, your Baidu account's appid and key (used to translate ChatGPT's responses into Japanese before VITS inference), and the proxy address. Save and close the file, then rename it to config.ini
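For reference, a filled-in config.ini might look roughly like the following. The section and key names here are only an illustration; keep whatever layout config_backup.ini already provides.

```ini
; Illustrative values only -- follow the layout that config_backup.ini already uses
[openai]
api_key = sk-xxxxxxxxxxxxxxxxxxxx

[baidu]
appid = your_baidu_appid
key = your_baidu_key

[proxy]
address = http://127.0.0.1:1920
```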
4. Install dependencies
pip install -r requirements.txt
5. Run the Web application
uvicorn app:app --reload --host 0.0.0.0 --port 8000
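If app.py is built on FastAPI (the uvicorn entry point suggests an ASGI app of that kind), the auto-generated interactive docs are served at /docs by default, so you can check that the server is up with:

curl -I http://127.0.0.1:8000/docs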
Although I expect this project to eventually support conversations with all Uma Musume characters, I cannot guarantee a timeframe. In the meantime, you can add the necessary files for your favorite Uma Musume characters to the action_mapping_table and prompt folders yourself. You can also train the corresponding VITS model for each character and place it in the 'models' folder (see the layout sketch below).
I will probably release a video tutorial on this later.
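As a rough illustration (the exact file names and formats are not fixed here; mirror the files of a character that is already included), supporting a new character generally means adding something along these lines:

```
models/
    YourCharacter/               # the VITS model you trained for this character
action_mapping_table/
    <files for YourCharacter>    # copy the format of an existing character's files
prompt/
    <files for YourCharacter>    # copy the format of an existing character's prompt
```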
Currently, emotion recognition in this project is implemented with ChatGPT. However, without binding a payment method, the ChatGPT API can only be called three times per minute. When emotion recognition is involved, each question-and-answer round requires two API calls, so only about 1.5 rounds fit into a minute, meaning a single round takes roughly 40 seconds on average.
There are three solutions:
- Use two ChatGPT accounts
- Reduce the frequency of questions
- Bind a payment method
Personally, I do not recommend the third option, as I have not optimized the continuous-conversation logic yet. Currently, the full dialogue history and the new question are simply sent together in each request, so token consumption rises sharply as the number of dialogue turns grows, which makes it expensive (a minimal sketch of this pattern follows below).
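To make the problem concrete, here is a minimal sketch of that naive pattern. This is not the project's actual code: it assumes the older (pre-1.0) openai Python package, and all names are made up for illustration.

```python
import openai  # assumes the older (pre-1.0) openai package interface

history = []  # grows by two messages every turn and is resent in full each time

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    # The whole history goes into every request, so the prompt token count grows
    # with each turn and the cumulative cost rises sharply as the chat gets longer.
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=history,
    )
    answer = reply["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer
```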
I haven't come up with a particularly good solution yet. I might implement the emotion recognition part with another large language model in the future, but I can't guarantee that I will definitely do it (I'm quite the procrastinator).
The current Q&A logic obtains a response from ChatGPT, translates it into Japanese with Baidu Translate, and then uses VITS to convert the Japanese text into audio. Finally, the text and the audio are combined as the response content.
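Sketched as code, the flow looks roughly like this; the three helper functions are hypothetical placeholders standing in for the real ChatGPT, Baidu Translate, and VITS calls, not the project's actual API:

```python
def chatgpt_reply(question: str) -> str:
    ...  # ask ChatGPT and return its (Chinese) response

def baidu_translate(text: str, target_lang: str = "jp") -> str:
    ...  # translate the response into Japanese via Baidu Translate

def vits_synthesize(japanese_text: str) -> bytes:
    ...  # run VITS inference and return the synthesized audio

def answer(question: str) -> dict:
    text = chatgpt_reply(question)          # 1. get the response from ChatGPT
    japanese = baidu_translate(text, "jp")  # 2. translate it into Japanese
    audio = vits_synthesize(japanese)       # 3. turn the Japanese text into audio
    return {"text": text, "audio": audio}   # 4. combine text and audio as the response
```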
The problem with this approach is that it is difficult to reproduce certain character-specific expressions (such as FineMotion's "ふふ" and "ごきげんよう"). One possible solution is to have ChatGPT answer in Japanese directly, and then call the translation interface to translate that answer into Chinese as the returned text.
However, Baidu Translate's results can still be awkward in some cases. If you want to be nitpicky, you might need to use a large language model... I'm currently exploring better solutions.
Help bring Uma Musume characters to the desktop pet world (just kidding). In fact, once you have adapted the project to your favorite Uma Musume characters following the advanced usage section, you can submit the additional files to this project via a pull request or any other feasible method.
- Avoid reinventing the wheel and significantly improve efficiency
- That's it
The proxy address format looks like http://127.0.0.1:1920. The exact way to check your proxy address depends on the proxy software you are using; try searching for the relevant keywords to see whether you can find instructions online.
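As a generic illustration of how a proxy address in that format is typically applied in Python (not necessarily how this project consumes the value from config.ini):

```python
import os

# Route outgoing HTTP(S) traffic through the local proxy; most Python HTTP
# clients (e.g. requests) respect these environment variables by default.
os.environ["HTTP_PROXY"] = "http://127.0.0.1:1920"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:1920"
```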