⚠️ This project is still under development!
UltraSinger is a tool that automatically creates UltraStar.txt, MIDI and notes from music. This means it automatically pitches UltraStar files, adds text and tapping to UltraStar files, and creates separate UltraStar karaoke files. It can also re-pitch existing UltraStar files and calculate the possible in-game score.
Multiple AI models are used to extract text from the voice and to determine the pitch.
Please mention UltraSinger in your UltraStar.txt file if you use it. It helps others find this tool, and it helps you because the tool gets improved and maintained. You should only use it on Creative Commons licensed songs.
There are many ways to support this project. Starring ⭐️ the repo is just one 🙏
You can also support this work on Patreon or Buy Me A Coffee.
This will help me a lot to keep this project alive and improve it.
- Install Python 3.10 (older and newer versions have some breaking changes). Download
- Also install ffmpeg separately and add it to your PATH. Download
- Open a console (CMD) and navigate to the project folder.
- Type
py -3.10 -m venv .venv
and press enter. If this does not work, try python or python3 instead of py.
  - If you have multiple versions installed, you can use
    py -0p
    to see all installed versions.
  - To build with the newest version, use
    py -m venv .venv
    But currently it only works with 3.10.
- Wait until the console has finished creating the environment. This can take a while.
- Type
.venv\Scripts\activate
and press enter.
- You should now see
(.venv)
in front of your console line.
- Install the requirements with
pip install -r requirements.txt
- Install the GPU requirements with
pip3 install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2+cu117 --index-url https://download.pytorch.org/whl/cu117
- Now you can use the UltraSinger source code with
py UltraSinger.py [opt] [mode] [transcription] [pitcher] [extra]
See How to use for more information.
For more information about Python environments look here.
Installation commands to copy:
py -3.10 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
pip3 install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2+cu117 --index-url https://download.pytorch.org/whl/cu117
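Before creating the environment, it can help to confirm the two prerequisites named above: Python 3.10 and an ffmpeg binary on PATH. The following is a minimal sketch and not part of UltraSinger; the function name `check_prerequisites` is made up for illustration, and it only uses the Python standard library.

```python
import shutil
import sys


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not part of UltraSinger. The two parameters exist
    only so the check can be exercised without changing the real system.
    """
    problems = []
    # The installation notes state that only Python 3.10 currently works.
    if tuple(version_info[:2]) != (3, 10):
        problems.append(
            f"Python 3.10 is required, found {version_info[0]}.{version_info[1]}"
        )
    # ffmpeg must be installed separately and be reachable through PATH.
    if which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it with the interpreter you plan to use for the venv; no output means both checks passed.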
Run UltraSinger:
- Activate the environment with
.venv\Scripts\activate
(You don't need this if you have already activated it or have just installed with the steps above.)
- Navigate to the src folder
cd src
- Start UltraSinger
py UltraSinger.py
Start environment only once:
.venv\Scripts\activate
cd src
Start UltraSinger:
py UltraSinger.py
Not all options are working yet!
UltraSinger.py [opt] [mode] [transcription] [pitcher] [extra]
[opt]
-h This help text.
-i Ultrastar.txt
audio like .mp3, .wav, youtube link
-o Output folder
[mode]
## INPUT is audio ##
default Creates all
# Single file creation selection is in progress, you currently get all!
(-u Create ultrastar txt file) # In Progress
(-m Create midi file) # In Progress
(-s Create sheet file) # In Progress
## INPUT is ultrastar.txt ##
default Creates all
# Single selection is in progress, you currently get all!
(-r repitch Ultrastar.txt (input has to be audio)) # In Progress
(-p Check pitch of Ultrastar.txt input) # In Progress
(-m Create midi file) # In Progress
[transcription]
# Default is whisper
--whisper Multilingual model > tiny|base|small|medium|large-v1|large-v2 >> ((default) is large-v2)
English-only model > tiny.en|base.en|small.en|medium.en
--whisper_align_model Use other languages model for Whisper provided from huggingface.co
--language Override the language detected by whisper, does not affect transcription but steps after transcription
--whisper_batch_size Reduce if low on GPU mem >> ((default) is 16)
--whisper_compute_type Change to "int8" if low on GPU mem (may reduce accuracy) >> ((default) is "float16" for cuda devices, "int8" for cpu)
[pitcher]
# Default is crepe
--crepe tiny|full >> ((default) is full)
--crepe_step_size unit is milliseconds >> ((default) is 10)
[extra]
--hyphenation True|False >> ((default) is True)
--disable_separation True|False >> ((default) is False)
--disable_karaoke True|False >> ((default) is False)
--create_audio_chunks True|False >> ((default) is False)
--plot True|False >> ((default) is False)
[device]
--force_cpu True|False >> ((default) is False) All steps will be forced to cpu
--force_whisper_cpu True|False >> ((default) is False) Only whisper will be forced to cpu
--force_crepe_cpu True|False >> ((default) is False) Only crepe will be forced to cpu
For standard use, you only need to use [opt]. All other options are optional.
-i "input/music.mp3"
-i https://www.youtube.com/watch?v=BaW_jenozKc
This re-pitches the audio and creates a new txt file.
-i "input/ultrastar.txt"
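If you want to start UltraSinger from another script, the options above can be assembled into an argument list. This is a minimal sketch under stated assumptions: `build_command` is a hypothetical helper, not part of UltraSinger, and it only uses flags that appear in the option list above (`-i`, `-o`, `--whisper`, `--crepe`, `--force_cpu`).

```python
def build_command(input_path, output_folder=None, whisper=None, crepe=None,
                  force_cpu=False, python="py"):
    """Assemble an UltraSinger command line as a list of arguments.

    Hypothetical helper for illustration. Only [opt] is required;
    every other option is added only when it is explicitly set.
    """
    command = [python, "UltraSinger.py", "-i", input_path]
    if output_folder is not None:
        command += ["-o", output_folder]
    if whisper is not None:
        command += ["--whisper", whisper]
    if crepe is not None:
        command += ["--crepe", crepe]
    if force_cpu:
        # Boolean options are written as "True" or "False" in the option list.
        command += ["--force_cpu", "True"]
    return command


print(build_command("input/music.mp3", whisper="tiny", crepe="tiny"))
```

The resulting list could be passed to `subprocess.run(command, cwd="src", check=True)` from the project folder with the environment activated. Passing a list instead of a single string avoids quoting problems with paths that contain spaces.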
Keep in mind that while a larger model is more accurate, it also takes longer to transcribe.
For the first test run, use the tiny model; for accurate results, use the large-v2 model.
-i XYZ --whisper large-v2
Currently provided default language models are en, fr, de, es, it, ja, zh, nl, uk, pt.
If the language is not in this list, you need to find a phoneme-based ASR model from the
🤗 huggingface model hub. It will download automatically.
Example for Romanian:
-i XYZ --whisper_align_model "gigant/romanian-wav2vec2"
Hyphenation is on by default. It can be deactivated if it does not produce anything useful. Note that words are simply split, without checking whether each separated part really starts, or is heard, at that position.
-i XYZ --hyphenation True
Pitching is done with the crepe model.
Also consider that a bigger model is more accurate, but also takes longer to pitch.
For just testing, you should use tiny.
If you want solid accuracy, use the full model.
-i XYZ --crepe full
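A pitch tracker reports frequencies in Hz, while notes are usually expressed as note numbers. The sketch below shows the standard conversion from frequency to MIDI note number (A4 = 440 Hz = note 69). It is an illustration only and does not claim to be how UltraSinger converts pitches internally; both function names are made up for this example.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def frequency_to_midi(frequency_hz):
    """Round a frequency in Hz to the nearest MIDI note number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(frequency_hz / 440.0))


def midi_to_name(midi_note):
    """Name a MIDI note using the common convention that note 60 is C4."""
    return f"{NOTE_NAMES[midi_note % 12]}{midi_note // 12 - 1}"


print(midi_to_name(frequency_to_midi(440.0)))  # prints "A4"
```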
The vocals are separated from the audio before they are passed to the models. If problems occur with this, you can disable the separation, and the original audio file is used instead.
-i XYZ --disable_separation True
The score that the singer in the audio would receive is measured. You get two scores, simple and accurate. Wondering where the difference is? UltraStar is not interested in pitch heights. As long as the pitch is in the range A-G, you get one point. This makes sense for the game, because otherwise men would not get points for high female voices and women would not get points for low male voices. Accurate is the real tone specified in the txt. I had txt files where the pitch was in a range not singable by humans, but you could still reach the 10k points in the game. The accuracy is important here, because the MIDI and sheet are created from it, and you also want to have accurate files.
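The difference between the two scores can be illustrated with a small sketch. It assumes notes are given as MIDI note numbers and treats "simple" as matching the note name regardless of octave and "accurate" as matching the exact note. The function `score_notes` is hypothetical, and the real scoring in the game and in UltraSinger is more involved than this hit ratio.

```python
def score_notes(sung, expected):
    """Return (simple, accurate) hit ratios for two equal-length note lists.

    Hypothetical illustration: notes are MIDI note numbers.
    simple   counts a hit when the note name matches in any octave.
    accurate counts a hit only when the exact note matches.
    """
    if len(sung) != len(expected):
        raise ValueError("sung and expected must have the same length")
    if not expected:
        return 0.0, 0.0
    simple_hits = sum((s - e) % 12 == 0 for s, e in zip(sung, expected))
    accurate_hits = sum(s == e for s, e in zip(sung, expected))
    return simple_hits / len(expected), accurate_hits / len(expected)


# The second note is sung one octave too high: it counts for simple only.
simple, accurate = score_notes([60, 72, 62], [60, 60, 64])
print(f"simple={simple:.2f} accurate={accurate:.2f}")
```

This is why a txt file with notes in an unsingable octave can still yield a full simple score while the accurate score stays low.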
With a GPU you can speed up the process, and the quality of the transcription and pitching is also better.
You need a CUDA device for this to work. If you use a Mac, then sorry, there is no CUDA device for Mac machines.
It is recommended, but optional, to install the CUDA driver for your GPU; see driver.
Install torch with CUDA separately in your venv. See torch+cuda.
Also check your GPU's CUDA support. See cuda support.
Command for pip:
pip3 install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2+cu117 --index-url https://download.pytorch.org/whl/cu117
If you want to use conda instead, you need a different installation command. See this link.
The pitch tracker used by UltraSinger (crepe) uses TensorFlow as its backend. TensorFlow dropped GPU support for Windows for versions >2.10, as you can see in this release note and their installation instructions.
For now, UltraSinger runs the latest version available that still supports GPUs on Windows.
For running later versions of TensorFlow on Windows while still taking advantage of GPU support, the suggested solution is:
- install WSL2
- within the Ubuntu WSL2 installation
  - run
    sudo apt update && sudo apt install nvidia-cuda-toolkit
  - follow the setup instructions for UltraSinger at the top of this document
If something crashes because of low VRAM, then use a smaller model.
Whisper needs more than 8GB VRAM in the large model!
You can also force CPU usage with the extra option --force_cpu.
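The advice above can be turned into a small decision helper. This is a sketch under stated assumptions, not UltraSinger code: `suggest_options` and `detect_cuda` are hypothetical names, the 8 GB threshold comes from the note above, and the choice of `medium` as the smaller Whisper model is an assumption made only for this example. The detection relies on `torch.cuda.is_available()` and `torch.cuda.get_device_properties()` and falls back to CPU when torch is not installed.

```python
def suggest_options(cuda_available, vram_gb=0.0):
    """Suggest extra UltraSinger options for the given hardware (illustrative)."""
    if not cuda_available:
        # Without a CUDA device, force every step onto the CPU.
        return ["--force_cpu", "True"]
    if vram_gb <= 8:
        # The large Whisper model needs more than 8 GB of VRAM.
        # Assumption: "medium" is used here as the smaller fallback model.
        return ["--whisper", "medium"]
    return []


def detect_cuda():
    """Return (cuda_available, vram_gb); (False, 0.0) if torch or CUDA is missing."""
    try:
        import torch  # only present if the torch requirements are installed
    except ImportError:
        return False, 0.0
    if not torch.cuda.is_available():
        return False, 0.0
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return True, total_bytes / 1024 ** 3


print(suggest_options(*detect_cuda()))
```

The returned list can be appended to the arguments you already pass to UltraSinger.py.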