Whisper Karaoke is a small script that takes MP3 files as input and uses the faster-whisper library to transcribe the audio, splitting the transcription into individual words.
This is then used by a small web player in a karaoke-like fashion, highlighting each word as it is sung.
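Under the hood this relies on faster-whisper's word-level timestamps. The snippet below is a minimal sketch of that mechanism, not the project's actual code; the model size, audio path, and output handling are illustrative.

```python
from faster_whisper import WhisperModel

# Illustrative model size and compute type; the project may use different settings.
model = WhisperModel("small", device="cuda", compute_type="float16")

# word_timestamps=True makes faster-whisper emit per-word start/end times,
# which is what allows the player to highlight each word as it is sung.
segments, info = model.transcribe("song.mp3", word_timestamps=True)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
    for word in segment.words:
        print(f"    {word.start:.2f}-{word.end:.2f}: {word.word}")
```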
Dean.Lewis.-.How.Do.I.Say.Goodbye.mp4
Foudeqush.-.Un.Sueno.Raro.mp4
Leander.Kills.-.Fajdalom.Elvitelre.mp4
Infected.Mushroom.-.Lost.In.Space.mp4
G-Idle.-.Oh.My.God.mp4
Faun.-.Abschied.mp4
Epic.-.The.Circe.Saga.03.-.Done.For.mp4
Erwin.Khachikian.-.Dasht.mp4
The project requires Python 3.8 or greater. It also requires you to be able to run faster-whisper, which in turn requires CUDA 11 or higher.
GPU execution requires the NVIDIA cuBLAS and cuDNN libraries to be installed.
CUDA 12.4 has been tested and runs fine.
Caution
I had big problems getting cuDNN / cuBLAS set up properly on Windows. The latest version that comes with an installer is not supported by faster-whisper, so that didn't work for me.
I ended up downloading this version: https://github.com/Purfview/whisper-standalone-win/releases/tag/libs. The files can either be placed in C:\Windows\System32\, or in the same directory as the Python scripts. Just unpack the four .dll files, put them there, and it should work.
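If you want to check that the libraries are being picked up before running a full transcription, a quick sanity check like the one below (not part of the project, just an illustration) will fail fast if cuBLAS or cuDNN cannot be loaded:

```python
from faster_whisper import WhisperModel

try:
    # Loading a tiny model on the GPU forces cuBLAS/cuDNN to be resolved.
    WhisperModel("tiny", device="cuda", compute_type="float16")
    print("GPU libraries found; CUDA execution should work.")
except Exception as exc:
    print(f"GPU setup problem: {exc}")
```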
Run `setup.bat` to create a virtual environment and install the requirements.
It will create a `launch.bat`, which activates the virtual environment and runs `app.py`.
This launches a window containing a Flask server. In that window you should see a URL; copy it into your web browser.
- Create a virtual environment.
- Run `pip install -r requirements.txt`
- Run `py app.py`
- Open the URL of the launched Flask server.
Note: Regarding the libraries, `tkinterdnd2` is only needed for the `batch_convert.py` script, and `flask` is only needed for the `app.py` script. `faster_whisper` is required by both.
Once you have launched the page successfully, you should see something like this.
To transcribe a song or sound file, just drag and drop the MP3 file onto the box and the transcription should start. The first time this is done, the required models will be downloaded to your local model folder.
This may take a few minutes, and there is no progress bar for it. Check the Flask server console window for the latest information.
The MP3 will be copied to the `/static/tracks/` folder, and two text files will be created alongside it. These contain the transcription of your file, split into lines and into individual words.
If you are successful, it should also automatically load this file for you in the web interface.
To re-launch this track in the future, choose it from the drop-down list or drag the file onto the interface again. The file will not be transcribed again if both text files already exist.
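The re-transcription check is essentially an existence test on the two text files. Here is a sketch of that logic; the file suffixes are assumptions for illustration, not necessarily the names the script actually uses:

```python
import os

def needs_transcription(track_path: str) -> bool:
    """Return True unless both transcription files already exist.

    The "_lines"/"_words" suffixes are illustrative; check the script
    for the actual naming convention.
    """
    base, _ = os.path.splitext(track_path)
    lines_file = base + "_lines.txt"
    words_file = base + "_words.txt"
    return not (os.path.exists(lines_file) and os.path.exists(words_file))
```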
To transcribe multiple songs at the same time, launch `batch_convert.bat` (if you used the automatic setup) or run `py batch_convert.py` (if you used the manual setup).
This launches a small GUI window onto which you can drag multiple files. Once you do, check that window's console to see the progress of the conversion; the GUI window itself does not report the conversion status.
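The drag-and-drop window is built on `tkinterdnd2`. A minimal sketch of that pattern is shown below; the real `batch_convert.py` does more, and the drop handler here only prints the dropped paths:

```python
from tkinterdnd2 import TkinterDnD, DND_FILES

def on_drop(event):
    # event.data contains the dropped file paths as a single string;
    # splitlist handles paths with spaces (they arrive wrapped in braces).
    for path in root.tk.splitlist(event.data):
        print("Queued for transcription:", path)

root = TkinterDnD.Tk()
root.title("Batch convert")
root.geometry("300x150")

# Register the whole window as a drop target for files.
root.drop_target_register(DND_FILES)
root.dnd_bind("<<Drop>>", on_drop)

root.mainloop()
```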
There are bugs. Synchronization isn't perfect, transcription is not perfect, and there are several known bugs with the music player. Feel free to report issues, and even better, fix them :)
- Sometimes the first words of lines are missing even though they are present in the words file.
- Some tracks get really messy overlapping text without line breaks.
- Words with overlapping timestamps should be pruned (see the sketch after this list).
- The model sometimes hallucinates words; how can this be improved? Could we rerun with different seeds or models and merge the outcomes?
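For the overlapping-timestamp issue, one possible (untested) approach is to drop any word whose start time falls before the end of the previously kept word:

```python
def prune_overlaps(words):
    """Drop words whose timestamps overlap the previous kept word.

    `words` is assumed to be a list of (start, end, text) tuples sorted
    by start time; this is only a sketch of one cleanup strategy.
    """
    pruned = []
    last_end = 0.0
    for start, end, text in words:
        if start < last_end:
            continue  # overlaps the previous word, skip it
        pruned.append((start, end, text))
        last_end = end
    return pruned
```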