An implementation of faster-whisper that outputs karaoke-style highlighted lyrics. Licensed under the GNU General Public License v3.0 (GPL-3.0).

Whisper Karaoke

Whisper Karaoke is a small script that takes MP3 files as input and uses the faster-whisper library to transcribe the audio, splitting the transcription into individual words.

The transcription is then used by a small web player to highlight each word as it is sung, karaoke-style.
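The highlighting step can be sketched roughly like this (the tuple format and function name are my own illustration, not the project's actual player code): given word timestamps sorted by start time, find the word active at the current playback position.

```python
import bisect

def current_word_index(words, position):
    """Return the index of the word active at `position` seconds, or None.

    `words` is a list of (start, end, text) tuples sorted by start time.
    """
    starts = [w[0] for w in words]
    # Rightmost word whose start time is <= position.
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0 and position < words[i][1]:
        return i
    return None

# Hypothetical timestamps for illustration.
words = [(0.0, 0.4, "How"), (0.4, 0.7, "do"), (0.7, 1.2, "I"), (1.3, 2.0, "say")]
print(current_word_index(words, 0.5))  # -> 1 ("do" is active at 0.5 s)
```

The real player does this in JavaScript against the audio element's current time, but the lookup logic is the same idea.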

Video Examples

Dean.Lewis.-.How.Do.I.Say.Goodbye.mp4
Foudeqush.-.Un.Sueno.Raro.mp4
Leander.Kills.-.Fajdalom.Elvitelre.mp4
Infected.Mushroom.-.Lost.In.Space.mp4
G-Idle.-.Oh.My.God.mp4
Faun.-.Abschied.mp4
Epic.-.The.Circe.Saga.03.-.Done.For.mp4
Erwin.Khachikian.-.Dasht.mp4

Requirements

The project requires Python 3.8 or greater. You also need to be able to run faster-whisper, which in turn requires CUDA 11 or higher.

GPU execution requires the following NVIDIA libraries to be installed:

cuBLAS for CUDA 11

cuDNN 8 for CUDA 11

CUDA 12.4 has been tested and runs fine.

Caution

I had big problems getting cuDNN / cuBLAS set up properly on Windows. The latest version that ships with an installer is not supported by faster-whisper, so that route did not work for me.

I ended up downloading this version: https://github.com/Purfview/whisper-standalone-win/releases/tag/libs. The files can be placed either in C:\Windows\System32\ or in the same directory as the Python scripts. Just unpack the four .dll files, put them there, and it should work.

Installation

Automatic installation

Run setup.bat to create a virtual environment and install the requirements.

It will create a launch.bat, which activates the virtual environment and runs app.py.

This opens a console window running a Flask server. In that window you should see a URL; copy it into your web browser.

Manual installation

  1. Create a virtual environment.
  2. Run pip install -r requirements.txt
  3. Run py app.py
  4. Open the URL of the launched flask server.

Note: tkinterdnd2 is only needed for the batch_convert.py script, and flask is only needed for the app.py script; faster_whisper is required by both.
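Based on that note, a minimal requirements.txt consistent with this project would contain just these three libraries (listed unpinned here as an assumption; the actual file may pin specific versions):

```
faster_whisper
flask
tkinterdnd2
```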

How to use

Once you have loaded the page successfully, you should see the main drag-and-drop interface.

To transcribe a song or sound file, just drag and drop the MP3 file onto the box and the transcription should start. The first time this is done, the required models will be downloaded automatically.

It may take a few minutes, and there is no progress bar for this. Check the Flask server console window for the latest status.


The MP3 file will be copied to the /static/tracks/ folder, and two text files will be created alongside it: the transcription of your file as lines and as words.
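The exact layout of those two text files isn't documented here, so as a rough illustration of the idea only (the file suffixes, tab-separated layout, and field order below are my assumptions, not the project's real format), a transcription could be split into a lines file and a words file like this:

```python
def write_transcription_files(segments, base_path):
    """Write two companion text files for a transcription.

    `segments` is a list of dicts like:
      {"start": 0.0, "end": 2.1, "text": "How do I", "words": [(0.0, 0.4, "How"), ...]}
    The .lines.txt/.words.txt names and tab-separated layout are illustrative only.
    """
    # One line per segment: start, end, full text.
    with open(base_path + ".lines.txt", "w", encoding="utf-8") as f:
        for seg in segments:
            f.write(f"{seg['start']:.2f}\t{seg['end']:.2f}\t{seg['text']}\n")
    # One line per word: start, end, word.
    with open(base_path + ".words.txt", "w", encoding="utf-8") as f:
        for seg in segments:
            for start, end, word in seg["words"]:
                f.write(f"{start:.2f}\t{end:.2f}\t{word}\n")
```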

If you are successful, it should also automatically load this file for you in the web interface.

To reload this track in the future, choose it from the drop-down list or drag the file onto the interface again. The file will not be transcribed again if both text files already exist.
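That caching behavior boils down to a simple existence check. A sketch (the companion-file suffixes here are my own assumptions, not necessarily the project's real naming):

```python
import os

def needs_transcription(mp3_path):
    """Return True if the track still needs transcribing.

    Work is skipped only when BOTH companion text files already exist;
    the .lines.txt/.words.txt suffixes are assumptions for illustration.
    """
    base, _ = os.path.splitext(mp3_path)
    return not (os.path.exists(base + ".lines.txt")
                and os.path.exists(base + ".words.txt"))
```

Requiring both files guards against a half-finished transcription being treated as complete.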

Batch transcribing files

To transcribe multiple songs in one go, launch batch_convert.bat (automatic setup) or run py batch_convert.py (manual setup).


This launches a small GUI window onto which you can drag multiple files. Once you do, watch that window's console to follow the conversion progress; the GUI window itself does not show conversion status.


Known bugs

There are bugs. Synchronization isn't perfect, transcription is not perfect, and there are several known bugs with the music player. Feel free to report issues, and even better, fix them :)

  • Sometimes the first words of lines are missing even though they are there in the words file.

  • Some tracks get really messy overlapping text without linebreaks.

  • Words with overlapping timestamps should be pruned.

  • The model sometimes hallucinates words. How can this be improved? Could we rerun with different seeds or models and merge the outcomes?
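For the overlapping-timestamps item above, one possible pruning pass (my own sketch, not the project's code) keeps the earlier word and drops any word that starts before the previously kept word ends:

```python
def prune_overlapping_words(words):
    """Drop words whose start time falls inside the previously kept word.

    `words` is a list of (start, end, text) tuples sorted by start time.
    """
    kept = []
    for start, end, text in words:
        if kept and start < kept[-1][1]:
            continue  # overlaps the previous kept word; drop it
        kept.append((start, end, text))
    return kept
```

Keeping the earlier word is an arbitrary choice here; an alternative would be to keep the longer word, or to clip the overlapping timestamps instead of dropping a word.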