This work is the first-place solution (on the private test set) of the ZAC2022 Lyric Alignment track.
Karaoke Maker is a task that predicts the lyrics and melody of a given music audio clip. It can be used in various fields such as music production and karaoke.
Since the provided ground truth (gt) is too noisy, we first precompute a loss scale for each word in the lyrics. The loss scale is essentially the IoU between the output of forced alignment (https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html) and the provided gt. If the IoU is less than a certain threshold, we set the loss scale to 0; otherwise, we set it to 1.
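A minimal sketch of this filtering rule, assuming word segments are `(start, end)` pairs in seconds; the threshold value of 0.5 is illustrative, not the value actually used:

```python
def segment_iou(a, b):
    """IoU of two 1-D time segments a=(start, end), b=(start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def word_loss_scales(forced_align_segs, gt_segs, threshold=0.5):
    """One loss scale per word: 1 if forced alignment and gt agree, else 0."""
    return [
        1.0 if segment_iou(fa, gt) >= threshold else 0.0
        for fa, gt in zip(forced_align_segs, gt_segs)
    ]
```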
We modified the Whisper model to fit the competition task. An extra head is added to the encoder and trained with CTC loss. The decoder is extended with a `word_seg_embed` head that predicts the word segments (start, end), trained with a GIoU + L1 loss. The other parts are kept the same as the original Whisper model. Please take a look at `kmaker/model.py` for more details.
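Below is a minimal sketch, assuming the `openai-whisper` package, of how the two extra heads could be attached to Whisper. It is not the actual `kmaker/model.py` implementation; the class name, `ctc_vocab_size`, and the forward hook used to capture decoder hidden states are illustrative assumptions.

```python
import torch
import torch.nn as nn
import whisper


class KMakerSketch(nn.Module):
    """Illustrative sketch of Whisper + CTC head + word-segment head (not the real model)."""

    def __init__(self, whisper_name: str = "base", ctc_vocab_size: int = 100):
        super().__init__()
        self.whisper = whisper.load_model(whisper_name)
        d_audio = self.whisper.dims.n_audio_state
        d_text = self.whisper.dims.n_text_state

        # Extra head on the encoder output, trained with CTC loss on the lyric tokens.
        self.ctc_head = nn.Linear(d_audio, ctc_vocab_size)
        # Extra head on the decoder hidden states, predicting (start, end) per token.
        self.word_seg_embed = nn.Linear(d_text, 2)

        # Whisper's TextDecoder.forward returns token logits, so capture the hidden
        # states after its final LayerNorm with a forward hook.
        self._dec_feat = None
        self.whisper.decoder.ln.register_forward_hook(
            lambda _m, _inp, out: setattr(self, "_dec_feat", out)
        )

    def forward(self, mel: torch.Tensor, tokens: torch.Tensor):
        audio_feat = self.whisper.encoder(mel)              # (B, T_audio, d_audio)
        ctc_logits = self.ctc_head(audio_feat)              # fed to nn.CTCLoss
        token_logits = self.whisper.decoder(tokens, audio_feat)
        word_segs = self.word_seg_embed(self._dec_feat)     # (B, T_text, 2)
        return ctc_logits, token_logits, word_segs
```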
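The `word_seg_embed` output can then be supervised with a 1-D analogue of GIoU plus L1, roughly as in this sketch; the relative weighting of the two terms is an assumption:

```python
import torch
import torch.nn.functional as F


def giou_l1_segment_loss(pred: torch.Tensor, target: torch.Tensor, l1_weight: float = 1.0):
    """pred, target: (N, 2) tensors of (start, end) word segments."""
    p0, p1 = pred[:, 0], pred[:, 1]
    t0, t1 = target[:, 0], target[:, 1]
    inter = (torch.min(p1, t1) - torch.max(p0, t0)).clamp(min=0)
    union = (p1 - p0) + (t1 - t0) - inter
    enclose = torch.max(p1, t1) - torch.min(p0, t0)
    # 1-D GIoU: IoU minus the fraction of the enclosing span not covered by the union.
    giou = inter / union.clamp(min=1e-6) - (enclose - union) / enclose.clamp(min=1e-6)
    return (1.0 - giou).mean() + l1_weight * F.l1_loss(pred, target)
```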
- ffmpeg # for audio/video processing
- python >= 3.8
- pytorch, torchaudio
conda create -n kmaker python=3.8
# Install pytorch https://pytorch.org/get-started/locally/
pip install git+https://github.com/openai/whisper.git
pip install -r requirements.txt
pip install -e ./
python tools/predict_one_song.py asset/12300.json --audio_file asset/12300.mp3 --output_file output/12300.mp4
- The current model is only trained on a Vietnamese music dataset
- The length of the input audio must be less than 30s