This work is the first-place solution (on the private test set) of the ZAC2022 Lyric Alignment track.
Karaoke Maker is a task that predicts the lyrics and melody of a given music audio clip. It can be used in various fields such as music production and karaoke.
Since the provided ground truth (gt) is too noisy, we first precompute a loss scale for each word in the lyrics. The loss scale is essentially the IoU between the output of forced alignment (https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html) and the provided gt. If the IoU is less than a certain threshold, we set the loss scale to 0; otherwise, we set it to 1.
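A minimal sketch of this filtering rule, assuming word segments are `(start, end)` pairs in seconds; the threshold value of 0.5 is illustrative, not the value actually used:

```python
def segment_iou(a, b):
    """IoU of two 1-D time segments a=(start, end), b=(start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def word_loss_scales(forced_align_segs, gt_segs, threshold=0.5):
    """One loss scale per word: 1 if forced alignment and gt agree, else 0."""
    return [
        1.0 if segment_iou(fa, gt) >= threshold else 0.0
        for fa, gt in zip(forced_align_segs, gt_segs)
    ]
```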
We modified the Whisper model to fit the competition task. An extra head is added to the encoder and trained with CTC loss. The decoder is extended with a `word_seg_embed` head that predicts the word segments (start, end), trained with a GIoU + L1 loss. The other parts are kept the same as the original Whisper model. Please take a look at `kmaker/model.py` for more details.
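Below is a minimal sketch, assuming the `openai-whisper` package, of how the two extra heads could be attached to Whisper. It is not the actual `kmaker/model.py` implementation; the class name, `ctc_vocab_size`, and the forward hook used to capture decoder hidden states are illustrative assumptions.

```python
import torch
import torch.nn as nn
import whisper


class KMakerSketch(nn.Module):
    """Illustrative sketch of Whisper + CTC head + word-segment head (not the real model)."""

    def __init__(self, whisper_name: str = "base", ctc_vocab_size: int = 100):
        super().__init__()
        self.whisper = whisper.load_model(whisper_name)
        d_audio = self.whisper.dims.n_audio_state
        d_text = self.whisper.dims.n_text_state

        # Extra head on the encoder output, trained with CTC loss on the lyric tokens.
        self.ctc_head = nn.Linear(d_audio, ctc_vocab_size)
        # Extra head on the decoder hidden states, predicting (start, end) per token.
        self.word_seg_embed = nn.Linear(d_text, 2)

        # Whisper's TextDecoder.forward returns token logits, so capture the hidden
        # states after its final LayerNorm with a forward hook.
        self._dec_feat = None
        self.whisper.decoder.ln.register_forward_hook(
            lambda _m, _inp, out: setattr(self, "_dec_feat", out)
        )

    def forward(self, mel: torch.Tensor, tokens: torch.Tensor):
        audio_feat = self.whisper.encoder(mel)              # (B, T_audio, d_audio)
        ctc_logits = self.ctc_head(audio_feat)              # fed to nn.CTCLoss
        token_logits = self.whisper.decoder(tokens, audio_feat)
        word_segs = self.word_seg_embed(self._dec_feat)     # (B, T_text, 2)
        return ctc_logits, token_logits, word_segs
```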
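The `word_seg_embed` output can then be supervised with a 1-D analogue of GIoU plus L1, roughly as in this sketch; the relative weighting of the two terms is an assumption:

```python
import torch
import torch.nn.functional as F


def giou_l1_segment_loss(pred: torch.Tensor, target: torch.Tensor, l1_weight: float = 1.0):
    """pred, target: (N, 2) tensors of (start, end) word segments."""
    p0, p1 = pred[:, 0], pred[:, 1]
    t0, t1 = target[:, 0], target[:, 1]
    inter = (torch.min(p1, t1) - torch.max(p0, t0)).clamp(min=0)
    union = (p1 - p0) + (t1 - t0) - inter
    enclose = torch.max(p1, t1) - torch.min(p0, t0)
    # 1-D GIoU: IoU minus the fraction of the enclosing span not covered by the union.
    giou = inter / union.clamp(min=1e-6) - (enclose - union) / enclose.clamp(min=1e-6)
    return (1.0 - giou).mean() + l1_weight * F.l1_loss(pred, target)
```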
- ffmpeg # for audio/video processing
- python >= 3.8
- pytorch, torchaudio
conda create -n kmaker python=3.8
# Install pytorch https://pytorch.org/get-started/locally/
pip install git+https://github.com/openai/whisper.git
pip install -r requirements.txt
pip install -e ./
python tools/predict_one_song.py asset/12300.json --audio_file asset/12300.mp3 --output_file output/12300.mp4
- The current model is only trained on a Vietnamese music dataset
- The length of the input audio must be less than 30s