This is a solution for track Lyric Alignment - From VTS-HTML (TOP 3 Private test Zalo AI Challenge)
Details are at the link of the ZaloAI - Lyric Alignment
Our method is based on paper Improving Lyrics Alignment through Joint Pitch Detection by Jiawen Huang, Emmanouil Benetos, Sebastian Ewer, (ICASSP). 2022.
The overview will be discussed below this figure
Realizing that in the ZaloAI dataset there are many songs with fast tempo and difficult to hear the vocals, so we decided to preprocess the audio with a small module Vocal remover by Tsurumeso. From now on, we have an audio file with the sound containing only clear vocals.
The acoustic model takes the Mel-spectrogram of the separated vocals as input, and produces the phoneme posteriorgram.
It consists of a convolutional layer, a residual convolutional block, a fully-connected layer, 3 bidirectional LSTM (Long Short-Term Memory) layers, a final fullyconnected layer, and non-linearities in between. Model trained using CTC Loss.
The alignment process use Viterbi forced alignment. This can be computed efficiently via dynamic programming.
First, we tried to fine-tuning this method on ZaloAI 2022 datasets, the results in the public leaderboard seem not good
So we used the pretrained baseline to create pseudo labels.
However, instead of using 100% pseudo-labels, we calculate IoU between pseudo-labels and GT from ZaloAI Datasets. If IoU is less than 0.7, we will use pseudo-label, otherwise we will use GT from Zalo. (approximate 1/3 total samples)
After training with this dataset, the results improved significantly
The predicted results of our model are quite close to reality. However, GT has a characteristic that the end time of the previous word and the start time of the following word is the same. So we extend the gaps between the start and end times of two adjacent words into the previous word. This improves our results quite a bit
Check the notebook for a quick example.
Update later...
[1] Jiawen Huang, Emmanouil Benetos, Sebastian Ewert, "Improving Lyrics Alignment through Joint Pitch Detection," International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022.
[2] Gabriel Meseguer-Brocal, Alice Cohen-Hadria, and Geoffroy Peeters, “Creating DALI, a large dataset of synchronized audio, lyrics, and notes,” Transactions of the International Society for Music Information Retrieval, vol. 3, no. 1, pp. 55–67, 2020.
[3] https://github.com/tsurumeso/vocal-remover
Hoang Vien Duy - hoangduy.cqb.2k@gmail.com