Final project for the UPC Postgraduate Course Artificial Intelligence with Deep Learning, edition Spring 2021
Team: Anne-Kristin Fischer, Joan Prat Rigol, Eduard Rosés Gibert, Marina Rosés Gibert
Advisor: Gerard I. Gállego
GitHub repository: https://github.com/ttecles/aidl-lyrics-recognition
This is how we recommend installing the project.
- Introduction
- Data Set
- Working Environment
- General Architecture
- Preprocessing the data set
- Results and results improvement
- Web Application
- Conclusions
- Imagine one month more...
- Next Steps
- References
To this day, little research has been done on music lyrics recognition, which is still considered a complex task. It can be broken down into two subtasks:
- The singing voice needs to be extracted from the song by means of source separation. What seems to be an easy task for the human brain remains a brain teaser for digital signal processing because of the complex mixture of signals.
- The second subtask aims to transcribe the extracted singing voice into written text. This can be thought of as a speech recognition task. A lot of progress has been made on standard speech recognition. However, experiments with music have shown that recognizing the text of a singing voice is harder than recognizing pure speech because of the additional acoustic features of singing.
Practical applications of music lyrics recognition, such as the creation of karaoke versions or music information retrieval tasks, motivate us to tackle the aforementioned challenges.
Our decision to pursue a lyrics recognition task with deep learning techniques is an attempt to combine several of our personal and professional interests. All team members have a more or less professional background in the music industry, in addition to a particular interest in source separation tasks and natural language processing.
Figure 1: Our passion for music, language and deep learning combined.
- Extraction of the voice of a song and transcription of the lyrics with Demucs and Wav2Vec models
- Analysis of the results
- Deployment of a web application for lyrics extraction
- Suggestions for further studies and investigation
To reach our goals, we set up the following milestones:
- Find a suitable data set
- Preprocess the data so that it can be fed into the model
- Define the model
- Implement the model
- Train the model
- Analyse the obtained results
- Implement the project inside a web application
- Make suggestions for further investigation
- Optional: add a language model to improve the results of the transcription task
To train our model we opted for the DALI data set, published in 2018. To this day it is the biggest data set in the field of singing voice research that aligns audio with notes and their lyrics to high quality standards. We were granted access to the first version, DALI v1, with 5358 full-length songs in multiple languages. For more information, please see also this article, published by the International Society for Music Information Retrieval. This is a graphical representation of the DALI data set:
Figure 2: Alignment of notes and text in DALI data set based on triples of {time (start and duration), note, text}
Figure 3: Horizontal granularity in DALI data set where paragraphs, lines, words and notes are interconnected vertically
To develop the base model with 395 million parameters, we used Google Colab because it was fast and easy for us to access. To visualize the results we used wandb. For development we used a local environment. For the full training with 580 million parameters we then switched to a VM instance with one GPU (Tesla K80) and 4 CPUs on Google Cloud. To improve performance we switched again to a VM with 4 GPUs (GeForce RTX 3090). PyTorch is used as the overall framework.
So far, little research has been done on music lyrics recognition in general, and most of it uses spectrograms in combination with CNNs. In this project we explore a potentially high-performing alternative that combines two strong models: Demucs for the source separation task and Wav2Vec 2.0 for the transcription task. At the time of writing, Demucs is the best-performing waveform-based model for source separation and the only waveform-based model that can compete with the more commonly used spectrogram-based models. Wav2Vec is considered the state-of-the-art model for automatic speech recognition. Additionally, we added KenLM as a language model on top to improve the output of the transcription task.

As the final model we opted for concatenating a pretrained Demucs and a pretrained Wav2Vec model to perform end-to-end training. The loss is computed by comparing the ground-truth lyrics against the lyrics obtained from the Wav2Vec output. Demucs consists of a convolutional encoder, an LSTM and a convolutional decoder. Wav2Vec consists of convolutional layers and a transformer, and works at character level. For the final stage of the Wav2Vec model we apply the CTC (Connectionist Temporal Classification) algorithm. Because Wav2Vec predicts a character every few milliseconds, CTC is needed to collapse the repeated characters in the prediction.
Figure 4: Overall model architecture with detailed views of the Demucs and Wav2Vec architectures
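To illustrate how CTC removes repeated characters, here is a minimal sketch of greedy CTC decoding on a toy sequence of per-frame labels. It is not the project's decoding code: the blank symbol `_` and the function name `ctc_greedy_collapse` are assumptions made for this example, and a real model would first take the most likely label for each frame from its output logits.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this illustration


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels into text the way greedy CTC decoding does."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop the blank symbol, which only separates genuine repeats.
    return "".join(label for label in merged if label != blank)


# One label per audio frame; the blank between the two "LL" runs keeps the double L.
frames = list("HH_EE_LL_LL_OO")
decoded = ctc_greedy_collapse(frames)
print(decoded)
```

The order of the two steps matters: merging before removing blanks is what lets a blank between two identical characters preserve a genuine double letter.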
Preprocessing the data set correctly for our purpose proved to be one of the major obstacles we encountered. We focused on songs in English only, that is, 3491 full-length songs. Preprocessing included removing special characters, dropping negative time stamps and converting the lyrics to upper case. To obtain meaningful results after training and to avoid cut-off lyrics, we split the songs into chunks. For these chunks we discarded words that straddled the boundary between consecutive chunks and cut out silent passages without voice. To make the data accessible to our model, we resampled the audio waveform to a sample rate of 44100 Hz. As alignment is done automatically in DALI and ground truth is available for only a few audio samples, we followed the authors' suggestions for the train/validation/test split. That is:
Figure 6: Suggested NCCt scores for train, validation and test
where NCCt is a correlation score that indicates how accurate the automatic alignment is. Higher is better. The number of tracks refers to the whole data set, including songs in other languages, for both the first and second version of the data set.
When doing a first training run over the full data set, that is, 59958 chunks, to our surprise we initially obtained a negative loss. This could be explained by training on data slices containing no lyrics.
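The text cleaning and chunking steps described in this section can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual preprocessing code: the function names, the `(start, end, text)` word format and the fixed chunk length are hypothetical, and the audio side (resampling, cutting the waveform) is left out.

```python
import re


def normalize_lyrics(text):
    """Upper-case the text and keep only letters, apostrophes and spaces."""
    text = text.upper()
    text = re.sub(r"[^A-Z' ]+", " ", text)
    return " ".join(text.split())


def chunk_words(words, chunk_seconds):
    """Group time-aligned words into fixed-length chunks.

    `words` is a list of (start, end, text) tuples with times in seconds.
    Returns a dict mapping chunk index to the chunk's cleaned lyrics.
    """
    chunks = {}
    for start, end, text in words:
        if start < 0 or end <= start:
            continue  # drop negative or empty time stamps
        index = int(start // chunk_seconds)
        if end > (index + 1) * chunk_seconds:
            continue  # word crosses into the next chunk: discard it
        cleaned = normalize_lyrics(text)
        if cleaned:
            chunks.setdefault(index, []).append(cleaned)
    # Chunks that received no words never appear, so silent passages are skipped.
    return {i: " ".join(ws) for i, ws in sorted(chunks.items())}


words = [
    (-0.5, 0.2, "Oh"),       # negative time stamp, dropped
    (0.5, 1.0, "Hello,"),
    (1.2, 2.0, "world!"),
    (4.8, 5.3, "split"),     # crosses the 5 s boundary, dropped
    (11.0, 11.5, "it's"),
]
chunked = chunk_words(words, chunk_seconds=5)
print(chunked)
```

In this sketch a chunk without any lyrics is simply never created, which is one way to keep empty slices out of training.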
Step | Comments |
---|---|
Hypothesis | Our model will output awesome lyrics predictions. |
Set up | |
Results | Our model shows weird metrics. |
Conclusions | We are not sure if our model is even training. |
Links | Run, Report |
To make sure our model was actually working properly, we ran a sanity check: we tested whether the model was able to overfit a small batch. This training run also revealed how much Demucs was being corrupted: the voice quality got worse epoch by epoch. Please see below the audio waveform at step 0 compared to step 29 for reference.
Step | Comments |
---|---|
Hypothesis | Our model works if it is “able” to overfit. |
Set up | |
Results | Our model overfits. |
Conclusions | Our model is working and actually training. |
Links | Run, Report, Audio track |
We then gradually increased the batch size, using a small, controllable data set with an NCCt score higher than 0.95, to make sure our model would still train properly.
Step | Comments |
---|---|
Hypothesis | Model should train full set. |
Set up | |
Results | We need more power. It crashed because of bugs. Dataset is not totally clean. |
Conclusions | Use multi GPU. |
Links | Run, Report |
When training on 5 GPUs, we were finally able to obtain some visuals. We even managed to outperform the ConvTasNet model in this paper in terms of the initial WER.
Step | Comments |
---|---|
Hypothesis | Performance will be better and we will obtain visuals (WER). |
Set up | |
Results | We obtained a better WER (69.348) than the initial WER in this paper, table 3 (WER=75.91). |
Conclusions | Performance improved and we can try to integrate the feature extractor in a follow-up experiment. |
Links | Run, Test results |
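For readers unfamiliar with the metric, the word error rate (WER) is the word-level edit distance between the reference lyrics and the predicted lyrics, divided by the number of reference words. The sketch below is a plain-Python illustration and not the evaluation code used in our runs. The function name and the example sentences are made up, and the values reported above appear to be on a percentage scale, which corresponds to multiplying the result by 100.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One deleted word ("MY") and one substituted word ("FRIEND" -> "FRIENDS").
wer = word_error_rate("HELLO DARKNESS MY OLD FRIEND", "HELLO DARKNESS OLD FRIENDS")
print(f"WER: {100 * wer:.1f}%")
```

Because insertions count as errors, the WER can exceed 100% when the prediction is much longer than the reference.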
For the last experiment we considered a different learning rate and applied Wav2Vec's feature extractor.
Step | Comments |
---|---|
Hypothesis | Get better results by fine-tuning feature extractor. |
Set up | audio length: 5, learning rate: 1e-5, batch size: 42, epochs: 3, train length: 110030, alpha: 0.5, beta: 5.0 |
Results | WER: 68.593, BEAMSEARCH WER: 68.064, LM: 73.656 |
Conclusions | There is room for improvement in the language model. |
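The set-up above lists `alpha` and `beta` without defining them. In many CTC beam search decoders with a language model, `alpha` weights the language model score and `beta` is a bonus per decoded word. Assuming that is also their meaning here, the combined score of a candidate transcription can be sketched as below. The function names and all log-probabilities are invented for illustration and do not come from our runs.

```python
def fused_score(acoustic_logprob, lm_logprob, num_words, alpha=0.5, beta=5.0):
    """Combine acoustic and language model scores (assumed shallow-fusion form)."""
    return acoustic_logprob + alpha * lm_logprob + beta * num_words


def pick_best(candidates, alpha=0.5, beta=5.0):
    """Return the text of the candidate with the highest fused score.

    `candidates` is a list of (text, acoustic_logprob, lm_logprob) tuples.
    """
    def score(candidate):
        text, acoustic_logprob, lm_logprob = candidate
        return fused_score(acoustic_logprob, lm_logprob, len(text.split()), alpha, beta)

    return max(candidates, key=score)[0]


# Hypothetical beam: the acoustic model slightly prefers the misspelled candidate,
# while the language model strongly prefers the correct one.
beam = [
    ("LET IT BE", -4.0, -6.0),
    ("LET IT BEE", -3.5, -12.0),
]
best_with_lm = pick_best(beam, alpha=0.5, beta=5.0)
best_without_lm = pick_best(beam, alpha=0.0, beta=5.0)
print(best_with_lm, "|", best_without_lm)
```

With `alpha` set to zero the language model has no influence, so tuning `alpha` and `beta` on a validation set is one possible starting point for improving the language model result.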
To show the results of our project, we additionally deployed a web application.
As mentioned before, recognition tasks in music are still considered complex. We could confirm this for our lyrics recognition task in particular, for the following reasons. Now as before, the community laments a lack of well-structured, aligned, large data sets for music information retrieval tasks. Furthermore, little reference literature was available. As a consequence, we had to address the challenge of finding an appropriate data set and model architecture. We found it particularly hard not only to understand potential model architectures but also to prepare the data appropriately. A special takeaway was the finding that freezing features inside a pretrained model would be necessary. We believe that our approach is a powerful solution for lyrics recognition. However, its high computational cost in terms of time and money is evident, and the implementation remains to be optimized.
Based on our findings we consider the following scenarios worth further investigation:
- Train end-to-end after unfreezing Demucs
- Train more epochs
- Faster implementation of beam search
- Train with different models
In a broader sense and in the context of music information retrieval tasks in general, further research could be done for:
- Melody extraction
- Chord transcription
- Summary of the lyrics
- Contribution to larger datasets of high quality
https://ieeexplore.ieee.org/abstract/document/5179014
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1318.pdf
https://ieeexplore.ieee.org/document/6682644?arnumber=6682644
https://www.researchgate.net/publication/42386897_Automatic_Recognition_of_Lyrics_in_Singing
https://europepmc.org/article/med/20095443
https://asmp-eurasipjournals.springeropen.com/articles/10.1155/2010/546047
https://arxiv.org/abs/2102.08575
https://github.com/facebookresearch/demucs
http://ismir2018.ircam.fr/doc/pdfs/35_Paper.pdf
https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/
https://colab.research.google.com/notebooks/intro.ipynb?hl=en
https://pytorch.org/docs/stable/torch.html
https://transactions.ismir.net/articles/10.5334/tismir.30/
https://medium.com/descript/challenges-in-measuring-automatic-transcription-accuracy-f322bf5994f