Lyrics Recognition using Deep Learning Techniques

Final project for the UPC Postgraduate Course Artificial Intelligence with Deep Learning, edition Spring 2021

Team: Anne-Kristin Fischer, Joan Prat Rigol, Eduard Rosés Gibert, Marina Rosés Gibert

Advisor: Gerard I. Gállego

GitHub repository: https://github.com/ttecles/aidl-lyrics-recognition

This is how we recommend installing the project.

Table of Contents

  1. Introduction
    1. Motivation
    2. Project Goals
    3. Milestones
  2. Data Set
  3. Working Environment
  4. General Architecture
  5. Preprocessing the data set
  6. Results and results improvement
    1. Experiment 1: First train with the full dataset
    2. Experiment 2: Overfitting with one chunk
    3. Experiment 3: Long run in Google VM 2
    4. Experiment 4: Multiple GPUs with WER
    5. Experiment 5: Train with feature extractor
  7. Web Application
  8. Conclusions
  9. Imagine one month more...
  10. Next Steps
  11. References

1. Introduction

To this day, little research has been done on music lyrics recognition, which is still considered a complex task. The problem can be broken down into two subtasks:

  1. The singing voice needs to be extracted from the song by means of source separation. What seems to be an easy task for the human brain remains a brain teaser for digital signal processing because of the complex mixture of signals.
  2. The second subtask aims to transcribe the obtained audio of the singing voice into written text. This can be thought of as a speech recognition task, and a lot of progress has been made on standard speech recognition. However, experiments with music have made evident that recognising the text of a singing voice is more complex than pure speech recognition because of the richer acoustic characteristics of singing.

Practical applications of music lyrics recognition, such as the creation of karaoke versions or music information retrieval tasks, motivate us to tackle the aforementioned challenges.

To top

1.1 Motivation

Our decision to work on a lyrics recognition task with deep learning techniques is an attempt to combine several of our personal and professional interests. All team members have a more or less professional background in the music industry, in addition to a particular interest in source separation tasks and natural language processing.

Figure 1: Our passion for music, language and deep learning combined.

To top

1.2 Project Goals

  • Extraction of the voice of a song and transcription of the lyrics with Demucs and Wav2Vec models
  • Analysis of the results
  • Deployment of a web application for lyrics extraction
  • Suggestions for further studies and investigation

To top

1.3 Milestones

To reach our goals, we set up the following milestones:

  • Find a suitable data set
  • Preprocess the data for its implementation into the model
  • Define the model
  • Implement the model
  • Train the model
  • Analyse the obtained results
  • Implement the project inside a web application
  • Make suggestions for further investigation
  • Optional: add a language model to improve the results of the transcription task

To top

2. Data Set

To train our model we opted for the DALI data set, published in 2018. It is to this day the biggest data set in the field of singing voice research that aligns audio to notes and their lyrics to high quality standards. We were granted access to the first version, DALI v1, with 5358 songs in full duration and multiple languages. For more information, please also see this article, published by the International Society for Music Information Retrieval. This is a graphical representation of the DALI data set:

Figure 2: Alignment of notes and text in DALI data set based on triples of {time (start and duration), note, text}

Figure 3: Horizontal granularity in DALI data set where paragraphs, lines, words and notes are interconnected vertically
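For orientation, the sketch below shows how the DALI annotations can be loaded with the data set's own Python package; the local path is a placeholder and the exact annotation keys follow the package's documented format, so treat them as assumptions.

```python
# Minimal sketch of loading DALI v1 annotations with the DALI Python package
# (the dataset path is a placeholder; exact annotation keys are assumptions).
import DALI as dali_code

dali_data_path = "/path/to/DALI_v1.0/"  # placeholder location of the downloaded data set
dali_data = dali_code.get_the_DALI_dataset(dali_data_path, skip=[], keep=[])

entry = next(iter(dali_data.values()))          # pick any song entry
print(entry.info)                               # song-level metadata (artist, title, scores, ...)
print(entry.annotations["annot"]["words"][0])   # first aligned word: time span and text
```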

To top

3. Working Environment

To develop the base model with 395 million parameters, we used Google Colab, as it was fast and easy for us to access. For visualization of the results we used Weights & Biases (wandb). For development we used a local environment. For the full training with 580 million parameters we then switched to a VM instance with one GPU (Tesla K80) and 4 CPUs on Google Cloud. To improve performance we switched again to a VM with 4 GPUs (GeForce RTX 3090). PyTorch is used as the overall framework.
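As a hedged illustration of the multi-GPU setup, the snippet below shows one common PyTorch pattern, nn.DataParallel; whether the project used DataParallel or DistributedDataParallel is an assumption of this sketch.

```python
# Sketch: replicate the model across all visible GPUs with nn.DataParallel
# (the actual multi-GPU strategy used in the project may differ).
import torch
import torch.nn as nn

def to_available_devices(model: nn.Module) -> nn.Module:
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # splits each batch across the GPUs
    return model.to("cuda" if torch.cuda.is_available() else "cpu")
```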

To top

4. General Architecture

Little research has been done so far on music lyrics recognition in general, and most existing approaches combine spectrograms with CNNs. In the context of this project we explore a potentially high-performing alternative by combining two strong models: the Demucs model for the source separation task and a Wav2Vec 2.0 model for the transcription task. Demucs is currently the best performing waveform-based model for source separation and so far the only waveform-based model that can compete with the more commonly used spectrogram-based models. Wav2Vec is considered the current state-of-the-art model for automatic speech recognition. Additionally, we implemented KenLM as a language model on top to improve the output of the transcription task.

As the final model implementation we opted for the concatenation of a pretrained Demucs and a pretrained Wav2Vec model to perform end-to-end training. The loss is computed by comparing the ground-truth lyrics against the lyrics obtained from the Wav2Vec output. Demucs consists of a convolutional encoder, an LSTM and a convolutional decoder. Wav2Vec 2.0 consists of convolutional layers followed by a transformer and works at character level. On the final part of the Wav2Vec output we apply the CTC (Connectionist Temporal Classification) algorithm, which collapses repeated characters in the prediction, since the Wav2Vec model predicts a character every few milliseconds.

Figure 4: Overall model architecture with detailed insights into the Demucs and Wav2Vec architectures
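To make the data flow concrete, here is a minimal inference sketch of the chained pipeline, assuming the Hugging Face Wav2Vec2ForCTC interface and using a placeholder separate_vocals helper to stand in for the pretrained Demucs separator; the project's actual modules and training loop differ.

```python
# Sketch of the chained pipeline: Demucs-style separation -> Wav2Vec 2.0 -> CTC decoding.
# `separate_vocals` is a placeholder, and the checkpoint name is illustrative.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def separate_vocals(mixture: torch.Tensor) -> torch.Tensor:
    """Placeholder for the pretrained Demucs separator: takes a song chunk
    (channels, samples) and returns the estimated vocal stem (samples,)."""
    raise NotImplementedError

def transcribe_chunk(mixture: torch.Tensor, sample_rate: int = 16_000) -> str:
    vocals = separate_vocals(mixture)                        # 1) source separation
    inputs = processor(vocals.numpy(), sampling_rate=sample_rate,
                       return_tensors="pt")                  # 2) feature extraction
    with torch.no_grad():
        logits = asr_model(inputs.input_values).logits       # 3) per-frame character logits
    pred_ids = torch.argmax(logits, dim=-1)                  # 4) greedy decoding
    return processor.batch_decode(pred_ids)[0]               # CTC collapse of repeats/blanks
```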

To top

Figure 5: CTC loss: architecture and its calculation
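For reference, this is how a CTC loss can be computed with PyTorch's built-in criterion; the shapes and vocabulary size below are arbitrary and not the project's values.

```python
# Illustrative CTC loss computation (random tensors, arbitrary sizes).
import torch
import torch.nn as nn

T, N, C = 200, 4, 32                                    # time steps, batch size, vocabulary incl. blank
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)    # per-frame character log-probabilities
targets = torch.randint(1, C, (N, 25))                  # encoded ground-truth lyrics
input_lengths = torch.full((N,), T)
target_lengths = torch.full((N,), 25)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```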

To top

5. Preprocessing the data set

Preprocessing the data set correctly for our purpose proved to be one of the major obstacles we encountered. We focused on songs in English only, that is 3491 songs in full duration. Preprocessing included removing special characters as well as negative time stamps and converting the lyrics to upper case. To make sure we would obtain meaningful results after training and to avoid cut-off lyrics, we prepared chunks. For these chunks we discarded words overlapping between consecutive chunks and cut out silent passages without voice. To make the data accessible for our model, we resampled the audio waveform to a sample rate of 44100 Hz. As alignment is done automatically in DALI and ground truth is available for only a few audio samples, we followed the authors' suggestions for the train/validation/test split. That is:

Figure 6: Suggested NCCt scores for train, validation and test

where NCCt is a correlation score that indicates how accurate the automatic alignment is; higher is better. The number of tracks refers to the whole data set, including songs in other languages, for both the first and the second version of the dataset.
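For illustration, a minimal sketch of the text normalisation and resampling steps described above; the regular expression and helper names are illustrative and do not reproduce the project's preprocessing code.

```python
# Sketch of chunk-level preprocessing: upper-case lyrics, strip special
# characters and resample the audio (illustrative, not the project's code).
import re
import torchaudio

def normalise_lyrics(text: str) -> str:
    """Upper-case the lyrics and drop everything except letters, apostrophes and spaces."""
    return re.sub(r"[^A-Z' ]+", " ", text.upper()).strip()

def resample(waveform, orig_sr: int, target_sr: int = 44_100):
    """Bring an audio chunk to the sample rate used for training."""
    if orig_sr == target_sr:
        return waveform
    return torchaudio.transforms.Resample(orig_sr, target_sr)(waveform)
```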

To top

6. Results and results improvement

6.1 Experiment 1: First train with the full dataset

When running a first training pass over the full dataset, that is 59958 chunks, we were surprised to initially obtain a negative loss. This could be explained by training on data slices containing no lyrics.

Hypothesis: Our model will output awesome lyrics predictions.
Set up: (configuration image)
Results: Our model shows weird metrics.
Conclusions: We are not sure if our model is even training.
Links: Run, Report


To top

6.2 Experiment 2: Overfitting with one chunk

To make sure our model was actually working properly, we ran a sanity check in which we tested whether the model could overfit a single chunk (a small batch). This training run also revealed how much Demucs gets corrupted by the training: the separated voice quality got worse epoch by epoch. Please see below the audio waveform at step 0 compared to step 29 for reference.

Hypothesis: Our model works if it is “able” to overfit.
Set up: (configuration image)
Results: Our model overfits.
Conclusions: Our model is working and actually training.
Links: Run, Report, Audio track

Figure: audio waveform of the separated voice at step 0 compared to step 29
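Such an overfitting sanity check can be sketched as a tiny loop that keeps fitting the same chunk and verifies the loss approaches zero; the model and batch interfaces below are placeholders, not the project's training script.

```python
# Sanity-check sketch: overfit a single chunk and watch the loss go towards zero.
# Assumes a model whose forward pass returns an object with a `.loss` attribute.
import torch

def overfit_one_chunk(model, batch, steps: int = 30, lr: float = 1e-4):
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        optimiser.zero_grad()
        loss = model(**batch).loss          # CTC loss against the chunk's lyrics
        loss.backward()
        optimiser.step()
        print(f"step {step}: loss {loss.item():.4f}")  # should decrease towards zero
```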

To top

6.3 Experiment 3: Long run in Google VM 2

We then gradually increased the batch size, using a small, controllable dataset with an NCCt score higher than 0.95, to make sure our model would still train properly.

Hypothesis: The model should train on the full set.
Set up: (configuration image)
Results: We need more power. The run crashed because of bugs. The dataset is not totally clean.
Conclusions: Use multiple GPUs.
Links: Run, Report


To top

6.4 Experiment 4: Multiple GPUs with WER

When training on 5 GPUs, we were finally able to obtain some visuals. We even managed to outperform the ConvTasNet model in this paper in terms of the initial WER.

Hypothesis: Performance will be better and we will obtain visuals (WER).
Set up: (configuration image)
Results: We obtained a better WER (69.348) than the initial WER in this paper, table 3 (WER = 75.91).
Conclusions: Performance improved and we can try to integrate the feature extractor in a follow-up experiment.
Links: Run, Test results
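The word error rate itself can be computed, for example, with the jiwer package; whether the project used jiwer is an assumption, and the strings below are made up.

```python
# Illustrative WER computation between a reference lyric line and a prediction.
import jiwer

reference = "SHE LOVES YOU YEAH YEAH YEAH"
hypothesis = "SHE LOVE YOU YEAH YEAH"
print(jiwer.wer(reference, hypothesis))  # 2 word errors over 6 reference words ≈ 0.33
```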


To top

6.5 Experiment 5: Train with feature extractor

For the last experiment we considered a different learning rate and applied Wav2Vec's feature extractor.

Hypothesis: Get better results by fine-tuning the feature extractor.
Set up: audio length: 5, learning rate: 1e-5, batch size: 42, epochs: 3, train length: 110030, alpha: 0.5, beta: 5.0
Results: WER: 68.593, beam search WER: 68.064, LM: 73.656
Conclusions: There is room for improvement in the language model.
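Beam-search decoding with a KenLM model on top of the CTC output can be sketched, for instance, with the pyctcdecode package; the character vocabulary, the KenLM model path and the choice of pyctcdecode are assumptions, while alpha and beta mirror the set-up above.

```python
# Sketch of KenLM-weighted CTC beam search (random logits, placeholder LM path).
import numpy as np
from pyctcdecode import build_ctcdecoder

vocab = [""] + list("' ABCDEFGHIJKLMNOPQRSTUVWXYZ")   # "" is the CTC blank token
decoder = build_ctcdecoder(
    vocab,
    kenlm_model_path="lyrics_lm.arpa",  # placeholder KenLM language model
    alpha=0.5,                          # language model weight
    beta=5.0,                           # word insertion bonus
)

logits = np.random.randn(200, len(vocab))  # (time, vocab) acoustic scores from the model
print(decoder.decode(logits))
```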

To top

7. Web Application

To show the results of our project, we additionally deployed a web application.
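A minimal sketch of how such a web application could expose the pipeline is shown below, using Flask; the route, the upload field and the transcribe_chunk placeholder (compare the architecture sketch in section 4) are illustrative and not the deployed app's code.

```python
# Sketch of a lyrics-transcription endpoint (illustrative, not the deployed app).
import tempfile

import torchaudio
from flask import Flask, jsonify, request

app = Flask(__name__)

def transcribe_chunk(waveform, sample_rate):
    """Placeholder for the separation + transcription pipeline (see section 4)."""
    raise NotImplementedError

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # Persist the uploaded song to a temporary file and load the waveform.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        request.files["song"].save(tmp.name)
        waveform, sample_rate = torchaudio.load(tmp.name)
    return jsonify({"lyrics": transcribe_chunk(waveform, sample_rate)})
```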

To top

8. Conclusions

As mentioned before, recognition tasks in music remain complex. We could confirm this for our lyrics recognition task in particular, for the following reasons: now as then, the community laments the lack of well-structured, aligned, large data sets for music information retrieval tasks. Furthermore, little literature was available for reference. As a consequence, we had to address the challenge of finding an appropriate dataset and model architecture. We found it particularly hard not only to understand potential model architectures but also to prepare the data appropriately. A particular takeaway was the finding that freezing features inside a pretrained model is necessary. We believe that our suggestion is a powerful solution for lyrics recognition. However, its high computational cost in terms of time and money is evident, and the implementation remains to be optimized.

To top

9. Imagine one month more...

Based on our findings we consider the following scenarios worth further investigation:

  • Train end-to-end after unfreezing Demucs
  • Train more epochs
  • Faster implementation of beam search
  • Train with different models

To top

10. Next Steps

In a broader sense, and in the context of music information retrieval tasks in general, further research could be done on:

  • Melody extraction
  • Chord transcription
  • Summary of the lyrics
  • Contribution to larger datasets of high quality

To top

11. References

https://towardsdatascience.com/wav2vec-2-0-a-framework-for-self-supervised-learning-of-speech-representations-7d3728688cae

https://ieeexplore.ieee.org/abstract/document/5179014

https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1318.pdf

https://ieeexplore.ieee.org/document/6682644?arnumber=6682644

https://www.researchgate.net/publication/42386897_Automatic_Recognition_of_Lyrics_in_Singing

https://europepmc.org/article/med/20095443

https://asmp-eurasipjournals.springeropen.com/articles/10.1155/2010/546047

https://arxiv.org/abs/2102.08575

https://github.com/facebookresearch/demucs

http://ismir2018.ircam.fr/doc/pdfs/35_Paper.pdf

https://wandb.ai/site

https://cloud.google.com/

https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/

https://colab.research.google.com/notebooks/intro.ipynb?hl=en

https://pytorch.org/docs/stable/torch.html

https://transactions.ismir.net/articles/10.5334/tismir.30/

https://distill.pub/2017/ctc/

https://medium.com/descript/challenges-in-measuring-automatic-transcription-accuracy-f322bf5994f

https://www.music-ir.org/mirex/abstracts/2020/RB1.pdf