Lyrics Recognition using Deep Learning Techniques

Final project for the UPC Postgraduate Course Artificial Intelligence with Deep Learning, edition Spring 2021

Team: Anne-Kristin Fischer, Joan Prat Rigol, Eduard Rosés Gibert, Marina Rosés Gibert

Advisor: Gerard I. Gállego

GitHub repository: https://github.com/ttecles/aidl-lyrics-recognition

This is how we recommend installing the project.

Table of Contents

  1. Introduction
    1. Motivation
    2. Project Goals
    3. Milestones
  2. Data Set
  3. Working Environment
  4. General Architecture
  5. Preprocessing the data set
  6. Results and results improvement
    1. Experiment 1: First train with the full dataset
    2. Experiment 2: Overfitting with one chunk
    3. Experiment 3: Long run in Google VM 2
    4. Experiment 4: Multiple GPUs with WER
    5. Experiment 5: Train with feature extractor
  7. Web Application
  8. Conclusions
  9. Imagine one month more...
  10. Next Steps
  11. References

1. Introduction

To this day, little research has been done on music lyrics recognition, which is still considered a complex task. The problem can be broken down into two subtasks:

  1. The singing voice needs to be extracted from the song by means of source separation. What seems to be an easy task for the human brain remains a brain teaser for digital signal processing because of the complex mixture of signals.
  2. The second subtask aims to transcribe the obtained audio of the singing voice into written text. This can be thought of as a speech recognition task, and a lot of progress has been made on standard speech recognition. However, experiments with music have made evident that recognising the text of a singing voice is more complex than pure speech recognition because of the richer acoustic characteristics of singing.

Practical applications of music lyrics recognition, such as the creation of karaoke versions or music information retrieval tasks, motivate us to tackle the aforementioned challenges.

To top

1.1 Motivation

Our decision to work on a lyrics recognition task with deep learning techniques is an attempt to combine several of our personal and professional interests. All team members have a more or less professional background in the music industry, in addition to a particular interest in source separation tasks and natural language processing.

Figure 1: Our passion for music, language and deep learning combined.

To top

1.2 Project Goals

  • Extraction of the voice of a song and transcription of the lyrics with Demucs and Wav2Vec models
  • Analysis of the results
  • Deployment of a web application for lyrics extraction
  • Suggestions for further studies and investigation

To top

1.3 Milestones

To reach our goals, we set up the following milestones:

  • Find a suitable data set
  • Preprocess the data for its implementation into the model
  • Define the model
  • Implement the model
  • Train the model
  • Analyse the obtained results
  • Implement the project inside a web application
  • Make suggestions for further investigation
  • Optional: add a language model to improve the results of the transcription task

To top

2. Data Set

To train our model we opted for the DALI data set, published in 2018. It is to this day the biggest data set in the field of singing voice research that aligns audio to notes and their lyrics to high quality standards. We were granted access to the first version, DALI v1, with 5358 songs in full duration and multiple languages. For more information, please also see this article, published by the International Society for Music Information Retrieval. This is a graphical representation of the DALI data set:

Figure 2: Alignment of notes and text in DALI data set based on triples of {time (start and duration), note, text}

Figure 3: Horizontal granularity in DALI data set where paragraphs, lines, words and notes are interconnected vertically
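For orientation, the sketch below shows how the DALI annotations can be loaded with the data set's own Python package; the local path is a placeholder and the exact annotation keys follow the package's documented format, so treat them as assumptions.

```python
# Minimal sketch of loading DALI v1 annotations with the DALI Python package
# (the dataset path is a placeholder; exact annotation keys are assumptions).
import DALI as dali_code

dali_data_path = "/path/to/DALI_v1.0/"  # placeholder location of the downloaded data set
dali_data = dali_code.get_the_DALI_dataset(dali_data_path, skip=[], keep=[])

entry = next(iter(dali_data.values()))          # pick any song entry
print(entry.info)                               # song-level metadata (artist, title, scores, ...)
print(entry.annotations["annot"]["words"][0])   # first aligned word: time span and text
```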

To top

3. Working Environment

To develop the base model with 395 million parameters, we used Google Colab, as it was fast and easy for us to access. For visualization of the results we used Weights & Biases (wandb). For development we used a local environment. For the full training with 580 million parameters we then switched to a VM instance with one GPU (Tesla K80) and 4 CPUs on Google Cloud. To improve performance we switched again to a VM with 4 GPUs (GeForce RTX 3090). PyTorch is used as the overall framework.
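As a hedged illustration of the multi-GPU setup, the snippet below shows one common PyTorch pattern, nn.DataParallel; whether the project used DataParallel or DistributedDataParallel is an assumption of this sketch.

```python
# Sketch: replicate the model across all visible GPUs with nn.DataParallel
# (the actual multi-GPU strategy used in the project may differ).
import torch
import torch.nn as nn

def to_available_devices(model: nn.Module) -> nn.Module:
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # splits each batch across the GPUs
    return model.to("cuda" if torch.cuda.is_available() else "cpu")
```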

To top

4. General Architecture

Little research has been done so far on music lyrics recognition in general, and most existing approaches combine spectrograms with CNNs. In the context of this project we explore a potentially high-performing alternative by combining two strong models: the Demucs model for the source separation task and a Wav2Vec 2.0 model for the transcription task. Demucs is currently the best performing waveform-based model for source separation and so far the only waveform-based model that can compete with the more commonly used spectrogram-based models. Wav2Vec is considered the current state-of-the-art model for automatic speech recognition. Additionally, we implemented KenLM as a language model on top to improve the output of the transcription task.

As the final model implementation we opted for the concatenation of a pretrained Demucs and a pretrained Wav2Vec model to perform end-to-end training. The loss is computed by comparing the ground-truth lyrics against the lyrics obtained from the Wav2Vec output. Demucs consists of a convolutional encoder, an LSTM and a convolutional decoder. Wav2Vec 2.0 consists of convolutional layers followed by a transformer and works at character level. On the final part of the Wav2Vec output we apply the CTC (Connectionist Temporal Classification) algorithm, which collapses repeated characters in the prediction, since the Wav2Vec model predicts a character every few milliseconds.

Figure 4: Overall model architecture with detailed insights into the Demucs and Wav2Vec architectures
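To make the data flow concrete, here is a minimal inference sketch of the chained pipeline, assuming the Hugging Face Wav2Vec2ForCTC interface and using a placeholder separate_vocals helper to stand in for the pretrained Demucs separator; the project's actual modules and training loop differ.

```python
# Sketch of the chained pipeline: Demucs-style separation -> Wav2Vec 2.0 -> CTC decoding.
# `separate_vocals` is a placeholder, and the checkpoint name is illustrative.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def separate_vocals(mixture: torch.Tensor) -> torch.Tensor:
    """Placeholder for the pretrained Demucs separator: takes a song chunk
    (channels, samples) and returns the estimated vocal stem (samples,)."""
    raise NotImplementedError

def transcribe_chunk(mixture: torch.Tensor, sample_rate: int = 16_000) -> str:
    vocals = separate_vocals(mixture)                        # 1) source separation
    inputs = processor(vocals.numpy(), sampling_rate=sample_rate,
                       return_tensors="pt")                  # 2) feature extraction
    with torch.no_grad():
        logits = asr_model(inputs.input_values).logits       # 3) per-frame character logits
    pred_ids = torch.argmax(logits, dim=-1)                  # 4) greedy decoding
    return processor.batch_decode(pred_ids)[0]               # CTC collapse of repeats/blanks
```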

To top

Figure 5: CTC loss: architecture and its calculation
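For reference, this is how a CTC loss can be computed with PyTorch's built-in criterion; the shapes and vocabulary size below are arbitrary and not the project's values.

```python
# Illustrative CTC loss computation (random tensors, arbitrary sizes).
import torch
import torch.nn as nn

T, N, C = 200, 4, 32                                    # time steps, batch size, vocabulary incl. blank
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)    # per-frame character log-probabilities
targets = torch.randint(1, C, (N, 25))                  # encoded ground-truth lyrics
input_lengths = torch.full((N,), T)
target_lengths = torch.full((N,), 25)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```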

To top

5. Preprocessing the data set

Preprocessing the data set correctly for our purpose proved to be one of the major obstacles we encountered. We focused on songs in English only, that is 3491 songs in full duration. Preprocessing included removing special characters as well as negative time stamps and converting the lyrics to upper case. To make sure we would obtain meaningful results after training and to avoid cut-off lyrics, we prepared chunks. For these chunks we discarded words overlapping between consecutive chunks and cut out silent passages without voice. To make the data accessible for our model, we resampled the audio waveform to a sample rate of 44100 Hz. As alignment is done automatically in DALI and ground truth is available for only a few audio samples, we followed the authors' suggestions for the train/validation/test split. That is:

Figure 6: Suggested NCCt scores for train, validation and test

where NCCt is a correlation score that indicates how accurate the automatic alignment is; higher is better. The number of tracks refers to the whole data set, including songs in other languages, for both the first and the second version of the dataset.
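For illustration, a minimal sketch of the text normalisation and resampling steps described above; the regular expression and helper names are illustrative and do not reproduce the project's preprocessing code.

```python
# Sketch of chunk-level preprocessing: upper-case lyrics, strip special
# characters and resample the audio (illustrative, not the project's code).
import re
import torchaudio

def normalise_lyrics(text: str) -> str:
    """Upper-case the lyrics and drop everything except letters, apostrophes and spaces."""
    return re.sub(r"[^A-Z' ]+", " ", text.upper()).strip()

def resample(waveform, orig_sr: int, target_sr: int = 44_100):
    """Bring an audio chunk to the sample rate used for training."""
    if orig_sr == target_sr:
        return waveform
    return torchaudio.transforms.Resample(orig_sr, target_sr)(waveform)
```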

To top

6. Results and results improvement

6.1 Experiment 1: First train with the full dataset

When running a first training pass over the full dataset, that is 59958 chunks, we were surprised to initially obtain a negative loss. This could be explained by training on data slices containing no lyrics.

Hypothesis: Our model will output awesome lyrics predictions.
Set up: (configuration image)
Results: Our model shows weird metrics.
Conclusions: We are not sure if our model is even training.
Links: Run, Report


To top

6.2 Experiment 2: Overfitting with one chunk

To make sure our model was actually working properly, we ran a sanity check in which we tested whether the model could overfit a single chunk (a small batch). This training run also revealed how much Demucs gets corrupted by the training: the separated voice quality got worse epoch by epoch. Please see below the audio waveform at step 0 compared to step 29 for reference.

Hypothesis: Our model works if it is “able” to overfit.
Set up: (configuration image)
Results: Our model overfits.
Conclusions: Our model is working and actually training.
Links: Run, Report, Audio track

Figure: audio waveform of the separated voice at step 0 compared to step 29
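Such an overfitting sanity check can be sketched as a tiny loop that keeps fitting the same chunk and verifies the loss approaches zero; the model and batch interfaces below are placeholders, not the project's training script.

```python
# Sanity-check sketch: overfit a single chunk and watch the loss go towards zero.
# Assumes a model whose forward pass returns an object with a `.loss` attribute.
import torch

def overfit_one_chunk(model, batch, steps: int = 30, lr: float = 1e-4):
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        optimiser.zero_grad()
        loss = model(**batch).loss          # CTC loss against the chunk's lyrics
        loss.backward()
        optimiser.step()
        print(f"step {step}: loss {loss.item():.4f}")  # should decrease towards zero
```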

To top

6.3 Experiment 3: Long run in Google VM 2

We then gradually increased the batch size, using a small, controllable dataset with an NCCt score higher than 0.95, to make sure our model would still train properly.

Hypothesis: The model should train on the full set.
Set up: (configuration image)
Results: We need more power. The run crashed because of bugs. The dataset is not totally clean.
Conclusions: Use multiple GPUs.
Links: Run, Report


To top

6.4 Experiment 4: Multiple GPUs with WER

When training on 5 GPUs, we were finally able to obtain some visuals. We even managed to outperform the ConvTasNet model in this paper in terms of the initial WER.

Hypothesis: Performance will be better and we will obtain visuals (WER).
Set up: (configuration image)
Results: We obtained a better WER (69.348) than the initial WER in this paper, table 3 (WER = 75.91).
Conclusions: Performance improved and we can try to integrate the feature extractor in a follow-up experiment.
Links: Run, Test results
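The word error rate itself can be computed, for example, with the jiwer package; whether the project used jiwer is an assumption, and the strings below are made up.

```python
# Illustrative WER computation between a reference lyric line and a prediction.
import jiwer

reference = "SHE LOVES YOU YEAH YEAH YEAH"
hypothesis = "SHE LOVE YOU YEAH YEAH"
print(jiwer.wer(reference, hypothesis))  # 2 word errors over 6 reference words ≈ 0.33
```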


To top

6.5 Experiment 5: Train with feature extractor

For the last experiment we considered a different learning rate and applied Wav2Vec's feature extractor.

Hypothesis: Get better results by fine-tuning the feature extractor.
Set up: audio length: 5, learning rate: 1e-5, batch size: 42, epochs: 3, train length: 110030, alpha: 0.5, beta: 5.0
Results: WER: 68.593, beam search WER: 68.064, LM: 73.656
Conclusions: There is room for improvement in the language model.
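Beam-search decoding with a KenLM model on top of the CTC output can be sketched, for instance, with the pyctcdecode package; the character vocabulary, the KenLM model path and the choice of pyctcdecode are assumptions, while alpha and beta mirror the set-up above.

```python
# Sketch of KenLM-weighted CTC beam search (random logits, placeholder LM path).
import numpy as np
from pyctcdecode import build_ctcdecoder

vocab = [""] + list("' ABCDEFGHIJKLMNOPQRSTUVWXYZ")   # "" is the CTC blank token
decoder = build_ctcdecoder(
    vocab,
    kenlm_model_path="lyrics_lm.arpa",  # placeholder KenLM language model
    alpha=0.5,                          # language model weight
    beta=5.0,                           # word insertion bonus
)

logits = np.random.randn(200, len(vocab))  # (time, vocab) acoustic scores from the model
print(decoder.decode(logits))
```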

To top

7. Web Application

To show the results of our project, we additionally deployed a web application.
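A minimal sketch of how such a web application could expose the pipeline is shown below, using Flask; the route, the upload field and the transcribe_chunk placeholder (compare the architecture sketch in section 4) are illustrative and not the deployed app's code.

```python
# Sketch of a lyrics-transcription endpoint (illustrative, not the deployed app).
import tempfile

import torchaudio
from flask import Flask, jsonify, request

app = Flask(__name__)

def transcribe_chunk(waveform, sample_rate):
    """Placeholder for the separation + transcription pipeline (see section 4)."""
    raise NotImplementedError

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # Persist the uploaded song to a temporary file and load the waveform.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        request.files["song"].save(tmp.name)
        waveform, sample_rate = torchaudio.load(tmp.name)
    return jsonify({"lyrics": transcribe_chunk(waveform, sample_rate)})
```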

To top

8. Conclusions

As mentioned before, recognition tasks in music remain complex. We could confirm this for our lyrics recognition task in particular, for the following reasons: now as then, the community laments the lack of well-structured, aligned, large data sets for music information retrieval tasks. Furthermore, little literature was available for reference. As a consequence, we had to address the challenge of finding an appropriate dataset and model architecture. We found it particularly hard not only to understand potential model architectures but also to prepare the data appropriately. A particular takeaway was the finding that freezing features inside a pretrained model is necessary. We believe that our suggestion is a powerful solution for lyrics recognition. However, its high computational cost in terms of time and money is evident, and the implementation remains to be optimized.

To top

9. Imagine one month more...

Based on our findings we consider the following scenarios worth further investigation:

  • Train end-to-end after unfreezing Demucs
  • Train more epochs
  • Faster implementation of beam search
  • Train with different models

To top

10. Next Steps

In a broader sense, and in the context of music information retrieval tasks in general, further research could be done on:

  • Melody extraction
  • Chord transcription
  • Summary of the lyrics
  • Contribution to larger datasets of high quality

To top

11. References

https://towardsdatascience.com/wav2vec-2-0-a-framework-for-self-supervised-learning-of-speech-representations-7d3728688cae

https://ieeexplore.ieee.org/abstract/document/5179014

https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1318.pdf

https://ieeexplore.ieee.org/document/6682644?arnumber=6682644

https://www.researchgate.net/publication/42386897_Automatic_Recognition_of_Lyrics_in_Singing

https://europepmc.org/article/med/20095443

https://asmp-eurasipjournals.springeropen.com/articles/10.1155/2010/546047

https://arxiv.org/abs/2102.08575

https://github.com/facebookresearch/demucs

http://ismir2018.ircam.fr/doc/pdfs/35_Paper.pdf

https://wandb.ai/site

https://cloud.google.com/

https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/

https://colab.research.google.com/notebooks/intro.ipynb?hl=en

https://pytorch.org/docs/stable/torch.html

https://transactions.ismir.net/articles/10.5334/tismir.30/

https://distill.pub/2017/ctc/

https://medium.com/descript/challenges-in-measuring-automatic-transcription-accuracy-f322bf5994f

https://www.music-ir.org/mirex/abstracts/2020/RB1.pdf