The goal of the project is to build a text classification model on song lyrics and predict the artist from a piece of text.
Let's copy a few lines of a famous Frank Sinatra song from a website and check whether the model predicts the singer correctly.
Result: With a probability of 97%, the model predicts that the singer of the chosen song is Frank Sinatra! 👏
See more results here.
- Choosing some artists from MetroLyrics.
- Web scraping: downloading the URLs of all songs by the chosen artists and extracting the lyrics using the Requests module, regular expressions, and BeautifulSoup.
- Constructing text corpus (a list of strings) and labels.
- Cleaning the text with the Natural Language Toolkit (NLTK) or spaCy. Both text cleaning methods are implemented in classification_model.py.
- Converting the text corpus into a numerical matrix using the Bag of Words (BoW) method.
- Normalizing the counts with Term Frequency-Inverse Document Frequency (TF-IDF).
- Applying the Naive Bayes algorithm for multinomially distributed data (MultinomialNB), with TF-IDF and MultinomialNB combined in a pipeline.
- Exporting the code from Jupyter to a Python file and creating a pipeline for building a CLI.
- Creating a word cloud of the most frequent words in the songs of the chosen artists.
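
The vectorization and classification steps above (BoW, TF-IDF, MultinomialNB in a pipeline) can be sketched roughly as follows. This is a minimal illustration, not the code in classification_model.py: it assumes scikit-learn, splits the BoW and TF-IDF steps into `CountVectorizer` and `TfidfTransformer` to mirror the list, and uses a tiny made-up corpus with hypothetical labels `artist_a` and `artist_b` in place of scraped lyrics.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder corpus: invented strings, NOT real lyrics.
# In the project, this would be the cleaned text scraped per artist.
corpus = [
    "moon river night dream",
    "dream of the moon tonight",
    "city lights loud guitar",
    "loud guitar on the city street",
]
# Hypothetical artist labels, one per corpus entry.
labels = ["artist_a", "artist_a", "artist_b", "artist_b"]

# BoW counts -> TF-IDF weighting -> multinomial Naive Bayes.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(corpus, labels)

# Predict the artist for a new piece of text and get class probabilities.
snippet = ["a dream under the moon"]
predicted = model.predict(snippet)[0]
probabilities = dict(zip(model.classes_, model.predict_proba(snippet)[0]))

print(predicted)
print(probabilities)
```

With real data, `predict_proba` is what would yield a figure such as the probability reported for the Frank Sinatra example; on a toy corpus this small the numbers are not meaningful.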