Topic modeling and similarity metrics based on NLP analysis of song lyrics.

Lyrical Exercise

A Data Science Project by Eric Nadasi


Inspiration:

I have loved music my entire life, and am always on the lookout for new songs. While using Spotify one day, I became curious as to why song recommenders (not just Spotify's, but also others such as Apple's and Pandora's) seem to focus only on the melody, genre, or general sound of the song you are listening to when suggesting the next one. In my opinion, the music that people connect to at the deepest level has meaningful lyrics that resonate with the listener, and a successful recommender should include songs with similar lyrical topics. So I used Natural Language Processing to build a model that does just that.

1. Project Overview

Goals:

  • To analyze topics in popular music through topic modeling, an unsupervised learning technique
  • To create a "lyrically similar song" model, which is able to show the user other songs with similar meaningful lyrics and themes

Methods:

  • Built a Natural Language Processing (NLP) pipeline to prepare lyrics data
  • Created TF-IDF model for lyrics data
  • Tested both Latent Dirichlet Allocation and Non-negative Matrix Factorization (NMF) topic modeling
  • Used the more successful NMF topic modeling to discover latent topics within the music lyrics corpus
  • Defined a cosine similarity metric to identify lyrically similar songs based on TF-IDF model
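
The last step in the list above can be illustrated with a minimal sketch. It is not the project's actual code: the song titles, lyrics, and the `most_similar` helper are hypothetical, and scikit-learn's `TfidfVectorizer` stands in for the TF-IDF model to keep the example in one library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: song title -> preprocessed lyrics.
songs = {
    "Song A": "money cash money bank dollar",
    "Song B": "cash money dollar rich bank",
    "Song C": "love heart kiss love baby",
    "Song D": "heart love baby kiss forever",
}
titles = list(songs)

# One TF-IDF row vector per song.
tfidf = TfidfVectorizer().fit_transform(list(songs.values()))

# Pairwise cosine similarity between all songs: shape (n_songs, n_songs).
similarity = cosine_similarity(tfidf)


def most_similar(title, top_n=1):
    """Return the top_n songs whose lyrics are closest to the given song."""
    idx = titles.index(title)
    ranked = similarity[idx].argsort()[::-1]  # highest similarity first
    return [titles[i] for i in ranked if i != idx][:top_n]


print(most_similar("Song A"))
```

Songs that share no vocabulary get a similarity of 0, so in this toy corpus "Song A" is matched with "Song B" rather than with either of the love songs.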

Results:

  • Most topics were expected, but some were interesting and unexpected
    • e.g., for the Hip-Hop/Rap genre, topics included money, women, relationships/love, life/death, ad-libs, and more.
  • The similar song program successfully shows the user other songs with similar lyrics and themes (unsupervised learning has no ground-truth labels, so we can't quantify its accuracy, but check it out for yourself!)

2. Method Descriptions

  • Coded in Python using several packages and technologies, including Jupyter Notebook, NumPy, Pandas, SciPy, NLTK (Natural Language Toolkit), Gensim, and scikit-learn
  • Algorithms:
    • Gensim was used to calculate Term Frequency - Inverse Document Frequency (TF-IDF) model vectors for all songs with lyrics data
      • TF-IDF assigns less weight to words that are found in more documents, making rare words more important in characterizing a song's lyrics
    • NMF for topic modeling from Scikit Learn was carried out on this TF-IDF model
      • Creates matrices showing individual words' associations with topics, and individual songs' associations with topics
    • Cosine similarity metric from Scikit Learn was used to identify lyrically similar songs, also based on the TF-IDF model

3. Instructions for Running Program

Fork the repository, and then clone it to your local machine. You should now have a repository named "lyrical-exercise" on your computer.

All necessary data files but two are in the repository, in the data folder. Follow the steps below to get the remaining two files (You only need to do this once!).

How to get the remaining two files:

  1. mxm_779k_matches.txt:
  2. mxm_dataset_FULL.txt:

Choose Preferred Genre:

Go to Line 22 of Process.py and change the global variable FILTER_GENRE to equal any of the following genres (default: Rap):

  • Blues
  • Country
  • Electronic
  • Folk
  • Jazz
  • Latin
  • Metal
  • New Age
  • Pop
  • Punk
  • Rap
  • Reggae
  • RnB
  • Rock
  • World

Run the program:

  • Enter the lyrical-exercise folder from the terminal (using the cd command), and run the command python lyrical-exercise.py
  • Follow along with the prompts in the terminal to choose program inputs

Check the Results:

  • Topic Modeling: After running the program, check the repository for a file named topic_words.txt, which has all of the lyrics organized by latent topic
  • Similar Song Recommender: This runs in the terminal until you exit