MATLAB script to analyze song lyrics word for word by genre. Can also be used to predict an unknown song's genre.
See Published_Code_Full_Execution.pdf for a walkthrough of the program execution.
Here's an explanation of the project that I wrote for an English class:
For my MATLAB project in ENGR122, I partnered up with my roommate, Ebed, and we decided to analyze the frequencies of certain words across different music genres. The end goal was to see if we could find a characteristic fingerprint of specific words representing each genre.
The first step was to download the lyrics from a bunch of songs from different genres. We (completely arbitrarily) predicted that it would probably be good to have a few thousand songs for each genre. After a few Google searches, we realized the only way we could get the lyrics from so many songs was to download them ourselves. Individually! Because there was no way it was going to be feasible to manually copy and paste over 10,000 songs, we decided we'd have to write a program to do it for us. I downloaded the RoboBrowser library for Python and started exploring web scraping. I was able to set up a program so that we could supply it with lists of artists sorted by genre and it would return a text file for each genre containing all of the lyrics for every song in that genre. We then imported these files into MATLAB to actually start our project.
I'm not going to dive too deep into the algorithm, but I'll scratch the surface a little bit (if you want to learn more, you can look at the source code). After importing all of the words, we created an array that holds every unique word in all of the lyrics we downloaded. We then created a vector for each genre that holds, for each unique word, the percentage of that genre's text file made up by that word. The indices of the unique-word list and the percent vectors align so that each percent corresponds to the unique word at its index. For example, if the first word is "potato", the first percent in every genre's vector will represent the number of times the word "potato" appears in that genre's lyrics divided by the total number of words in that genre's lyrics. After this step, we had a set of "fingerprints", one for each genre.
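The fingerprint construction described above can be sketched roughly as follows. This is a small Python illustration, not the project's MATLAB source: the function names (`tokenize`, `build_vocabulary`, `fingerprint`), the lowercase letters-only tokenization rule, and the tiny sample lyrics are all assumptions made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (assumed tokenization rule)."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(genre_lyrics):
    """Return a sorted list of every unique word across all genres."""
    words = set()
    for text in genre_lyrics.values():
        words.update(tokenize(text))
    return sorted(words)

def fingerprint(text, vocabulary):
    """Fraction of the text made up by each vocabulary word, aligned by index."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against empty text
    return [counts[word] / total for word in vocabulary]

# Made-up sample data, only for illustration
genre_lyrics = {
    "country": "potato truck road potato",
    "pop": "love love baby potato",
}
vocabulary = build_vocabulary(genre_lyrics)
fingerprints = {
    genre: fingerprint(text, vocabulary)
    for genre, text in genre_lyrics.items()
}
```

Because every vector is built from the same `vocabulary` list, position `i` in each genre's vector always refers to the same word, which is what makes the vectors comparable later.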
We then wanted to test out our representation to see if we could give it a random song and have it categorize that song into a certain genre. To do this, we created a similar percent vector for the unknown song using the same list of unique words. In other words, for every unique word in all of the genres, we checked the percent of that word in the unknown song and put that value into the song's percent vector at the same index as the word. Obviously, most of the song's percents would be zero because there were nearly fifty thousand unique words total and the song only contains a small fraction of them. Once we had this new "fingerprint" of the song, we could compare it to each genre's "fingerprint" to find the closest one. We did this using geometry by assigning a point in 50,000-dimensional space to each "fingerprint" and finding the shortest of the distances between the song's point and each of the genres' points. If you're having trouble visualizing this, imagine the same scenario if there were only two unique words we were comparing. Each word's percentage represents one of the axes, and the song and each genre fall somewhere on the graph as a point based on their percentages of each of the two words. Then the genre that the song is geometrically closest to will be the one with the most similar percentages of the two words. Now, if we extend this into 50,000-dimensional space, it works exactly the same way. Each dimension/axis represents the percentage of one unique word. The song and genre points all fall somewhere in the space. The genre point closest to the song's point will have the most similar percentages and should theoretically be the genre for that song.
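The two-word picture above can be written out as a short Python sketch. It is an illustration rather than the project's MATLAB code: it assumes straight-line (Euclidean) distance, the helper names `euclidean_distance` and `closest_genre` are hypothetical, and the numbers are made up rather than taken from our data.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_genre(song_vector, genre_vectors):
    """Return the genre whose fingerprint lies nearest to the song's."""
    return min(
        genre_vectors,
        key=lambda genre: euclidean_distance(song_vector, genre_vectors[genre]),
    )

# Each vector is (fraction of "potato", fraction of "love"); values are invented
genre_vectors = {
    "country": [0.30, 0.05],
    "pop": [0.02, 0.40],
}
song_vector = [0.25, 0.10]

prediction = closest_genre(song_vector, genre_vectors)
```

The same two functions work unchanged when the vectors have 50,000 entries instead of two, since the distance is just summed over every axis.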
We found that this was accurate roughly half the time. We were able to increase this percentage a little bit by filtering out some of the most common words in the English language, such as "the" or "and". We found that these words were popular in every genre, but the slight variation in their frequency could give us inaccurate results because these words are not characteristic of any one genre. However, we also found that sometimes we would get the correct genre without any filtering and then the wrong genre with filtering. Ultimately, we concluded that while some words were definitely more popular in certain genres, it isn't reliable to try to categorize a song based solely on its lyrics. Songs are assigned a genre based on a variety of factors, such as rhythm and tempo and all sorts of other things we did not account for.
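The common-word filtering mentioned above might look something like this in Python. This is a hedged sketch, not the project's MATLAB code: the `COMMON_WORDS` set is a short illustrative list rather than the one we used, `filtered_fractions` is a hypothetical name, and removing the words before computing the totals is an assumption about the order of steps.

```python
from collections import Counter

# Illustrative list only; not the list used in the project
COMMON_WORDS = {"the", "and", "a", "to", "of"}

def filtered_fractions(text, common_words=COMMON_WORDS):
    """Word fractions computed after dropping very common words."""
    tokens = [w for w in text.lower().split() if w not in common_words]
    total = len(tokens) or 1  # guard against text made only of common words
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}

fractions = filtered_fractions("The potato and the truck and the potato")
```

With the common words gone, the remaining fractions are spread over only the words that might actually distinguish one genre from another.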