coldplay_sentiment_analysis

Sentiment Analysis using the Genius API

A simple how-to on retrieving lyrics with the lyricsgenius library (a Python client for the Genius API), tokenizing the lyrics into words, and analyzing their sentiment. Based on this tutorial on TowardsDataScience.

We need to first create code that will be used to:

  1. search the Genius API for artist, title, and lyrics information,

  2. create a dataframe for artist, title, and lyrics,

  3. clean the lyrics of words/information that isn’t relevant to them,

  4. create a function that splits the lyrics into words that can be used to understand the sentiment of each song,

  5. remove stopwords and lemmatize each word, and finally

  6. analyze the sentiment of each song's lyrics and plot the results on a scatter plot.

What does that look like in code? Save your script as a Python file; I named mine script.py.

import lyricsgenius as genius
import pandas as pd
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('wordnet')


def search_data(query, n, access_token):
    """
    Search the Genius API for an artist and return a dataframe with the artist, title, and lyrics of their n most popular songs.
    """

    api = genius.Genius(access_token)

    list_lyrics = []
    list_title = []
    list_artist = []
    #list_album = []
    #list_year = []

    artist = api.search_artist(query, max_songs=n, sort='popularity')
    songs = artist.songs
    for song in songs:
        list_lyrics.append(song.lyrics)
        list_title.append(song.title)
        list_artist.append(song.artist)
        #list_album.append(song.album)
        #list_year.append(song.year)

    df = pd.DataFrame({'artist': list_artist, 'title': list_title, 'lyric': list_lyrics})

    return df


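def clean_lyrics(df, column):
    """
    This function cleans the lyrics of words that aren't important to understanding the sentiment of the lyrics.
    """
    # work on a copy so the caller's dataframe is left untouched
    df = df.copy()
    df[column] = df[column].str.lower()
    # remove section labels; regex=True is spelled out because pandas 2.0 made literal matching the default
    df[column] = df[column].str.replace(r"verse |[123]|chorus|bridge|outro", "", regex=True)
    # remove the square brackets left behind by the labels (literal match)
    df[column] = df[column].str.replace("[", "", regex=False).str.replace("]", "", regex=False)
    df[column] = df[column].str.replace(r"instrumental|intro|guitar|solo", "", regex=True)
    df[column] = df[column].str.replace("\n", " ", regex=False).str.replace(r"[^\w\d'\s]+", "", regex=True)
    df[column] = df[column].str.replace("efil ym fo flah", "", regex=False)
    df[column] = df[column].str.strip()

    return df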


def lyrics_to_words(document):
    """
    This function splits the text of lyrics to single words, removing stopwords and doing the lemmatization to each word
    """
    stop_words = set(stopwords.words('english'))
    exclude = set(string.punctuation)
    lemma = WordNetLemmatizer()
    stopwordremoval = " ".join([i for i in document.lower().split() if i not in stop_words])
    punctuationremoval = ''.join(ch for ch in stopwordremoval if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punctuationremoval.split())
    return normalized
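
To see what the cleaning step does to a single lyric, here is a minimal sketch that applies the same regular-expression patterns to a plain string with Python's re module instead of pandas. The helper name clean_one_lyric and the sample lyric are made up for illustration, and the final whitespace collapse is an extra step that clean_lyrics does not perform.

```python
import re


def clean_one_lyric(text):
    """Plain-string version of the clean_lyrics steps (illustrative helper, not part of script.py)."""
    text = text.lower()
    # drop section labels such as "verse 1", "chorus", "bridge", "outro"
    text = re.sub(r"verse |[123]|chorus|bridge|outro", "", text)
    # drop instrumentation notes
    text = re.sub(r"instrumental|intro|guitar|solo", "", text)
    # newlines become spaces, then everything except word characters,
    # apostrophes, and whitespace is removed (this also strips the brackets)
    text = text.replace("\n", " ")
    text = re.sub(r"[^\w'\s]+", "", text)
    # extra step for readability: collapse runs of whitespace
    return " ".join(text.split())


sample = "[Verse 1]\nThe stars are bright tonight\n[Chorus]\nOh, don't let go!"
print(clean_one_lyric(sample))
# the stars are bright tonight oh don't let go
```

Because the patterns have no word boundaries, they also match inside longer words (for example, "verse " inside "universe "), so it is worth spot-checking the cleaned lyrics before trusting the scores.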
    

Next, in a Jupyter Notebook, we will continue building the analysis. I decided to use Coldplay lyrics, because they’re my favourite! But you can choose any artist that can be found on the Genius API.

#libraries used to extract, clean and manipulate the data
from script import *
import pandas as pd
import numpy as np
import string

#plotting graph libraries
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # this style was called 'seaborn' before matplotlib 3.6

#library used to count the frequency of words/vectorizer 
from sklearn.feature_extraction.text import CountVectorizer

#the sentiment analysis model, tokenization
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import word_tokenize
import nltk.data
nltk.download('vader_lexicon')
nltk.download('punkt')

Here we imported our libraries and linked to our script.py code. We included libraries for pandas, numpy, and some plotting libraries, as well as important libraries for word vectorization and the sentiment model.

Next, we want to connect to the Genius API. You will need an access token from Genius to use their services; this information can be found here.
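
If you would rather not paste the token into the notebook, one option is to read it from an environment variable. This is only a sketch: the helper load_genius_token and the variable name GENIUS_ACCESS_TOKEN are my own choices for illustration, not something the original tutorial uses.

```python
import os


def load_genius_token(env_var="GENIUS_ACCESS_TOKEN"):
    """Return the Genius token from the environment, or raise a clear error."""
    # env_var is an assumed name; use whatever variable you exported
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Genius access token")
    return token


# in the notebook, this would replace the hard-coded string:
# access_token = load_genius_token()
```

This keeps the token out of the notebook file, which helps if the notebook ends up in a public repo.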

#include here the access token from the Genius API and plug in artist info and how many songs you want to include 

access_token = "insert your access token key here"
df0 = search_data('Coldplay',10,access_token)

When the above code is run, it will retrieve the 10 most popular Coldplay songs from the API and store them in a dataframe. Next, we want to clean the lyrics and save them to a csv file.

#cleaning and transforming the data using functions created in the script.py file
df = clean_lyrics(df0,'lyric')
#filter data to use songs that have lyrics
df = df[df['lyric'].notnull()]
#Save the data into a csv file
df.to_csv('lyrics.csv',index=False)
df.head(10)

Next, we want to generate a word list from the lyrics that can be used to identify the sentiment of the song. To do this, we define a helper that keeps only the unique words in a list, then iterate through the lyrics and store each song's unique words in a new column of the existing dataframe.

#here is where we create the word list to use for sentiment analysis 

def unique(list1):
    # initialize an empty list
    unique_list = []
    # traverse all elements
    for x in list1:
        # check whether x is already in unique_list; if not, append it
        if x not in unique_list:
            unique_list.append(x)
    # return the unique_list
    return unique_list

#this stores the unique words of each lyrics into a new column called words
words = []
#iterate through each lyric and split unique words, appending the result into the words list
df = df.reset_index(drop=True)
for word in df['lyric'].tolist():
    words.append(unique(lyrics_to_words(word).split()))
#create the new column with the information of words lists
df['words'] = words
df.head()
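
The unique helper above scans the growing list once per word, which is fine for ten songs. For larger collections, a shorter alternative is dict.fromkeys, which drops duplicates while keeping first-seen order. The function name unique_ordered and the sample words are only for illustration.

```python
def unique_ordered(words):
    """Return the words with duplicates removed, keeping first-seen order."""
    # dict keys are unique and (since Python 3.7) keep insertion order
    return list(dict.fromkeys(words))


print(unique_ordered(["sky", "full", "of", "sky", "stars", "of"]))
# ['sky', 'full', 'of', 'stars']
```

It should be a drop-in replacement for unique in the loop that builds the words column.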

Now, we want to filter out any leftover words that are not really part of the lyrics. These could be references to guitar solos, intro/outro labels, or the singer's name. I removed the references to sorting by year because, as of this writing, the API no longer seems to return that information; I was only able to retrieve the artist, the song title, and the lyrics.

The code below defines a few extra stop words (here, I excluded Chris Martin and Beyonce), counts the frequency of each remaining word, and stores the counts in a new dataframe. The counts are then summed per word and saved to a new csv file called words.csv.

#Create a new dataframe of all the words used in lyrics
set_words = []
#Iterate through each song's word list and collect the words into one list
for i in df.index:
    for word in df['words'].iloc[i]:
        set_words.append(word)
        #set_year.append(df['year'].iloc[i])
#create the new data frame with the information of words
words_df = pd.DataFrame({'words':set_words})
# here I define extra stop words in case the clean_lyrics function does not remove all of them
stop_words = ['chris','martin','beyonce']
# count the frequency of each word that isn't in the stop words
cv = CountVectorizer(stop_words=stop_words)
#Create a dataframe called data_cv to store the number of times each word was used
text_cv = cv.fit_transform(words_df['words'].iloc[:])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
data_cv = pd.DataFrame(text_cv.toarray(),columns=cv.get_feature_names_out())
#data_cv['year'] = words_df['year']
#here I create a dataframe that sums the occurrence frequency of each word
vect_words = data_cv.sum().T
vect_words = vect_words.reset_index(level=0).rename(columns ={'index':'words'})
vect_words = vect_words.rename_axis(columns='')
#Save the data into a csv file
vect_words.to_csv('words.csv',index=False)
vect_words = vect_words[['words']]
vect_words
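
As a cross-check on the CountVectorizer result, the same tally can be sketched with collections.Counter from the standard library. Note that each song's word list was already reduced to unique words, so the count for a word is really the number of songs it appears in. The function name count_words and the sample word lists are made up for illustration.

```python
from collections import Counter


def count_words(word_lists, stop_words=()):
    """Tally how many word lists contain each word, skipping stop words."""
    stop = set(stop_words)
    counts = Counter()
    for words in word_lists:
        counts.update(w for w in words if w not in stop)
    return counts


# three pretend songs, each already reduced to its unique words
songs = [["yellow", "star", "shine"], ["star", "sky", "chris"], ["sky", "star"]]
counts = count_words(songs, stop_words=["chris", "martin", "beyonce"])
print(counts.most_common(2))
# [('star', 3), ('sky', 2)]
```

The numbers may differ slightly from the CountVectorizer output, since its default tokenizer ignores one-character tokens and splits on apostrophes, while this sketch counts the words exactly as given.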

Next, the cool part. Now that we have all this data, we want to understand the sentiment of each lyric. To do this, we use NLTK's VADER model (SentimentIntensityAnalyzer), which gives each lyric four scores: negative, neutral, positive, and compound. The first three are the proportions of the text that fall into each category, while compound combines the valence of the individual words into a single normalized score between -1 and +1. The closer the compound score is to +1, the more positive the sentiment; the closer it is to -1, the more negative.

#create lists to store the different scores for each song
#the compound score combines the pos/neg valence: closer to +1 means generally more positive,
#and closer to -1 means generally more negative.
negative = []
neutral = []
positive = []
compound = []
#here we initialize the model
sid = SentimentIntensityAnalyzer()
#iterate for each row of lyrics and append the scores
for i in df.index:
    scores = sid.polarity_scores(df['lyric'].iloc[i])
    negative.append(scores['neg'])
    neutral.append(scores['neu'])
    positive.append(scores['pos'])
    compound.append(scores['compound'])
#create 4 columns to the main data frame for each score
df['negative'] = negative
df['neutral'] = neutral
df['positive'] = positive
df['compound'] = compound
df.head(10)

The scores that were generated were then appended to the dataframe in new columns entitled negative, neutral, positive, and compound.
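
To get a feel for the compound score, here is a small sketch of how a raw valence sum gets squashed into the range between -1 and +1. As I understand VADER's reference implementation, it uses x / sqrt(x² + alpha) with alpha = 15, so treat that constant as an assumption rather than a guarantee. The label helper and its 0.05 cutoffs follow the commonly cited convention and are not part of the original notebook.

```python
import math


def normalize_compound(valence_sum, alpha=15):
    """Squash a raw valence sum into (-1, 1); alpha=15 is the assumed VADER default."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def label(compound, threshold=0.05):
    """Turn a compound score into a coarse label using the +/-0.05 convention."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# hypothetical raw sums, just to show the shape of the curve
for raw in (-6.0, 0.0, 1.5, 6.0):
    score = normalize_compound(raw)
    print(f"{raw:+.1f} -> {score:+.3f} ({label(score)})")
```

In the notebook, something like df['compound'].apply(label) would add a coarse label per song, if you want one.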

Next, why don’t we try plotting this information? We can do that by generating a scatterplot in matplotlib, one of the libraries we imported earlier.

for name, group in df.groupby('title'):
    plt.scatter(group['positive'],group['negative'],label=name)
#draw the legend once, after all songs have been plotted
plt.legend(fontsize=10)

plt.xlim([-0.05,0.7])
plt.ylim([-0.05,0.7])

plt.title("Song sentiment")
plt.xlabel('Positive sentiment')
plt.ylabel('Negative sentiment')
plt.show()

This plot shows the positive and negative sentiment scores of each of the ten songs we retrieved from the API.

And that's it! You've created a simple sentiment analyzer for lyrics leveraging the Genius API. After this, the options are endless. My next goal is to find a way to sync the song sentiment to a light show, with each coloured light representing an emotion.

A big thank you to Cristobal Veas who produced the code tutorial for this.

script.py and the Jupyter Notebook with the code are available in this repo. Happy coding.