In this lab, we'll learn how to tokenize and vectorize text documents, create and use a Bag of Words, and identify words unique to individual documents using TF-IDF Vectorization.
- Tokenize a corpus of words and identify the different choices to be made while parsing them.
- Use a Count Vectorization strategy to create a Bag of Words
- Use TF-IDF Vectorization with multiple documents to identify words that are important/unique to certain documents.
Run the cell below to import everything necessary for this lab.
import pandas as pd
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
np.random.seed(0)
In this lab, we'll be working with 20 different documents, each containing song lyrics from either Garth Brooks or Kendrick Lamar albums.
The songs are contained within the `data` subdirectory, located in the same folder as this lab. Each song is stored in a single file, with files ranging from `song1.txt` to `song20.txt`.
To make it easy to read in all of the documents, use a list comprehension to create a list containing the name of every single song file in the cell below.
filenames = None
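One possible approach, as a sketch (it assumes the files are named `song1.txt` through `song20.txt`, as described above):

```python
# Build the list of song filenames with a list comprehension,
# since we know the files run from song1.txt to song20.txt
filenames = ['song{}.txt'.format(i) for i in range(1, 21)]
print(filenames[:3])  # ['song1.txt', 'song2.txt', 'song3.txt']
```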
Next, let's import a single song to see what our text looks like so that we can make sure we clean and tokenize it correctly.
In the cell below, read in and print out the lyrics from `song11.txt`. Use vanilla Python; no pandas needed.
# read in song11.txt here
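A minimal sketch using vanilla Python (it assumes the `data/` subdirectory described above, and stores the raw lines in `test_song` since that variable is referenced later in the lab):

```python
# Read the raw lines of song11.txt and print them out
with open('data/song11.txt') as f:
    test_song = f.readlines()

for line in test_song:
    print(line, end='')  # lines already end with '\n'
```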
Before we can create a Bag of Words or vectorize each document, we need to clean it up and split each song into an array of individual words. Computers are very particular about strings. If we tokenized our data in its current state, we would run into the following problems:
- Counting things that aren't actually words. In the example above, `"[Kendrick]"` is a note specifying who is speaking, not a lyric contained in the actual song, so it should be removed.
- Punctuation and capitalization would mess up our word counts. To the Python interpreter, `love`, `Love`, `Love?`, and `Love\n` are all unique words, and would all be counted separately. We need to remove punctuation and capitalization so that all words will be counted correctly.
Consider the following sentences from the example above:
"Love, let's talk about love\n", 'Is it anything and everything you hoped for?\n'
After tokenization, this should look like:
['love', "let's", 'talk', 'about', 'love', 'is', 'it', 'anything', 'and', 'everything', 'you', 'hoped', 'for']
Tokenization is pretty tedious if we handle it manually, and would probably make use of Regular Expressions, which is outside the scope of this lab. In order to keep this lab moving, we'll use a library function to clean and tokenize our data so that we can move onto vectorization.
Tokenization is a required step for just about any Natural Language Processing (NLP) task, so great industry-standard tools exist to handle it for us, letting us spend our time on more important tasks without getting bogged down hunting for every special symbol or punctuation mark in a massive dataset. For this lab, we'll make use of the tokenizer in the amazing `nltk` library, which is short for Natural Language Tool Kit.
NOTE: NLTK requires some extra downloads the first time certain methods are used. If `nltk` throws an error about needing to install additional packages, follow the instructions in the error message to install the dependencies, and then rerun the cell.
In this case, you may need to run the following code:
import nltk
nltk.download('punkt')
to download the Punkt sentence tokenizer.
Before we tokenize our songs, we'll do a small bit of manual cleaning. In the cell below, write a function that removes any line containing `['artist names']`, to ensure that our song files contain only lyrics that are actually in the song. For the lines that remain, make every word lowercase, remove newline characters `\n`, and remove any of the following punctuation marks: `",.'?!"`

Test the function on `test_song` to show that it has successfully removed `'[Kendrick Lamar:]'` and other instances of artist names from the song and returned it.
def clean_song(song):
pass
song_without_brackets = None
song_without_brackets
print(song_without_brackets)
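One possible implementation, as a sketch. It assumes `test_song` is a list of raw lines (as read in above) and treats any line containing a square bracket as an artist-name annotation:

```python
def clean_song(song):
    # Remove artist-name lines, lowercase everything, and strip newlines/punctuation
    cleaned_song = []
    for line in song:
        # Skip speaker annotations such as '[Kendrick Lamar:]'
        if '[' in line:
            continue
        line = line.lower().replace('\n', '')
        for mark in ",.'?!":
            line = line.replace(mark, '')
        cleaned_song.append(line)
    return cleaned_song

song_without_brackets = clean_song(test_song)
print(song_without_brackets)
```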
Great. Now, write a function that takes in a song that has had its brackets removed, joins all of the lines into a single string, and then uses `word_tokenize()` (from `nltk`) on it to get a fully tokenized version of the song. Test this function on `song_without_brackets` to ensure that the function works.
def tokenize(song):
pass
tokenized_test_song = None
tokenized_test_song[:10]
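A possible sketch, relying on `word_tokenize`, which was imported from `nltk` at the top of the lab:

```python
def tokenize(song):
    # Join the cleaned lines into a single string, then let nltk split it into tokens
    joined_song = ' '.join(song)
    return word_tokenize(joined_song)

tokenized_test_song = tokenize(song_without_brackets)
tokenized_test_song[:10]
```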
Great! Now that we have the ability to tokenize our songs, we can move on to vectorization.
Machine Learning algorithms don't understand strings. However, they do understand math, which means they understand vectors and matrices. By vectorizing the text, we convert the entire text into a vector, where each element in the vector represents a different word. The vector is the length of the entire vocabulary--usually every word that occurs in the English language, or at least every word that appears in our corpus. Any given sentence can then be represented as a vector in which each element counts how many times the corresponding word appears in the sentence (and is 0 otherwise).
Consider the following example:
"I scream, you scream, we all scream for ice cream."'aardvark' | 'apple' | [...] | 'I' | 'you' | 'scream' | 'we' | 'all' | 'for' | 'ice' | 'cream' | [...] | 'xylophone' | 'zebra' |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
This is called a Sparse Representation, since the vast majority of the columns will have a value of 0. Note that elements corresponding to words that do not occur in the sentence have a value of 0, while words that do appear in the sentence have a value equal to the number of times they appear.
Alternatively, we can represent this sentence as a plain old python dictionary of word frequency counts:
BoW = {
'I':1,
'you':1,
'scream':3,
'we':1,
'all':1,
'for':1,
'ice':1,
'cream':1
}
Both of these are examples of Count Vectorization. They allow us to represent a sentence as a vector, with each element in the vector corresponding to how many times that word is used.
Notice that when we vectorize a sentence this way, we lose the order that the words were in. This is the Bag of Words approach mentioned earlier. Note that sentences that contain the same words will create the same vectors, even if they mean different things--e.g. `'cats are scared of dogs'` and `'dogs are scared of cats'` would both produce the exact same vector, since they contain the same words.
In the cell below, create a function that takes in a tokenized, cleaned song and returns a Count Vectorized representation of it as a Python dictionary. Add an optional parameter called `vocab` that defaults to `None`. This way, if we are using a vocabulary that contains words not seen in the song, we can still use this function by passing it in to the `vocab` parameter.

Hint: Consider using a `set` object to make this easier!
def count_vectorize(song, vocab=None):
pass
test_vectorized = None
print(test_vectorized)
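One possible sketch, using a `set` to collect the unique words when no vocabulary is supplied:

```python
def count_vectorize(song, vocab=None):
    # Use the supplied vocabulary if there is one; otherwise build it from this song
    unique_words = set(song) if vocab is None else vocab

    # Start every word at 0, then tally each token in the song
    song_dict = {word: 0 for word in unique_words}
    for word in song:
        song_dict[word] = song_dict.get(word, 0) + 1
    return song_dict

test_vectorized = count_vectorize(tokenized_test_song)
print(test_vectorized)
```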
Great! You've just successfully vectorized your first text document! Now, let's look at a more advanced type of vectorization, TF-IDF!
TF-IDF stands for Term Frequency, Inverse Document Frequency. This is a more advanced form of vectorization that weights each term in a document by how unique it is to that document, which allows us to summarize the contents of a document using a few key words. If a word is used often across many documents, it is not unique, and therefore probably not too useful for figuring out how this document differs from the others. Conversely, if a word is used many times in a document, but rarely in the other documents we are considering, then it is likely a good indicator that this word is important to the document in question.
The formula TF-IDF uses to determine the weight of each term in a document is Term Frequency multiplied by Inverse Document Frequency, where Term Frequency is defined as:

$$\text{TF}(t, d) = \frac{\text{number of times } t \text{ appears in document } d}{\text{total number of terms in document } d}$$
Complete the following function below to calculate term frequency for every term in a document.
def term_frequency(BoW_dict):
pass
test = None
print(list(test)[10:20])
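A sketch that implements the Term Frequency formula above, dividing each raw count by the document's total word count:

```python
def term_frequency(BoW_dict):
    # Normalize each count by the total number of terms in the document
    total_word_count = sum(BoW_dict.values())
    return {word: count / total_word_count for word, count in BoW_dict.items()}

test = term_frequency(test_vectorized)
print(list(test)[10:20])
```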
The formula for Inverse Document Frequency is:

$$\text{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)$$
Now that we have this, we can easily calculate Inverse Document Frequency. In the cell below, complete the following function. This function should take in a list of dictionaries, with each item in the list being a Bag of Words representing a different song. The function should return a dictionary containing the inverse document frequency value for each word.
def inverse_document_frequency(list_of_dicts):
pass
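One possible sketch, following the Inverse Document Frequency formula above (`np` was imported at the top of the lab):

```python
def inverse_document_frequency(list_of_dicts):
    # Build the combined vocabulary across every document
    vocab = set()
    for bow in list_of_dicts:
        vocab.update(bow.keys())

    total_docs = len(list_of_dicts)
    idf_dict = {}
    for word in vocab:
        # Count the documents in which this word actually appears
        docs_with_word = sum(1 for bow in list_of_dicts if bow.get(word, 0) > 0)
        idf_dict[word] = np.log(total_docs / docs_with_word)
    return idf_dict
```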
Now that we can compute both Term Frequency and Inverse Document Frequency, computing an overall TF-IDF value is simple! All we need to do is multiply the two values.
In the cell below, complete the `tf_idf()` function. This function should take in a list of dictionaries, just as the `inverse_document_frequency()` function did. It should return a new list of dictionaries, with each dictionary containing the tf-idf vectorized representation of the corresponding song document.
NOTE: Each document should contain the full vocabulary of the entire combined corpus.
def tf_idf(list_of_dicts):
pass
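A possible sketch that expands each Bag of Words to the full corpus vocabulary and multiplies term frequency by inverse document frequency:

```python
def tf_idf(list_of_dicts):
    # Compute IDF once over the whole corpus; its keys form the full vocabulary
    idf = inverse_document_frequency(list_of_dicts)
    full_vocab = set(idf.keys())

    tf_idf_list = []
    for bow in list_of_dicts:
        # Expand the document so every word in the corpus vocabulary is present
        expanded_bow = {word: bow.get(word, 0) for word in full_vocab}
        tf = term_frequency(expanded_bow)
        tf_idf_list.append({word: tf[word] * idf[word] for word in full_vocab})
    return tf_idf_list
```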
Now that we've created all the necessary helper functions, we can load in all of our documents and run each through the vectorization pipeline we've just created.
In the cell below, complete the `main()` function. This function should take in a list of file names (provided for you in the `filenames` list we created at the start), and then:
- Read in each document
- Tokenize each document
- Convert each document to a Bag of Words (dictionary representation)
- Return a list of dictionaries vectorized using tf-idf, where each dictionary is a vectorized representation of a document.
HINT: Remember that all files are stored in the `data/` directory. Be sure to include this directory in the path when reading in each file, otherwise the path won't be correct!
def main(filenames):
pass
tf_idf_all_docs = None
print(list(tf_idf_all_docs[0])[:10])
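One possible implementation of `main()`, as a sketch. It assumes the helper functions defined above and the `data/` directory layout described earlier:

```python
def main(filenames):
    # Read, clean, tokenize, and count-vectorize every song
    count_vectorized_documents = []
    for filename in filenames:
        with open('data/' + filename) as f:
            raw_song = f.readlines()
        cleaned = clean_song(raw_song)
        tokenized = tokenize(cleaned)
        count_vectorized_documents.append(count_vectorize(tokenized))

    # Convert the whole corpus to its tf-idf representation
    return tf_idf(count_vectorized_documents)

tf_idf_all_docs = main(filenames)
print(list(tf_idf_all_docs[0])[:10])
```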
Now that we have a tf-idf representation of each document, we can move on to the fun part--visualizing everything!
Let's investigate how many dimensions our data currently has. In the cell below, examine our dataset to figure out how many dimensions our dataset has.
HINT: Remember that every word is its own dimension!
num_dims = len(tf_idf_all_docs[0])
print("Number of Dimensions: {}".format(num_dims))
That's much too high-dimensional for us to visualize! In order to make it understandable to human eyes, we'll need to reduce dimensionality to 2 or 3 dimensions.
To do this, we'll use a technique called t-SNE (short for t-distributed Stochastic Neighbor Embedding). This is too complex for us to code ourselves, so we'll make use of sklearn's implementation of it.
First, we need to pull the values out of the dictionaries stored in `tf_idf_all_docs` and store them in lists instead of dictionaries. This is because the t-SNE object only works with array-like objects, not dictionaries.

In the cell below, create a list of lists containing the values of each of the dictionaries stored in `tf_idf_all_docs`. The same ordering should remain--e.g. the first list should contain only the values from the first dictionary in `tf_idf_all_docs`, and so on.
tf_idf_vals_list = []
for i in tf_idf_all_docs:
tf_idf_vals_list.append(list(i.values()))
tf_idf_vals_list[0][:10]
Now that we have only the values, we can use the `TSNE` object from `sklearn` to transform our data appropriately. In the cell below, create a `TSNE` object with `n_components=3` passed in as a parameter. Then, use the created object's `fit_transform()` method to transform the data stored in `tf_idf_vals_list` into 3-dimensional data. Finally, inspect the newly transformed data to confirm that it has the correct dimensionality.
t_sne_object_3d = None
transformed_data_3d = None
transformed_data_3d
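One way to do this, as a sketch. Note that newer versions of `sklearn` require `perplexity` to be smaller than the number of samples (we only have 20 songs), so it is set explicitly here; on older versions the default also works:

```python
# Reduce the tf-idf vectors to 3 dimensions with t-SNE
t_sne_object_3d = TSNE(n_components=3, perplexity=19)
transformed_data_3d = t_sne_object_3d.fit_transform(np.array(tf_idf_vals_list))
transformed_data_3d.shape  # expect (20, 3)
```

The 2D version below is identical except for `n_components=2`.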
We'll also want to check out how the visualization looks in 2D. Repeat the process above, but this time, create a `TSNE` object with 2 components instead of 3. Again, use `fit_transform()` to transform the data and store it in the variable below, then inspect it to confirm the transformed data has only 2 dimensions.
t_sne_object_2d = None
transformed_data_2d = None
transformed_data_2d
Now, let's visualize everything! Run the cell below to create a 3D visualization of the songs.
# The first 10 documents are Kendrick Lamar songs, the last 10 are Garth Brooks songs
kendrick_3d = transformed_data_3d[:10]
k3_x = [i[0] for i in kendrick_3d]
k3_y = [i[1] for i in kendrick_3d]
k3_z = [i[2] for i in kendrick_3d]

garth_3d = transformed_data_3d[10:]
g3_x = [i[0] for i in garth_3d]
g3_y = [i[1] for i in garth_3d]
g3_z = [i[2] for i in garth_3d]
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(k3_x, k3_y, k3_z, c='b', s=60, label='Kendrick')
ax.scatter(g3_x, g3_y, g3_z, c='red', s=60, label='Garth')
ax.view_init(30, 10)
ax.legend()
plt.show()
kendrick_2d = transformed_data_2d[:10]
k2_x = [i[0] for i in kendrick_2d]
k2_y = [i[1] for i in kendrick_2d]
garth_2d = transformed_data_2d[10:]
g2_x = [i[0] for i in garth_2d]
g2_y = [i[1] for i in garth_2d]
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(222)
ax.scatter(k2_x, k2_y, c='b', label='Kendrick')
ax.scatter(g2_x, g2_y, c='red', label='Garth')
ax.legend()
plt.show()
Interesting! Take a crack at interpreting these graphs by answering the following question below:
What does each graph mean? Do you find one graph more informative than the other? Do you think that this method shows us discernable differences between Kendrick Lamar songs and Garth Brooks songs? Use the graphs and your understanding of TF-IDF to support your answer.
Write your answer to this question below:
# Your Written Answer Here
In this lab, we learned how to:
- Tokenize a corpus of words and identify the different choices to be made while parsing them.
- Use a Count Vectorization strategy to create a Bag of Words
- Use TF-IDF Vectorization with multiple documents to identify words that are important/unique to certain documents.
- Visualize and compare vectorized text documents.