
Word Vectorization - Lab

Introduction

In this lab, you'll tokenize and vectorize text documents, create and use a bag of words, and identify words unique to individual documents using TF-IDF vectorization.

Objectives

In this lab you will:

  • Implement tokenization and count vectorization from scratch
  • Implement TF-IDF from scratch
  • Use dimensionality reduction on vectorized text data to create and interpret visualizations

Let's get started!

Run the cell below to import everything necessary for this lab.

import pandas as pd
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
np.random.seed(0)

Our Corpus

In this lab, we'll be working with 20 different documents, each containing song lyrics from either Garth Brooks or Kendrick Lamar albums.

The songs are contained in the data subdirectory, located in the same folder as this lab. Each song is stored in its own file, with filenames ranging from song1.txt to song20.txt.

To make it easy to read in all of the documents, use a list comprehension in the cell below to create a list containing the path of every song file.

filenames = None
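
If you get stuck, here is a minimal sketch, assuming the files are named song1.txt through song20.txt and live in a data/ subdirectory next to this notebook:

# One possible approach: build each path with a list comprehension
filenames = ['data/song' + str(i) + '.txt' for i in range(1, 21)]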

Next, create an empty DataFrame called songs_df. As we read in and clean the songs, we'll store them in this DataFrame.

songs_df = None
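
For reference, one way to create the empty DataFrame:

# An empty DataFrame we can add cleaned songs to later
songs_df = pd.DataFrame()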

Next, let's import a single song to see what our text looks like so that we can make sure we clean and tokenize it correctly.

In the cell below, read in the lyrics from song11.txt, store them in a variable called test_song, and print them out. Use vanilla Python -- no pandas needed.

# Import and print song11.txt
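
Here is one possible sketch using plain file handling (again assuming the data/ paths from above):

# Read the raw lyrics in as a list of lines so we can filter them later
with open('data/song11.txt') as f:
    test_song = f.readlines()
print(test_song)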

Tokenizing our Data

Before we can create a bag of words or vectorize each document, we need to clean it up and split each song into an array of individual words. Computers are very particular about strings. If we tokenized our data in its current state, we would run into the following problems:

  • Counting things that aren't actually words. In the example above, "[Kendrick]" is a note specifying who is speaking, not a lyric contained in the actual song, so it should be removed.
  • Punctuation and capitalization would mess up our word counts. To the Python interpreter, love, Love, Love?, and Love\n are all unique words, and would all be counted separately. We need to remove punctuation and capitalization, so that all words will be counted correctly.

Consider the following sentences from the example above:

"Love, let's talk about love\n", 'Is it anything and everything you hoped for?\n'

After tokenization, this should look like:

['love', "let's", 'talk', 'about', 'love', 'is', 'it', 'anything', 'and', 'everything', 'you', 'hoped', 'for']

Tokenization is pretty tedious to handle manually, and would probably require regular expressions, which are outside the scope of this lab. To keep this lab moving, we'll use a library function to clean and tokenize our data so that we can move on to vectorization.

Tokenization is a required step for just about any Natural Language Processing (NLP) task, so industry-standard tools exist to handle it for us. This lets us spend our time on more important work instead of getting bogged down hunting down every special symbol or punctuation mark in a massive dataset. For this lab, we'll make use of the tokenizer in the excellent nltk library, short for Natural Language Toolkit.

NOTE: NLTK requires additional data (such as tokenizer models) to be downloaded the first time certain functions are used. If nltk throws an error about needing to download additional packages, follow the instructions in the error message (typically a call to nltk.download()), and then rerun the cell.

Before we tokenize our songs, we'll do a small bit of manual cleaning. In the cell below, write a function that removes any lines containing bracketed artist names (e.g. '[Kendrick Lamar:]'), so that our song files contain only lyrics that are actually in the song. For the lines that remain, make every word lowercase, remove newline characters \n, and remove all of the following punctuation marks: ",.'?!"

Test the function on test_song to show that it successfully removes '[Kendrick Lamar:]' and other artist-name annotations from the song and returns the cleaned lyrics.

def clean_song(song):
    pass

song_without_brackets = None
print(song_without_brackets)
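
If you want to check your work, here is one possible implementation. It assumes that any line containing a square bracket is a speaker annotation rather than a lyric:

def clean_song(song):
    cleaned_lines = []
    for line in song:
        # Skip annotations such as '[Kendrick Lamar:]' (assumption: only
        # annotation lines contain square brackets)
        if '[' in line:
            continue
        # Lowercase the line, then strip newlines and the listed punctuation
        line = line.lower()
        for symbol in ",.'?!\n":
            line = line.replace(symbol, '')
        cleaned_lines.append(line)
    return cleaned_lines

song_without_brackets = clean_song(test_song)
print(song_without_brackets)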

Great. Now, write a function that takes in a song that has had its bracketed lines removed, joins all of the lines into a single string, and then uses nltk's word_tokenize() on that string to get a fully tokenized version of the song. Test this function on song_without_brackets to ensure that it works.

def tokenize(song):
    pass

tokenized_test_song = None
tokenized_test_song[:10]
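
A minimal sketch, relying on the word_tokenize() function we imported from nltk at the top of the lab:

def tokenize(song):
    # Join the cleaned lines into a single string, then let nltk split it into tokens
    joined_song = ' '.join(song)
    return word_tokenize(joined_song)

tokenized_test_song = tokenize(song_without_brackets)
tokenized_test_song[:10]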

Great! Now that we can tokenize our songs, we can move onto vectorization.

Count Vectorization

Machine learning algorithms don't understand strings. However, they do understand math, which means they understand vectors and matrices. By vectorizing the text, we convert it into a vector, where each element in the vector represents a different word. The vector is the length of the entire vocabulary -- usually every word that occurs in the English language, or at least every word that appears in our corpus. Any given sentence can then be represented as a vector in which each element holds the number of times the corresponding word appears in the sentence (and 0 if it does not appear).

Consider the following example:

"I scream, you scream, we all scream for ice cream."
word:  'aardvark' 'apple' [...] 'I' 'you' 'scream' 'we' 'all' 'for' 'ice' 'cream' [...] 'xylophone' 'zebra'
count:     0         0      0    1    1      3      1     1     1     1      1      0        0          0

This is called a Sparse Representation, since the vast majority of the columns will have a value of 0. Note that elements corresponding to words that do not occur in the sentence have a value of 0, while elements for words that do appear hold the number of times that word appears in the sentence.

Alternatively, we can represent this sentence as a plain old Python dictionary of word frequency counts:

BoW = {
    'I':1,
    'you':1,
    'scream':3,
    'we':1,
    'all':1,
    'for':1,
    'ice':1,
    'cream':1
}

Both of these are examples of Count Vectorization. They allow us to represent a sentence as a vector, with each element in the vector corresponding to how many times that word is used.

Positional Information and Bag of Words

Notice that when we vectorize a sentence this way, we lose the order that the words were in. This is the Bag of Words approach mentioned earlier. Note that sentences that contain the same words will create the same vectors, even if they mean different things -- e.g. 'cats are scared of dogs' and 'dogs are scared of cats' would both produce the exact same vector, since they contain the same words.

In the cell below, create a function that takes in a tokenized, cleaned song and returns a count vectorized representation of it as a Python dictionary. Add in an optional parameter called vocab that defaults to None. This way, if we are using a vocabulary that contains words not seen in the song, we can still use this function by passing it into the vocab parameter.

Hint: Consider using a set() to make this easier!

def count_vectorize(song, vocab=None):
    pass

test_vectorized = None
print(test_vectorized)
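
One possible implementation, using a set to build the vocabulary whenever one isn't passed in (it assumes that any vocab passed in covers every word in the song):

def count_vectorize(song, vocab=None):
    # Use the provided vocabulary, or build one from the unique words in this song
    unique_words = vocab if vocab is not None else set(song)

    # Start every word at 0, then tally each token in the song
    song_dict = {word: 0 for word in unique_words}
    for word in song:
        song_dict[word] += 1
    return song_dict

test_vectorized = count_vectorize(tokenized_test_song)
print(test_vectorized)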

Great! You've just successfully vectorized your first text document! Now, let's look at a more advanced type of vectorization, TF-IDF!

TF-IDF Vectorization

TF-IDF stands for Term Frequency-Inverse Document Frequency. This is a more advanced form of vectorization that weighs each term in a document by how unique it is to that document, which allows us to summarize the contents of a document using a few key words. If a word is used often in many other documents, it is not unique, and therefore probably not too useful for figuring out how this document differs from the others. Conversely, if a word is used many times in a document but rarely in the other documents we are considering, it is likely a good indicator that this word is important to the document in question.

The formula TF-IDF uses to determine the weights of each term in a document is Term Frequency multiplied by Inverse Document Frequency, where the formula for Term Frequency is:

$$\large Term\ Frequency(t) = \frac{number\ of\ times\ t\ appears\ in\ a\ document} {total\ number\ of\ terms\ in\ the\ document} $$

Complete the function in the cell below to calculate the term frequency of every term in a document.

def term_frequency(BoW_dict):
    pass

test = None
print(list(test)[10:20])
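
A sketch that follows the formula above, treating the "total number of terms" as the sum of all the counts in the bag of words:

def term_frequency(BoW_dict):
    # Divide each raw count by the total number of terms in the document
    total_terms = sum(BoW_dict.values())
    return {word: count / total_terms for word, count in BoW_dict.items()}

test = term_frequency(test_vectorized)
print(list(test)[10:20])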

Now that we have this, we can easily calculate Inverse Document Frequency. In the cell below, complete the following function. This function should take in a list of dictionaries, with each item in the list being a bag of words representing a different song. The function should return a dictionary containing the inverse document frequency value for each word.

The formula for Inverse Document Frequency is:



$$\large IDF(t) = log_e(\frac{Total\ Number\ of\ Documents}{Number\ of\ Documents\ with\ t\ in\ it})$$

def inverse_document_frequency(list_of_dicts):
    pass
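
One possible sketch, pooling the vocabulary across every dictionary and using np.log for the natural logarithm:

def inverse_document_frequency(list_of_dicts):
    # Build the combined vocabulary across every document
    vocab = set()
    for bow in list_of_dicts:
        vocab.update(bow.keys())

    total_docs = len(list_of_dicts)
    idf_dict = {}
    for word in vocab:
        # Count how many documents contain this word at least once
        docs_with_word = sum(1 for bow in list_of_dicts if bow.get(word, 0) > 0)
        idf_dict[word] = np.log(total_docs / docs_with_word)
    return idf_dict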

Computing TF-IDF

Now that we can compute both Term Frequency and Inverse Document Frequency, computing an overall TF-IDF value is simple! All we need to do is multiply the two values.

In the cell below, complete the tf_idf() function. This function should take in a list of dictionaries, just as the inverse_document_frequency() function did. It should return a new list of dictionaries, with each dictionary containing the TF-IDF vectorized representation of the corresponding song document.

NOTE: Each document should contain the full vocabulary of the entire combined corpus.

def tf_idf(list_of_dicts):
    pass
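
A sketch under the assumption that the term_frequency() and inverse_document_frequency() functions above are already working:

def tf_idf(list_of_dicts):
    # Build the full vocabulary so that every document vector shares the same keys
    vocab = set()
    for bow in list_of_dicts:
        vocab.update(bow.keys())

    idf = inverse_document_frequency(list_of_dicts)

    tf_idf_list = []
    for bow in list_of_dicts:
        # Expand each document to the full vocabulary before computing term frequency
        full_bow = {word: bow.get(word, 0) for word in vocab}
        tf = term_frequency(full_bow)
        tf_idf_list.append({word: tf[word] * idf[word] for word in vocab})
    return tf_idf_list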

Vectorizing All Documents

Now that we've created all the necessary helper functions, we can load in all of our documents and run each through the vectorization pipeline we've just created.

In the cell below, complete the main() function. This function should take in a list of filenames (use the filenames list we created at the start of the lab), and then:

  • Read in each document
  • Tokenize each document
  • Convert each document to a bag of words (dictionary representation)
  • Return a list of dictionaries vectorized using tf-idf, where each dictionary is a vectorized representation of a document

def main(filenames):
    pass

tf_idf_all_docs = None
print(list(tf_idf_all_docs[0])[:10])
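
One way to tie the helper functions together (again assuming the data/ paths built at the start of the lab):

def main(filenames):
    # Read, clean, tokenize, and count-vectorize every song
    count_vectorized_documents = []
    for filename in filenames:
        with open(filename) as f:
            raw_song = f.readlines()
        cleaned = clean_song(raw_song)
        tokenized = tokenize(cleaned)
        count_vectorized_documents.append(count_vectorize(tokenized))

    # Convert the whole corpus to TF-IDF in one pass
    return tf_idf(count_vectorized_documents)

tf_idf_all_docs = main(filenames)
print(list(tf_idf_all_docs[0])[:10])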

Visualizing our Vectorizations

Now that we have a tf-idf representation of each document, we can move on to the fun part -- visualizing everything!

In the cell below, examine the dataset to figure out how many dimensions it has.

HINT: Remember that every word is its own dimension!

num_dims = None
print("Number of Dimensions: {}".format(num_dims))

There are too many dimensions for us to visualize! In order to make it understandable to human eyes, we'll need to reduce it to 2 or 3 dimensions.

To do this, we'll use a technique called t-SNE (short for t-distributed Stochastic Neighbor Embedding). This algorithm is too complex for us to code ourselves, so we'll make use of scikit-learn's implementation.

First, we need to drop the keys (the words) from the dictionaries stored in tf_idf_all_docs so that only the values remain, and store those values in lists instead of dictionaries. This is because t-SNE only works with array-like objects, not dictionaries.

In the cell below, create a list of lists that contains a list representation of the values of each of the dictionaries stored in tf_idf_all_docs. The same structure should remain -- e.g. the first list should contain only the values that were in the first dictionary in tf_idf_all_docs, and so on.

tf_idf_vals_list = []

for i in tf_idf_all_docs:
    tf_idf_vals_list.append(list(i.values()))
    
tf_idf_vals_list[0][:10]

Now that we have only the values, we can use the TSNE() class from sklearn to transform our data appropriately. In the cell below, instantiate TSNE() with n_components=3. Then, use the created object's .fit_transform() method to transform the data stored in tf_idf_vals_list into 3-dimensional data. Finally, inspect the newly transformed data to confirm that it has the correct dimensionality.

t_sne_object_3d = None
transformed_data_3d = None
transformed_data_3d
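
A sketch of the 3D reduction. Note that newer versions of scikit-learn require perplexity to be smaller than the number of samples, so with only 20 songs you may need to pass something like perplexity=5:

# Reduce the high-dimensional TF-IDF vectors down to 3 dimensions for plotting
t_sne_object_3d = TSNE(n_components=3)
transformed_data_3d = t_sne_object_3d.fit_transform(np.array(tf_idf_vals_list))
transformed_data_3d

The 2D version below is identical apart from passing n_components=2.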

We'll also want to check out how the visualization looks in 2d. Repeat the process above, but this time, instantiate TSNE() with 2 components instead of 3. Again, use .fit_transform() to transform the data and store it in the variable below, and then inspect it to confirm the transformed data has only 2 dimensions.

t_sne_object_2d = None
transformed_data_2d = None
transformed_data_2d

Now, let's visualize everything! Run the cell below to view both 3D and 2D visualizations of the songs.

# The first 10 songs belong to Kendrick Lamar, the last 10 to Garth Brooks
kendrick_3d = transformed_data_3d[:10]
k3_x = [i[0] for i in kendrick_3d]
k3_y = [i[1] for i in kendrick_3d]
k3_z = [i[2] for i in kendrick_3d]

garth_3d = transformed_data_3d[10:]
g3_x = [i[0] for i in garth_3d]
g3_y = [i[1] for i in garth_3d]
g3_z = [i[2] for i in garth_3d]

fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(k3_x, k3_y, k3_z, c='b', s=60, label='Kendrick')
ax.scatter(g3_x, g3_y, g3_z, c='red', s=60, label='Garth')
ax.view_init(30, 10)
ax.legend()
plt.show()

kendrick_2d = transformed_data_2d[:10]
k2_x = [i[0] for i in kendrick_2d]
k2_y = [i[1] for i in kendrick_2d]

garth_2d = transformed_data_2d[10:]
g2_x = [i[0] for i in garth_2d]
g2_y = [i[1] for i in garth_2d]

fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(222)
ax.scatter(k2_x, k2_y, c='b', label='Kendrick')
ax.scatter(g2_x, g2_y, c='red', label='Garth')
ax.legend()
plt.show()

Interesting! Take a crack at interpreting these graphs by answering the following questions below:

What does each graph mean? Do you find one graph more informative than the other? Do you think that this method shows us discernable differences between Kendrick Lamar songs and Garth Brooks songs? Use the graphs and your understanding of TF-IDF to support your answer.

Write your answer to this question below this line:


Both graphs show a basic separation between the red and blue dots, although the 3-dimensional graph is more informative than the 2-dimensional one. We see a separation between the two artists because each artist uses words that the other does not. Words that are common to both artists are pushed down to very small values, or to 0, by the log operation in the IDF term. This means that the elements of each song vector with the highest values correspond to words that are unique to that specific document, or at least rarely used in the others.

Summary

In this lab, you learned how to:

  • Tokenize a corpus of words and identify the different choices to be made while parsing them
  • Use a count vectorization strategy to create a bag of words
  • Use TF-IDF vectorization with multiple documents to identify words that are important/unique to certain documents
  • Visualize and compare vectorized text documents