In this lesson you'll investigate another implementation of the Bayesian framework in order to classify youtube videos into the appropriate topic. The dataset you'll be investigating again come from kaggle. For further information, you can check out the original dataset here: https://www.kaggle.com/extralime/math-lectures
You will be able to:
- Implement document classification using naive Bayes
- Understand the need for the Laplacian smoothing correction
- Explain how to code a bag of words representation
A common example of using Bayes' theorem to classify documents is a spam filtering algorithm. You'll be exploring this application in the upcoming lab. To do this, you examine the question "given this word (in the document) what is the probability that it is spam versus not spam?" For example, perhaps you get a lot of "special offer" spam. In that case, the words "special" and "offer" may increase the probability that a given message is spam.
Recall Bayes Theorem:
Applied to a document, one common implementation of Bayes' theorem is to use a bag of words representation. A bag of words representation takes a text document and converts it into a word frequency representation. For example, in a bag of words representation, the message:
"Thomas Bayes was born in the early 1700s, although his exact date of birth is unknown. As a Presbyterian in England, he took an unconventional approach to eduction for his day since Oxford and Cambridge were tied to the Church of England."
Would look like this:
doc = "Thomas Bayes was born in the early 1700s, although his exact date of birth is unknown. As a Presbyterian in England, he took an unconventional approach to eduction for his day since Oxford and Cambridge were tied to the Church of England."
bag = {}
for word in doc.split():
bag[word] = bag.get(word, 0) + 1 #Get the previous entry, or 0 if not yet documented; add 1
bag
{'Thomas': 1,
'Bayes': 1,
'was': 1,
'born': 1,
'in': 2,
'the': 2,
'early': 1,
'1700s,': 1,
'although': 1,
'his': 2,
'exact': 1,
'date': 1,
'of': 2,
'birth': 1,
'is': 1,
'unknown.': 1,
'As': 1,
'a': 1,
'Presbyterian': 1,
'England,': 1,
'he': 1,
'took': 1,
'an': 1,
'unconventional': 1,
'approach': 1,
'to': 2,
'eduction': 1,
'for': 1,
'day': 1,
'since': 1,
'Oxford': 1,
'and': 1,
'Cambridge': 1,
'were': 1,
'tied': 1,
'Church': 1,
'England.': 1}
Additional preprocessing techniques can also be applied to the document before applying a bag of words representation, many of which you'll explore later when further investigating NLP techniques.
Once you've converted the document into a bag of words representation, you can then implement Bayes' Theorem. Returning to the case of 'Spam' and 'Not Spam', you would have:
Using the bag of words representation, you can then define
However, this formulation has a problem: what if you encounter a word in the test set that was not present in the training set? This new word would have frequency of zero! This would commit two grave sins. First, there would be a division by zero error. Secondly, the numerator would also be zero; if you were to simply modify the denominator, having a term with zero probability would cause the probability for the entire document to also be zero when you subsequently multiplied the conditional probabilities in Multinomial Bayes. To effectively counteract these issues, Laplacian smoothing is often used giving
Now, to implement this in python!
import pandas as pd
df = pd.read_csv('raw_text.csv')
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
text | label | |
---|---|---|
0 | The following content is\nprovided under a Cre... | Calculus |
1 | In this sequence of segments,\nwe review some ... | Probability |
2 | The following content is\nprovided under a Cre... | CS |
3 | The following\ncontent is provided under a Cre... | Algorithms |
4 | The following\ncontent is provided under a Cre... | Algorithms |
df.label.value_counts()
Linear Algebra 152
Probability 124
CS 104
Diff. Eq. 93
Algorithms 81
Statistics 79
Calculus 70
Data Structures 62
AI 48
Math for Eng. 28
NLP 19
Name: label, dtype: int64
To simplify the problem, you can start by subsetting to two specific classes.
df2 = df[df['label'].isin(['Algorithms', 'Statistics'])]
df2.label.value_counts()
Algorithms 81
Statistics 79
Name: label, dtype: int64
p_classes = dict(df2.label.value_counts(normalize=True))
p_classes
{'Algorithms': 0.50625, 'Statistics': 0.49375}
df2.iloc[0]
text The following\ncontent is provided under a Cre...
label Algorithms
Name: 3, dtype: object
from sklearn.model_selection import train_test_split
X = df2.text
y = df2.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
class_word_freq = {} #Will be a nested dictionary of class_i : {word1:freq, word2:freq..., wordn:freq},.... class_m : {}
classes = train_df.label.unique()
for class_ in classes:
temp_df = train_df[train_df.label==class_]
bag = {}
for row in temp_df.index:
doc = temp_df['text'][row]
for word in doc.split():
bag[word] = bag.get(word, 0) + 1
class_word_freq[class_] = bag
vocabulary = set()
for text in train_df.text:
for word in text.split():
vocabulary.add(word)
V = len(vocabulary)
V
23977
def bag_it(doc):
bag = {}
for word in doc.split():
bag[word] = bag.get(word, 0) + 1
return bag
import numpy as np
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
bag = bag_it(doc)
classes = []
posteriors = []
for class_ in class_word_freq.keys():
p = p_classes[class_]
for word in bag.keys():
num = bag[word]+1
denom = class_word_freq[class_].get(word, 0) + V
p *= (num/denom)
classes.append(class_)
posteriors.append(p)
if return_posteriors:
print(posteriors)
return classes[np.argmax(posteriors)]
classify_doc(train_df.iloc[0]['text'], class_word_freq, p_classes, V, return_posteriors=True)
[0.0, 0.0]
'Algorithms'
As you can see from the output above, repeatedly multiplying small probabilities can lead to underflow; rounding to zero due to numerical approximation limitations. As such, a common alternative is to add the logarithms of the probabilities as opposed to multiplying the raw probabilities themselves. If this is alien to you, it might be worth reviewing some Algebra 2 rules of exponents and logarithms briefly:
$ e^x \cdot e^y = e^{x+y}$
$ log_{e}(e)=1 $
$ e^{log(x)} = x$
With that, here's an updated version of the function using log probabilities to avoid underflow:
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
bag = bag_it(doc)
classes = []
posteriors = []
for class_ in class_word_freq.keys():
p = np.log(p_classes[class_])
for word in bag.keys():
num = bag[word]+1
denom = class_word_freq[class_].get(word, 0) + V
p += np.log(num/denom)
classes.append(class_)
posteriors.append(p)
if return_posteriors:
print(posteriors)
return classes[np.argmax(posteriors)]
classify_doc(train_df.iloc[0]['text'], class_word_freq, p_classes, V, return_posteriors=True)
[-5578.267536771343, -5577.213285866603]
'Statistics'
classify_doc(train_df.iloc[10]['text'], class_word_freq, p_classes, V, return_posteriors=True)
[-2572.1544445158343, -2571.311308656896]
'Statistics'
classify_doc(train_df.iloc[12]['text'], class_word_freq, p_classes, V, return_posteriors=True)
[-4602.602622507951, -4601.755644621728]
'Statistics'
y_hat_train = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)
False 0.508333
True 0.491667
dtype: float64
As you can see, this algorithm leaves a lot to be desired. A measly 49% accuracy is nothing to write home about. (In fact, it's slightly worse then random guessing!) In practice, substantial additional preprocessing including removing stop words and using stemming or lemmatisation would be required. Even then, naive bayes might still not be the optimal algorithm. Nonetheless, it is a worthwhile exercise and a comprehendible algorithm.
In this lesson, you got to see another application of Bayes' Theorem as a means to do some rough documentation classification.