CA3 - Naive Bayes Classification

Yasaman Jafari

810195376

In this project, Naive Bayes is used to classify poems by two Persian poets, "Hafez" and "Saadi".

import pandas as pd
import re
import math
from collections import Counter

First, we read the data from the .csv file.

data = pd.read_csv("./Data/train_test.csv", encoding="utf-8")
data.head()
|   | text | label |
|---|------|-------|
| 0 | چون می‌رود این کشتی سرگشته که آخر | hafez |
| 1 | که همین بود حد امکانش | saadi |
| 2 | ارادتی بنما تا سعادتی ببری | hafez |
| 3 | خدا را زین معما پرده بردار | hafez |
| 4 | گویی که در برابر چشمم مصوری | saadi |

As you can see above, we are provided with about 20,000 samples, each labeled with its poet.

print(data["text"][0])
چون می‌رود این کشتی سرگشته که آخر

First, we need to split the data into a train set and a test set. Eighty percent of the data is used for training; the rest is held out for testing. We shuffle the data before splitting.

We choose a random subset for train and test so that the split is not ordered in any specific way and represents the entire poem collection better.

train = data.sample(frac = 0.8, random_state = 42)
test = data.drop(train.index)
print("Train Percentage: ", len(train) / (len(test) + len(train)))
print("Test Percentage: ", len(test) / (len(test) + len(train)))
Train Percentage:  0.7999904255828426
Test Percentage:  0.20000957441715736
train.head()
|       | text | label |
|-------|------|-------|
| 3128  | سه ماه می خور و نه ماه پارسا می‌باش | hafez |
| 8157  | زاهد بنگر نشسته دلتنگ | saadi |
| 6682  | ولیکن تا به چوگان می‌زنندش | saadi |
| 11526 | تا فخر دین عبدالصمد باشد که غمخواری کند | hafez |
| 7477  | تیغ جفا گر زنی ضرب تو آسایشست | saadi |

Now we need to find all the words in the poems. To do this, I created a new column which keeps the list of words used in each poem.

train['index'] = train.index
train['words'] = train.text.str.split()
train.head()
|       | text | label | index | words |
|-------|------|-------|-------|-------|
| 3128  | سه ماه می خور و نه ماه پارسا می‌باش | hafez | 3128 | [سه, ماه, می, خور, و, نه, ماه, پارسا, می‌باش] |
| 8157  | زاهد بنگر نشسته دلتنگ | saadi | 8157 | [زاهد, بنگر, نشسته, دلتنگ] |
| 6682  | ولیکن تا به چوگان می‌زنندش | saadi | 6682 | [ولیکن, تا, به, چوگان, می‌زنندش] |
| 11526 | تا فخر دین عبدالصمد باشد که غمخواری کند | hafez | 11526 | [تا, فخر, دین, عبدالصمد, باشد, که, غمخواری, کند] |
| 7477  | تیغ جفا گر زنی ضرب تو آسایشست | saadi | 7477 | [تیغ, جفا, گر, زنی, ضرب, تو, آسایشست] |

In order to find all the words used in the set, I combined all the lists and then converted the result to a set to remove duplicates.

Some words are very common in Persian and are not useful for classification; a few of them are listed below as stop words. (A possible filtering step is sketched right after this cell.)

stop_words = ['دل', 'گر', 'ما', 'هر', 'با', 'ای', 'سر', 'تا', 'چو', 'نه']
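Note that this list is defined but never applied in the rest of the notebook. A minimal sketch of how it could be used, assuming we filter the vocabulary words after it is built a few cells below (a hypothetical step, not part of the original pipeline):

# Hypothetical: drop stop words from the vocabulary so they are not used
# as features; `words` is the distinct-word list built below.
words = [word for word in words if word not in set(stop_words)]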

Now, I'll separate the train data for Hafez and Saadi so each poet's poems can be processed separately.

hafez_train = train[train['label'] == "hafez"].drop(['text', 'label', 'index'], axis=1)
saadi_train = train[train['label'] == "saadi"].drop(['text', 'label', 'index'], axis=1)

print("Hafez Count: ", len(hafez_train))
print("Saadi Count: ", len(saadi_train))
Hafez Count:  6753
Saadi Count:  9958

I'll consider each word as a feature. For this purpose, I need to find all distinct words in the train set.

words = []
for poem in train['words']:
    words += poem
words = list(set(words))
print("distinct_words_count: ", len(words))
distinct_words_count:  12645

The find_prob() function computes the conditional probability of each word given the poet. For each word, we count its occurrences in a poet's poems and divide by the total number of words used by that poet.

def find_prob(hafez_train, saadi_train):
    # Flatten the per-poem word lists into one list per poet.
    hafez_words = [word for poem in hafez_train['words'] for word in poem]
    print("Hafez_count: ", len(hafez_words))

    saadi_words = [word for poem in saadi_train['words'] for word in poem]
    print("Saadi_count: ", len(saadi_words))

    # Counter gives O(1) per-word lookups instead of repeated list.count() calls.
    hafez_counter = Counter(hafez_words)
    saadi_counter = Counter(saadi_words)

    train_all_word_count = pd.DataFrame({
        'word': words,
        'hafez_count': [hafez_counter[word] for word in words],
        'saadi_count': [saadi_counter[word] for word in words],
    }).set_index('word')

    train_all_word_count['hafez_prob'] = train_all_word_count['hafez_count'] / train_all_word_count['hafez_count'].sum()
    train_all_word_count['saadi_prob'] = train_all_word_count['saadi_count'] / train_all_word_count['saadi_count'].sum()

    return train_all_word_count, hafez_words, saadi_words
train_all_word_count, hafez_words, saadi_words = find_prob(hafez_train, saadi_train)
Hafez_count:  50650
Saadi_count:  70650
train_all_word_count.head()
| word | hafez_count | saadi_count | hafez_prob | saadi_prob |
|------|-------------|-------------|------------|------------|
| دلبند | 1 | 8 | 1.97433e-05 | 0.000113234 |
| بدبین | 1 | 0 | 1.97433e-05 | 0 |
| احوال | 7 | 4 | 0.000138203 | 5.66171e-05 |
| خودپسند | 1 | 0 | 1.97433e-05 | 0 |
| شهد | 3 | 7 | 5.923e-05 | 9.908e-05 |

The prior probabilities are calculated below. The prior probability of each poet is the number of that poet's poems divided by the total number of poems by both poets.

hafez_prob = len(hafez_train) / (len(hafez_train) + len(saadi_train))
saadi_prob = len(saadi_train) / (len(hafez_train) + len(saadi_train))

print("Hafez Probability: ", hafez_prob)
print("Saadi Probability: ", saadi_prob)
Hafez Probability:  0.4041050804859075
Saadi Probability:  0.5958949195140926

At this step, we could eliminate the words which are used only once in the train set, as a single occurrence is not distinctive. This step is left commented out below.

# train_all_word_count['all_count'] = train_all_word_count['hafez_count'] + train_all_word_count['saadi_count']
# one_occurrence = train_all_word_count[train_all_word_count['all_count'] == 1]
# once_used = list(one_occurrence.index)
# words = list(set(words) - set(once_used))

# train_all_word_count, hafez_words, saadi_words = find_prob(hafez_train, saadi_train)

Operate On Test Data

Now we have built our model and need to predict the poet of the test data.

test['index'] = test.index
test['words'] = test.text.str.split()
test.head()
|    | text | label | index | words |
|----|------|-------|-------|-------|
| 9  | رفتی و همچنان به خیال من اندری | saadi | 9 | [رفتی, و, همچنان, به, خیال, من, اندری] |
| 11 | آنجا که تویی رفتن ما سود ندارد | saadi | 11 | [آنجا, که, تویی, رفتن, ما, سود, ندارد] |
| 13 | اندرونم با تو می‌آید ولیک | saadi | 13 | [اندرونم, با, تو, می‌آید, ولیک] |
| 16 | که خوش آهنگ و فرح بخش هوایی دارد | hafez | 16 | [که, خوش, آهنگ, و, فرح, بخش, هوایی, دارد] |
| 24 | ناودان چشم رنجوران عشق | saadi | 24 | [ناودان, چشم, رنجوران, عشق] |

In Naive Bayes, we make the strong assumption that the features are conditionally independent given the class.

So, the probability of each poet given the words is proportional to the product of the probabilities of each word given the poet, multiplied by the prior probability, which is the probability of each poet in general.

After calculating this probability for Hafez and Saadi, we compare them and decide based on the result.
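Concretely, the decision rule implemented by predict() below picks the poet with the larger unnormalized posterior:

$$ \hat{c} = \underset{c \in \{\text{hafez},\ \text{saadi}\}}{\arg\max}\ P(c)\prod_{i=1}^{n} P(w_i|c) $$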

def predict(df):
    for index, row in df.iterrows():
        # Start from the prior of each poet.
        curr_hafez_prob = len(hafez_train) / (len(hafez_train) + len(saadi_train))
        curr_saadi_prob = len(saadi_train) / (len(hafez_train) + len(saadi_train))
        # Multiply in the likelihood of each distinct word of the poem.
        for word in set(row["words"]):
            if word in words:
                curr_hafez_prob *= train_all_word_count.at[word, 'hafez_prob']
                curr_saadi_prob *= train_all_word_count.at[word, 'saadi_prob']
        df.at[index, 'hafez_prob'] = curr_hafez_prob
        df.at[index, 'saadi_prob'] = curr_saadi_prob

    # Choose the poet with the larger unnormalized posterior.
    df['prediction_is_hafez'] = df['hafez_prob'] >= df['saadi_prob']

    prediction_poet = {True: 'hafez', False: 'saadi'}
    df['prediction'] = df['prediction_is_hafez'].map(prediction_poet)
predict(test)
test.head()
|    | text | label | index | words | hafez_prob | saadi_prob | prediction_is_hafez | prediction |
|----|------|-------|-------|-------|------------|------------|---------------------|------------|
| 9  | رفتی و همچنان به خیال من اندری | saadi | 9 | [رفتی, و, همچنان, به, خیال, من, اندری] | 0.000000e+00 | 6.067397e-21 | False | saadi |
| 11 | آنجا که تویی رفتن ما سود ندارد | saadi | 11 | [آنجا, که, تویی, رفتن, ما, سود, ندارد] | 2.686915e-23 | 5.948826e-22 | False | saadi |
| 13 | اندرونم با تو می‌آید ولیک | saadi | 13 | [اندرونم, با, تو, می‌آید, ولیک] | 0.000000e+00 | 1.970144e-16 | False | saadi |
| 16 | که خوش آهنگ و فرح بخش هوایی دارد | hafez | 16 | [که, خوش, آهنگ, و, فرح, بخش, هوایی, دارد] | 7.862324e-25 | 0.000000e+00 | True | hafez |
| 24 | ناودان چشم رنجوران عشق | saadi | 24 | [ناودان, چشم, رنجوران, عشق] | 4.217595e-06 | 8.562444e-06 | False | saadi |
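The exact zeros in hafez_prob and saadi_prob above come from words seen for only one poet (addressed by Laplace smoothing below), but with longer texts the product of many small probabilities can also underflow to zero in floating point. A common remedy, not used in this notebook, is to sum log-probabilities instead. A minimal sketch under that assumption; predict_log is a hypothetical helper and requires smoothed, non-zero probabilities:

def predict_log(df):
    # Hypothetical log-space variant of predict(): sums log-probabilities
    # instead of multiplying raw ones, avoiding floating-point underflow.
    log_prior_hafez = math.log(len(hafez_train) / (len(hafez_train) + len(saadi_train)))
    log_prior_saadi = math.log(len(saadi_train) / (len(hafez_train) + len(saadi_train)))
    for index, row in df.iterrows():
        log_hafez, log_saadi = log_prior_hafez, log_prior_saadi
        for word in set(row["words"]):
            if word in words:
                # math.log(0) would raise, so this assumes Laplace-smoothed probabilities.
                log_hafez += math.log(train_all_word_count.at[word, 'hafez_prob'])
                log_saadi += math.log(train_all_word_count.at[word, 'saadi_prob'])
        df.at[index, 'prediction'] = 'hafez' if log_hafez >= log_saadi else 'saadi'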

In order to evaluate how good our model is, we use:

  • Recall
  • Precision
  • Accuracy
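Taking Hafez as the positive class (as the code below does), with TP, FP, TN, FN the counts of true/false positives and negatives:

$$ \text{Recall} = \frac{TP}{TP + FN} \qquad \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$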
def evaluate(df):
    df['correct'] = (df['label'] == df['prediction'])
    correct_count = (df['correct']).sum()
    # True positives: poems that are Hafez's and predicted as Hafez.
    correct_hafez = (df[['correct', 'prediction_is_hafez']].all(axis='columns')).sum()
    all_hafez = (df['label'] == 'hafez').sum()
    all_hafez_detected = (df['prediction'] == 'hafez').sum()
    accuracy = correct_count / len(df)
    precision = correct_hafez / all_hafez_detected
    recall = correct_hafez / all_hafez
    print("Recall: ", recall)
    print("Precision: ", precision)
    print("Accuracy: ", accuracy)
evaluate(test)
Recall:  0.7225225225225225
Precision:  0.7229567307692307
Accuracy:  0.7790808999521303

Laplace Smoothing

If a word is used by only one of the poets, its probability given the other poet will be zero, and since we multiply the probabilities, the result will be zero regardless of all other features.

In order to fix this, we add a fixed alpha to each word's count in a poet's collection and add distinct_word_count * alpha to the denominator, so the smoothed probabilities still sum to one.
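With $V$ the number of distinct words across both poets (as in the code below), the smoothed estimate for each word $w$ is:

$$ P(w|\text{poet}) = \frac{\text{count}(w, \text{poet}) + \alpha}{\sum_{w'} \text{count}(w', \text{poet}) + V\alpha} $$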

alpha = 0.5
train_all_word_count['hafez_prob'] = (train_all_word_count['hafez_count'] + alpha) / (train_all_word_count['hafez_count'].sum() + (len(set(hafez_words + saadi_words))* alpha))
train_all_word_count['saadi_prob'] = (train_all_word_count['saadi_count'] + alpha) / (train_all_word_count['saadi_count'].sum() + (len(set(saadi_words + hafez_words))* alpha))
print(train_all_word_count['hafez_prob'].sum())
print(train_all_word_count['saadi_prob'].sum())
1.000000000000271
1.000000000000061
predict(test)
evaluate(test)
Recall:  0.7375375375375376
Precision:  0.7767235926628716
Accuracy:  0.8109143130684539

Evaluate

data['index'] = data.index
data['words'] = data.text.str.split()

hafez_data = data[data['label'] == "hafez"].drop(['text', 'label', 'index'], axis=1)
saadi_data = data[data['label'] == "saadi"].drop(['text', 'label', 'index'], axis=1)
words = []
for poem in data['words']:
    words += poem
words = list(set(words))
print("distinct_words_count: ", len(words))

train_all_word_count, hafez_words, saadi_words = find_prob(hafez_data, saadi_data)

eval_data = pd.read_csv("./Data/evaluate.csv", encoding="utf-8")

alpha = 0.5
train_all_word_count['hafez_prob'] = (train_all_word_count['hafez_count'] + alpha) / (train_all_word_count['hafez_count'].sum() + (len(set(hafez_words + saadi_words)) * alpha))
train_all_word_count['saadi_prob'] = (train_all_word_count['saadi_count'] + alpha) / (train_all_word_count['saadi_count'].sum() + (len(set(hafez_words + saadi_words)) * alpha))

eval_data['index'] = eval_data.id
eval_data['words'] = eval_data.text.str.split()

predict(eval_data)
distinct_words_count:  14084
Hafez_count:  63077
Saadi_count:  88560
eval_data.head()
|   | id | text | index | words | hafez_prob | saadi_prob | prediction_is_hafez | prediction |
|---|----|------|-------|-------|------------|------------|---------------------|------------|
| 0 | 1 | ور بی تو بامداد کنم روز محشر است | 1 | [ور, بی, تو, بامداد, کنم, روز, محشر, است] | 3.001295e-26 | 7.121545e-24 | False | saadi |
| 1 | 2 | ساقی بیار جامی کز زهد توبه کردم | 2 | [ساقی, بیار, جامی, کز, زهد, توبه, کردم] | 2.353900e-23 | 1.252923e-25 | True | hafez |
| 2 | 3 | مرا هرآینه خاموش بودن اولی‌تر | 3 | [مرا, هرآینه, خاموش, بودن, اولی‌تر] | 3.377382e-22 | 2.224892e-20 | False | saadi |
| 3 | 4 | تو ندانی که چرا در تو کسی خیره بماند | 4 | [تو, ندانی, که, چرا, در, تو, کسی, خیره, بماند] | 4.951443e-25 | 6.114006e-23 | False | saadi |
| 4 | 5 | کاینان به دل ربودن مردم معینند | 5 | [کاینان, به, دل, ربودن, مردم, معینند] | 1.256054e-23 | 5.174989e-22 | False | saadi |
output = pd.DataFrame({
    "id": eval_data['index'],
    "label": eval_data['prediction'],
})
output.to_csv('output.csv', index=False)
output.head()
|   | id | label |
|---|----|-------|
| 0 | 1 | saadi |
| 1 | 2 | hafez |
| 2 | 3 | saadi |
| 3 | 4 | saadi |
| 4 | 5 | saadi |

Report Questions and Explanations

Parameters

In my algorithm, each distinct word is considered a feature.

Bayesian probability consists of four parts:

  • Prior
  • Posterior
  • Likelihood
  • Evidence

$P(Poet|W_0, W_1, W_2, ..., W_n) = \frac{P(Poet)P(W_0, W_1, W_2, ..., W_n|Poet)}{P(W_0, W_1, W_2, ..., W_n)}$

In the equation above, we have:

  • Prior: P(Poet)
  • Likelihood: P(W0, W1, W2, ..., Wn|Poet)
  • Evidence: P(W0, W1, W2, ..., Wn)
  • Posterior: P(Poet|W0, W1, W2, ..., Wn)

In other words, the prior is the probability of each poet in general: how probable it is for a poem to belong to a certain poet without considering any other data. To calculate the prior for each poet, we divide the number of that poet's poems by the number of all given poems.
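For example, with the train split above, $P(\text{hafez}) = 6753 / (6753 + 9958) \approx 0.404$, matching the prior printed earlier.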

The likelihood is the probability of a poem's words given the poet. In Naive Bayes, the features are assumed independent given the class, so this is the product of the probabilities of each word given the poet; it captures how probable it is for a certain poet to use those words. The probability of each word given the poet is the number of times that word appears in that poet's works divided by the total number of words in that poet's works.
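For example, the word دلبند appears once among Hafez's 50650 training words, so $P(\text{دلبند}|\text{hafez}) = 1 / 50650 \approx 1.97 \times 10^{-5}$, matching the table above.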

The evidence is the probability of all the words that we have in a given poem. We do not need to calculate it, as it is the same for both poets and does not change the result of the comparison. If we wanted to calculate it, we could multiply the probabilities of all the words, where the probability of each word is its number of occurrences divided by the count of all words.

The posterior is the probability of a poet given the words in a poem. We use Bayes' rule, stated below, to calculate it.

$$ P(c|X) = \frac{P(c)\times\prod_{i=1}^{m} P(x_i|c)}{P(X)} $$

Extra Questions

1. What is the problem if we only use precision to evaluate our model?

If we only use precision to evaluate the model, we can reach 100 percent precision by correctly classifying just a single poem of the corresponding poet and predicting the other poet for all remaining poems.

$$ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $$

In other words, if we predict one of Hafez's poems correctly and assign all the other poems to Saadi, the precision for Hafez will be 100%.
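Numerically: one true positive and zero false positives gives a precision of $1/1 = 100\%$, even though the recall stays close to zero because almost every Hafez poem is missed.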

2. Why isn't accuracy enough for evaluating the model?

If the majority of the data belongs to a specific class, accuracy is not a good measure for evaluating our model. For instance, if we want to predict whether a person has cancer: since the majority of people do not have cancer, a model that simply predicts that no one has cancer gets a high accuracy, even though it is useless.

Laplace

If a word exists in only one poet's work in the training data, the probability of that word given the other poet will be zero, and since we take the product of these probabilities, the result will be zero regardless of all the other probabilities, so the poem will never be assigned to that poet.

In order to fix this, I added a small alpha to the count of each word while calculating the corresponding probability. I also added distinct_count * alpha to the denominator so that the new probabilities sum to 1.

For instance, the metrics before Laplace smoothing are shown below for one run:

  • Recall: 0.7213213213213213
  • Precision: 0.7200239808153477
  • Accuracy: 0.7771661081857348

After Laplace:

  • Recall: 0.7645645645645646
  • Precision: 0.755938242280285
  • Accuracy: 0.8078027764480613

As you can see above, all the percentages have improved.

Finally, as a sanity check, I compare the generated output file with a previously saved set of predictions.

data_1 = pd.read_csv("./output.csv", encoding="utf-8")
data_2 = pd.read_csv("/Users/yasaman/Desktop/yasaman.csv", encoding="utf-8")
data_1.head()
|   | id | label |
|---|----|-------|
| 0 | 1 | saadi |
| 1 | 2 | hafez |
| 2 | 3 | saadi |
| 3 | 4 | saadi |
| 4 | 5 | saadi |
data_2.head()
|   | id | label |
|---|----|-------|
| 0 | 1 | saadi |
| 1 | 2 | hafez |
| 2 | 3 | saadi |
| 3 | 4 | saadi |
| 4 | 5 | saadi |
(data_1 == data_2)
|      | id   | label |
|------|------|-------|
| 0    | True | True  |
| 1    | True | True  |
| 2    | True | True  |
| ...  | ...  | ...   |
| 1061 | True | False |
| ...  | ...  | ...   |
| 1081 | True | True  |

1082 rows × 2 columns (the two prediction files agree on every label except row 1061)

len(data_1)
1082