In this project, Naive Bayes is used to classify poems by the two Persian poets Hafez and Saadi.
import pandas as pd
import re
import math
from collections import Counter
First, we read the data from the .csv file.
data = pd.read_csv("./Data/train_test.csv", encoding="utf-8")
data.head()
| | text | label |
|---|---|---|
| 0 | چون میرود این کشتی سرگشته که آخر | hafez |
| 1 | که همین بود حد امکانش | saadi |
| 2 | ارادتی بنما تا سعادتی ببری | hafez |
| 3 | خدا را زین معما پرده بردار | hafez |
| 4 | گویی که در برابر چشمم مصوری | saadi |
As you can see above, we are given about 20,000 samples, each labeled with its poet.
print(data["text"][0])
چون میرود این کشتی سرگشته که آخر
First, we need to split our data into a train set and a test set. We shuffle the data, take 80% of it for training, and leave the rest for testing.
Sampling at random ensures the split is not ordered in any particular way and represents the whole collection of poems better.
train = data.sample(frac=0.8, random_state=42)
test = data.drop(train.index)
print("Train Percentage: ", len(train) / (len(test) + len(train)))
print("Test Percentage: ", len(test) / (len(test) + len(train)))
Train Percentage: 0.7999904255828426
Test Percentage: 0.20000957441715736
train.head()
| | text | label |
|---|---|---|
| 3128 | سه ماه می خور و نه ماه پارسا میباش | hafez |
| 8157 | زاهد بنگر نشسته دلتنگ | saadi |
| 6682 | ولیکن تا به چوگان میزنندش | saadi |
| 11526 | تا فخر دین عبدالصمد باشد که غمخواری کند | hafez |
| 7477 | تیغ جفا گر زنی ضرب تو آسایشست | saadi |
Now we need to find all the words in the poems. For this, I create a new column that holds the list of words used in each poem.
train['index'] = train.index
train['words'] = train['text'].str.split()
train.head()
| | text | label | index | words |
|---|---|---|---|---|
| 3128 | سه ماه می خور و نه ماه پارسا میباش | hafez | 3128 | [سه, ماه, می, خور, و, نه, ماه, پارسا, میباش] |
| 8157 | زاهد بنگر نشسته دلتنگ | saadi | 8157 | [زاهد, بنگر, نشسته, دلتنگ] |
| 6682 | ولیکن تا به چوگان میزنندش | saadi | 6682 | [ولیکن, تا, به, چوگان, میزنندش] |
| 11526 | تا فخر دین عبدالصمد باشد که غمخواری کند | hafez | 11526 | [تا, فخر, دین, عبدالصمد, باشد, که, غمخواری, کند] |
| 7477 | تیغ جفا گر زنی ضرب تو آسایشست | saadi | 7477 | [تیغ, جفا, گر, زنی, ضرب, تو, آسایشست] |
To find all the words used in the set, I combine all the word lists and then convert the result to a set to remove duplicates.
Some words are very common in Persian and are not useful for classification:
stop_words = ['دل', 'گر', 'ما', 'هر', 'با', 'ای', 'سر', 'تا', 'چو', 'نه']
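The stop list above is defined but not applied anywhere in the runs that follow. A minimal sketch of how it could be used, with remove_stop_words as a hypothetical helper (not part of the original pipeline):

def remove_stop_words(poem_words, stop_words):
    # Drop very common Persian words that carry little authorship signal.
    stop_set = set(stop_words)
    return [word for word in poem_words if word not in stop_set]

# e.g. train['words'] = train['words'].apply(lambda ws: remove_stop_words(ws, stop_words))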
Now, I'll separate the train data for Hafez and Saadi in order to process each poet's poems separately.
hafez_train = train[train['label'] == "hafez"].drop(['text', 'label', 'index'], axis=1)
saadi_train = train[train['label'] == "saadi"].drop(['text', 'label', 'index'], axis=1)
print("Hafez Count: ", len(hafez_train))
print("Saadi Count: ", len(saadi_train))
Hafez Count: 6753
Saadi Count: 9958
I'll consider each word as a feature. For this purpose, I need to find all distinct words in the train set.
words = []
for poem in train['words']:
words += poem
words = list(set(words))
print("distinct_words_count: ", len(words))
distinct_words_count: 12645
The find_prob() function computes the conditional probability of each word given the poet: for each word, we count its occurrences in that poet's poems and divide by the total number of words used by that poet.
def find_prob(hafez_train, saadi_train):
    # Flatten each poet's poems into one list of words.
    hafez_words = []
    for poem in hafez_train['words']:
        hafez_words += poem
    print("Hafez_count: ", len(hafez_words))
    saadi_words = []
    for poem in saadi_train['words']:
        saadi_words += poem
    print("Saadi_count: ", len(saadi_words))
    # Count occurrences once per poet; Counter avoids a slow list.count()
    # call per word, and DataFrame.append is deprecated in recent pandas.
    hafez_counts = Counter(hafez_words)
    saadi_counts = Counter(saadi_words)
    train_all_word_count = pd.DataFrame(
        {'hafez_count': [hafez_counts[word] for word in words],
         'saadi_count': [saadi_counts[word] for word in words]},
        index=pd.Index(words, name='word'))
    # P(word | poet) = occurrences of the word / total words by that poet.
    train_all_word_count['hafez_prob'] = train_all_word_count['hafez_count'] / train_all_word_count['hafez_count'].sum()
    train_all_word_count['saadi_prob'] = train_all_word_count['saadi_count'] / train_all_word_count['saadi_count'].sum()
    return train_all_word_count, hafez_words, saadi_words
train_all_word_count, hafez_words, saadi_words = find_prob(hafez_train, saadi_train)
Hafez_count: 50650
Saadi_count: 70650
train_all_word_count.head()
| word | hafez_count | saadi_count | hafez_prob | saadi_prob |
|---|---|---|---|---|
| دلبند | 1 | 8 | 1.97433e-05 | 0.000113234 |
| بدبین | 1 | 0 | 1.97433e-05 | 0 |
| احوال | 7 | 4 | 0.000138203 | 5.66171e-05 |
| خودپسند | 1 | 0 | 1.97433e-05 | 0 |
| شهد | 3 | 7 | 5.923e-05 | 9.908e-05 |
The prior probabilities are calculated below. The prior probability of each poet is the number of that poet's poems divided by the total number of poems by both poets.
hafez_prob = len(hafez_train) / (len(hafez_train) + len(saadi_train))
saadi_prob = len(saadi_train) / (len(hafez_train) + len(saadi_train))
print("Hafez Probability: ", hafez_prob)
print("Saadi Probability: ", saadi_prob)
Hafez Probability: 0.4041050804859075
Saadi Probability: 0.5958949195140926
At this step, we could also eliminate the words that occur only once in the whole train set, as a single occurrence cannot be distinctive; this step is left commented out.
# train_all_word_count['all_count'] = train_all_word_count['hafez_count'] + train_all_word_count['saadi_count']
# one_occurance = train_all_word_count[train_all_word_count['all_count'] == 1]
# once_used = list(one_occurance.index)
# words = list(set(words) - set(once_used))
# train_all_word_count, hafez_words, saadi_words = find_prob(hafez_train, saadi_train)
Now we have built our model and need to predict the poet of the test data.
test['index'] = test.index
test['words'] = test['text'].str.split()
test.head()
| | text | label | index | words |
|---|---|---|---|---|
| 9 | رفتی و همچنان به خیال من اندری | saadi | 9 | [رفتی, و, همچنان, به, خیال, من, اندری] |
| 11 | آنجا که تویی رفتن ما سود ندارد | saadi | 11 | [آنجا, که, تویی, رفتن, ما, سود, ندارد] |
| 13 | اندرونم با تو میآید ولیک | saadi | 13 | [اندرونم, با, تو, میآید, ولیک] |
| 16 | که خوش آهنگ و فرح بخش هوایی دارد | hafez | 16 | [که, خوش, آهنگ, و, فرح, بخش, هوایی, دارد] |
| 24 | ناودان چشم رنجوران عشق | saadi | 24 | [ناودان, چشم, رنجوران, عشق] |
In Naive Bayes, we make the strong assumption that the features are independent of one another.
So the probability of a poet given the words is proportional to the product of the probabilities of each word given the poet, multiplied by the prior probability, i.e. the probability of that poet in general.
After calculating this probability for Hafez and Saadi, we compare the two and decide based on the result.
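In symbols, for a poem with words $W_0, \ldots, W_n$:

$$P(\text{Poet} \mid W_0, \ldots, W_n) \propto P(\text{Poet}) \prod_{i=0}^{n} P(W_i \mid \text{Poet})$$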
def predict(df):
    for index, row in df.iterrows():
        # Start from the prior probability of each poet.
        curr_hafez_prob = len(hafez_train) / (len(hafez_train) + len(saadi_train))
        curr_saadi_prob = len(saadi_train) / (len(hafez_train) + len(saadi_train))
        # Multiply in P(word | poet) for every distinct word of the poem.
        for word in set(row["words"]):
            if word in words:
                curr_hafez_prob *= train_all_word_count.at[word, 'hafez_prob']
                curr_saadi_prob *= train_all_word_count.at[word, 'saadi_prob']
        df.at[index, 'hafez_prob'] = curr_hafez_prob
        df.at[index, 'saadi_prob'] = curr_saadi_prob
    # Choose the poet with the larger posterior; ties go to Hafez.
    df['prediction_is_hafez'] = df['hafez_prob'] >= df['saadi_prob']
    prediction_poet = {True: 'hafez', False: 'saadi'}
    df['prediction'] = df['prediction_is_hafez'].map(prediction_poet)
predict(test)
test.head()
| | text | label | index | words | hafez_prob | saadi_prob | prediction_is_hafez | prediction |
|---|---|---|---|---|---|---|---|---|
| 9 | رفتی و همچنان به خیال من اندری | saadi | 9 | [رفتی, و, همچنان, به, خیال, من, اندری] | 0.000000e+00 | 6.067397e-21 | False | saadi |
| 11 | آنجا که تویی رفتن ما سود ندارد | saadi | 11 | [آنجا, که, تویی, رفتن, ما, سود, ندارد] | 2.686915e-23 | 5.948826e-22 | False | saadi |
| 13 | اندرونم با تو میآید ولیک | saadi | 13 | [اندرونم, با, تو, میآید, ولیک] | 0.000000e+00 | 1.970144e-16 | False | saadi |
| 16 | که خوش آهنگ و فرح بخش هوایی دارد | hafez | 16 | [که, خوش, آهنگ, و, فرح, بخش, هوایی, دارد] | 7.862324e-25 | 0.000000e+00 | True | hafez |
| 24 | ناودان چشم رنجوران عشق | saadi | 24 | [ناودان, چشم, رنجوران, عشق] | 4.217595e-06 | 8.562444e-06 | False | saadi |
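The exact zeros in hafez_prob and saadi_prob above come from words unseen in one poet's training data (addressed with smoothing further below), but even the nonzero products shrink toward the floating-point underflow limit as poems get longer. A common safeguard is to sum log-probabilities instead; here is a minimal sketch under the same globals as predict(), assuming the smoothed (nonzero) probabilities. predict_log is a hypothetical variant, not part of the original pipeline:

def predict_log(df):
    # Same decision rule as predict(), but sums logs so the product of many
    # small likelihoods cannot underflow to 0.0.
    for index, row in df.iterrows():
        log_hafez = math.log(len(hafez_train) / (len(hafez_train) + len(saadi_train)))
        log_saadi = math.log(len(saadi_train) / (len(hafez_train) + len(saadi_train)))
        for word in set(row["words"]):
            if word in words:
                log_hafez += math.log(train_all_word_count.at[word, 'hafez_prob'])
                log_saadi += math.log(train_all_word_count.at[word, 'saadi_prob'])
        df.at[index, 'prediction_is_hafez'] = log_hafez >= log_saadi
    df['prediction'] = df['prediction_is_hafez'].map({True: 'hafez', False: 'saadi'})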
In order to evaluate how good our model is, we use:
- Recall
- Precision
- Accuracy
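Taking hafez as the positive class (TP = poems correctly predicted as Hafez's), these are computed as:

$$\text{Recall} = \frac{TP}{TP+FN}, \qquad \text{Precision} = \frac{TP}{TP+FP}, \qquad \text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$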
def evaluate(df):
    # Mark each row as correctly classified or not.
    df['correct'] = (df['label'] == df['prediction'])
    correct_count = (df['correct']).sum()
    # Treat hafez as the positive class for precision and recall.
    correct_hafez = (df[['correct', 'prediction_is_hafez']].all(axis='columns')).sum()
    all_hafez = (df['label'] == 'hafez').sum()
    all_hafez_detected = (df['prediction'] == 'hafez').sum()
    accuracy = correct_count / len(df)
    precision = correct_hafez / all_hafez_detected
    recall = correct_hafez / all_hafez
    print("Recall: ", recall)
    print("Precision: ", precision)
    print("Accuracy: ", accuracy)
evaluate(test)
Recall: 0.7225225225225225
Precision: 0.7229567307692307
Accuracy: 0.7790808999521303
If a word is used by only one poet, its probability given the other poet is zero; since we multiply the probabilities, the whole product becomes zero no matter what the other features say.
To fix this, we add a fixed alpha to every word count, and add the number of distinct words times alpha to the denominator.
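With this Laplace (additive) smoothing, the likelihood of each word becomes

$$P(w \mid \text{poet}) = \frac{\text{count}(w, \text{poet}) + \alpha}{\sum_{w'} \text{count}(w', \text{poet}) + |V|\,\alpha}$$

where $|V|$ is the number of distinct words used by the two poets combined, so each poet's smoothed probabilities still sum to 1 (as the prints below confirm).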
alpha = 0.5
vocab_size = len(set(hafez_words + saadi_words))  # |V|: distinct words over both poets
train_all_word_count['hafez_prob'] = (train_all_word_count['hafez_count'] + alpha) / (train_all_word_count['hafez_count'].sum() + vocab_size * alpha)
train_all_word_count['saadi_prob'] = (train_all_word_count['saadi_count'] + alpha) / (train_all_word_count['saadi_count'].sum() + vocab_size * alpha)
print(train_all_word_count['hafez_prob'].sum())
print(train_all_word_count['saadi_prob'].sum())
1.000000000000271
1.000000000000061
predict(test)
evaluate(test)
Recall: 0.7375375375375376
Precision: 0.7767235926628716
Accuracy: 0.8109143130684539
Finally, we retrain on the whole labeled dataset and classify the unlabeled evaluation set.
data['index'] = data.index
data['words'] = data['text'].str.split()
hafez_data = data[data['label'] == "hafez"].drop(['text', 'label', 'index'], axis=1)
saadi_data = data[data['label'] == "saadi"].drop(['text', 'label', 'index'], axis=1)
words = []
for poem in data['words']:
words += poem
words = list(set(words))
print("distinct_words_count: ", len(words))
train_all_word_count, hafez_words, saadi_words = find_prob(hafez_data, saadi_data)
eval_data = pd.read_csv("./Data/evaluate.csv", encoding="utf-8")
alpha = 0.5
vocab_size = len(set(hafez_words + saadi_words))  # |V|: distinct words over both poets
train_all_word_count['hafez_prob'] = (train_all_word_count['hafez_count'] + alpha) / (train_all_word_count['hafez_count'].sum() + vocab_size * alpha)
train_all_word_count['saadi_prob'] = (train_all_word_count['saadi_count'] + alpha) / (train_all_word_count['saadi_count'].sum() + vocab_size * alpha)
eval_data['index'] = eval_data.id
eval_data['words'] = eval_data['text'].str.split()
predict(eval_data)
distinct_words_count: 14084
Hafez_count: 63077
Saadi_count: 88560
eval_data.head()
| | id | text | index | words | hafez_prob | saadi_prob | prediction_is_hafez | prediction |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ور بی تو بامداد کنم روز محشر است | 1 | [ور, بی, تو, بامداد, کنم, روز, محشر, است] | 3.001295e-26 | 7.121545e-24 | False | saadi |
| 1 | 2 | ساقی بیار جامی کز زهد توبه کردم | 2 | [ساقی, بیار, جامی, کز, زهد, توبه, کردم] | 2.353900e-23 | 1.252923e-25 | True | hafez |
| 2 | 3 | مرا هرآینه خاموش بودن اولیتر | 3 | [مرا, هرآینه, خاموش, بودن, اولیتر] | 3.377382e-22 | 2.224892e-20 | False | saadi |
| 3 | 4 | تو ندانی که چرا در تو کسی خیره بماند | 4 | [تو, ندانی, که, چرا, در, تو, کسی, خیره, بماند] | 4.951443e-25 | 6.114006e-23 | False | saadi |
| 4 | 5 | کاینان به دل ربودن مردم معینند | 5 | [کاینان, به, دل, ربودن, مردم, معینند] | 1.256054e-23 | 5.174989e-22 | False | saadi |
output = pd.DataFrame({
"id": eval_data['index'],
"label": eval_data['prediction'],
})
output.to_csv('output.csv', index=False)
output.head()
| | id | label |
|---|---|---|
| 0 | 1 | saadi |
| 1 | 2 | hafez |
| 2 | 3 | saadi |
| 3 | 4 | saadi |
| 4 | 5 | saadi |
In my algorithm, each distinct word is considered a feature.
Bayes' rule involves four quantities:
- Prior
- Posterior
- Likelihood
- Evidence
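For this problem, Bayes' rule reads:

$$P(\text{Poet} \mid W_0, W_1, \ldots, W_n) = \frac{P(W_0, W_1, \ldots, W_n \mid \text{Poet}) \, P(\text{Poet})}{P(W_0, W_1, \ldots, W_n)}$$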
In the equation above, we have:
- Prior: P(Poet)
- Likelihood: P(W0, W1, W2, ..., Wn|Poet)
- Evidence: P(W0, W1, W2, ..., Wn)
- Posterior: P(Poet|W0, W1, W2, ..., Wn)
In other words, the prior is the probability of each poet in general: how probable it is that a poem belongs to a certain poet without considering any other data. To calculate the prior for each poet, we divide the number of that poet's poems by the total number of given poems.
The likelihood is the probability of the words of a poem given the poet. In Naive Bayes, each feature is assumed independent of the others, so the likelihood is the product of the probabilities of each word given the poet: how probable it is for a certain poet to use those words. The probability of each word given the poet is the number of times that word appears in the poet's works divided by the total number of words in those works.
The evidence is the probability of all the words in a given poem. We do not need to calculate it, since it is the same for both poets and does not change the outcome of the comparison. If we wanted to calculate it, we could multiply the probabilities of all the words, where the probability of each word is its number of occurrences divided by the count of all words.
The posterior is the probability of a poet given the words in a poem. We use Bayes' rule, stated above, to calculate it.
1. What is the problem if we only use precision to evaluate our model?
If we only use precision to evaluate the model, we can reach 100 percent precision by correctly classifying just one poem of the corresponding poet and predicting the other poet for everything else.
In other words, if we assign one of Hafez's poems correctly to Hafez and assign all the other poems to Saadi, the precision for Hafez will be 100%.
2. Why isn't accuracy enough for evaluating the model?
If the majority of the data belongs to one class, accuracy is not a good measure for evaluating the model. For instance, when predicting whether a person has cancer, the majority of people do not have cancer, so a model that simply predicts "no cancer" for everyone gets a high accuracy, having labeled almost every case correctly, while being useless as a classifier.
If a word exists in only one poet's work in the training data, the probability of that word given the other poet is zero, and since we take the product of these probabilities, the result becomes zero regardless of all other features, so the poem can never be assigned to that poet.
To fix this, I added a small alpha to the count of each word when calculating the corresponding probability. I also added distinct_count * alpha to the denominator so that the new probabilities still sum to 1.
For instance, the metrics before Laplace smoothing in one run were:
- Recall: 0.7213213213213213
- Precision: 0.7200239808153477
- Accuracy: 0.7771661081857348
After Laplace smoothing:
- Recall: 0.7645645645645646
- Precision: 0.755938242280285
- Accuracy: 0.8078027764480613
As you can see above, all the metrics have improved.
Finally, I compare the generated output with another set of predictions (yasaman.csv).
data_1 = pd.read_csv("./output.csv", encoding="utf-8")
data_2 = pd.read_csv("/Users/yasaman/Desktop/yasaman.csv", encoding="utf-8")
data_1.head()
| | id | label |
|---|---|---|
| 0 | 1 | saadi |
| 1 | 2 | hafez |
| 2 | 3 | saadi |
| 3 | 4 | saadi |
| 4 | 5 | saadi |
data_2.head()
| | id | label |
|---|---|---|
| 0 | 1 | saadi |
| 1 | 2 | hafez |
| 2 | 3 | saadi |
| 3 | 4 | saadi |
| 4 | 5 | saadi |
data_1 == data_2
| | id | label |
|---|---|---|
| 0 | True | True |
| 1 | True | True |
| 2 | True | True |
| ... | ... | ... |
| 1061 | True | False |
| ... | ... | ... |
| 1081 | True | True |

1082 rows × 2 columns
len(data_1)
1082