/nlp_multinomial_naive_bayes

Text classification with Multinomial Naive Bayes and TF-IDF

Primary LanguageJupyter Notebook

Text Classification with Multinomial Naive Bayes

  1. Load Textual Data
  2. Text Preprocessing (TF-IDF, word count)
  3. Train Classifier
  4. Evaluate Results
  5. Test Model

Importing Libraries

from sklearn.naive_bayes import MultinomialNB # classifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer # text vectorizer
#from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score  # evaluation
from sklearn.datasets import fetch_20newsgroups # data
import matplotlib.pyplot as plt # visualization
import pandas as pd # data representation
from sklearn.pipeline import make_pipeline
from sklearn.metrics import plot_confusion_matrix
import pandas as pd

1. Load Textual Data

News articles in 20 different categories, for this tutorial we choose the following:

  • alt.atheism
  • comp.graphics
  • sci.med
  • soc.religion.christian
news = fetch_20newsgroups()
news.target_names
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
target_categories = ['alt.atheism','comp.graphics','sci.med','soc.religion.christian']

train = fetch_20newsgroups(subset='train', categories=target_categories)
test = fetch_20newsgroups(subset='test', categories=target_categories)
len(test.data), len(train.data)
(1502, 2257)

Sample

print(f'CATEGORY: {target_categories[train.target[0]]}')
print('-' * 80)
print(train.data[0])
print('-' * 80)
CATEGORY: comp.graphics
--------------------------------------------------------------------------------
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.

--------------------------------------------------------------------------------

2. Text preprocessing

Text must be represented as numbers (vectors). There are several useful techniques to transform text into vectors:

  1. TF-IDF (Term Frequency - Inverse Document Frequency)
  2. Word Count
sample_sentences = [
    'My name is George, this is my name', 
    'I like apples', 
    'apple is my favorite fruit'
    ]

TF-IDF

tfidf = TfidfVectorizer()
vectorizer = tfidf.fit_transform(sample_sentences)
pd.DataFrame(vectorizer.toarray(), columns=tfidf.get_feature_names())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
apple apples favorite fruit george is like my name this
0 0.000000 0.000000 0.000000 0.000000 0.306754 0.466589 0.000000 0.466589 0.613509 0.306754
1 0.000000 0.707107 0.000000 0.000000 0.000000 0.000000 0.707107 0.000000 0.000000 0.000000
2 0.490479 0.000000 0.490479 0.490479 0.000000 0.373022 0.000000 0.373022 0.000000 0.000000

Words Counting

count_vector = CountVectorizer()
vectorizer = count_vector.fit_transform(sample_sentences)
pd.DataFrame(vectorizer.toarray(), columns=count_vector.get_feature_names())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
apple apples favorite fruit george is like my name this
0 0 0 0 0 1 2 0 2 2 1
1 0 1 0 0 0 0 1 0 0 0
2 1 0 1 1 0 1 0 1 0 0

3. Model

Build two models, but use different vectorization techniques: TF-IDF and Word Count

model_tfidf = make_pipeline(TfidfVectorizer(), MultinomialNB())
model_count = make_pipeline(CountVectorizer(), MultinomialNB())

1. Training

model_tfidf.fit(train.data, train.target), \
model_count.fit(train.data, train.target)
(Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                 ('multinomialnb', MultinomialNB())]),
 Pipeline(steps=[('countvectorizer', CountVectorizer()),
                 ('multinomialnb', MultinomialNB())]))

2. Predicting

y_pred_tfidf = model_tfidf.predict(test.data)
y_pred_count = model_count.predict(test.data)

3. Evaluation

f1 = f1_score(test.target, y_pred_tfidf, average='weighted')
accuracy = accuracy_score(test.target, y_pred_tfidf)
print('Multinomial Naive Bayes with TF-IDF:')
print('-' * 40)
print(f'f1: {f1:.4f}')
print(f'accuracy: {accuracy:.4f}')
Multinomial Naive Bayes with TF-IDF:
----------------------------------------
f1: 0.8368
accuracy: 0.8349
f1 = f1_score(test.target, y_pred_count, average='weighted')
accuracy = accuracy_score(test.target, y_pred_count)
print('Multinomial Naive Bayes with Word Count:')
print('-' * 40)
print(f'f1: {f1:.4f}')
print(f'accuracy: {accuracy:.4f}')
Multinomial Naive Bayes with Word Count:
----------------------------------------
f1: 0.9340
accuracy: 0.9341

4. Testing the Model

text = [
    'I believe in jesus', 
    'Nvidia released new video card', 
    'one apple a day takes a doctor away',
    'God does not exist',
    'My monitor supports HDR',
    'Vitamins are essential for your health and development'
]
y_pred = model_tfidf.predict(text)
for i in range(len(y_pred)):
    print(f'"{target_categories[y_pred[i]]:<22}" ==> "{text[i]}"')
"soc.religion.christian" ==> "I believe in jesus"
"comp.graphics         " ==> "Nvidia released new video card"
"sci.med               " ==> "one apple a day takes a doctor away"
"soc.religion.christian" ==> "God does not exist"
"comp.graphics         " ==> "My monitor supports HDR"
"sci.med               " ==> "Vitamins are essential for your health and development"