Dataset: spam.csv (from Kaggle)
Algorithm: Naive Bayes Classifier
Screenshots:
- The program begins by importing the necessary libraries: `numpy`, `pandas`, and `scikit-learn`. `pandas` is used for data manipulation, while `scikit-learn` is used for machine learning.
- The program loads the dataset `spam.csv` using `pandas.read_csv()`. The data likely consists of text messages and the labels `spam` or `ham`.
- Columns that are not necessary (like unnamed columns) are dropped from the dataset.
- The columns are renamed: one column is renamed to `target` (indicating spam or ham), and the other column is renamed to `text`.
- The target labels are encoded using `LabelEncoder`, which transforms the classes `ham` and `spam` into the binary values `0` and `1`.
- The program checks for missing values using `isnull().sum()` and counts the number of duplicates using `df.duplicated().sum()`.
- Duplicates are dropped, keeping only the first occurrence.
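The loading and cleaning steps above can be sketched as follows. The column names (`v1`, `v2`, `Unnamed: 2`) follow the common layout of the Kaggle spam.csv file but are assumptions; in the real script the frame would come from `pd.read_csv("spam.csv", encoding="latin-1")` rather than the inline stand-in used here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for pd.read_csv("spam.csv", encoding="latin-1")
df = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["Hi there", "WIN a prize now!", "Hi there", "See you soon"],
    "Unnamed: 2": [None, None, None, None],
})

# Drop the unnecessary unnamed column and rename the remaining ones
df = df.drop(columns=["Unnamed: 2"])
df = df.rename(columns={"v1": "target", "v2": "text"})

# Encode ham/spam as 0/1 (LabelEncoder assigns codes in sorted order)
encoder = LabelEncoder()
df["target"] = encoder.fit_transform(df["target"])

# Check missing values and duplicates, then drop duplicate rows
print(df.isnull().sum().sum())   # total missing values
print(df.duplicated().sum())     # number of duplicate rows
df = df.drop_duplicates(keep="first")
```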
- Text data is preprocessed by converting it to lowercase, tokenizing, and removing special characters and stopwords.
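A minimal sketch of that preprocessing step: lowercase, tokenize, strip special characters, and filter stopwords. The stopword set here is a tiny illustrative subset; the real script would typically use something like `nltk.corpus.stopwords.words("english")`.

```python
import re

# Illustrative subset only; not a complete English stopword list
STOPWORDS = {"a", "an", "the", "is", "to", "you", "i"}

def preprocess(text: str) -> str:
    text = text.lower()                      # lowercase
    tokens = re.findall(r"[a-z0-9]+", text)  # tokenize, dropping special chars
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("WIN a FREE prize!! Txt to claim."))  # win free prize txt claim
```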
- The text data is then transformed into vectors using TF-IDF (Term Frequency-Inverse Document Frequency) via `TfidfVectorizer`. This converts the text into a numerical format suitable for machine learning models.
- The dataset is split into training and testing sets using `train_test_split()`. A portion (80%) of the data is used for training, and the remaining 20% is used for testing the model.
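The 80/20 split can be sketched like this; `X` and `y` are toy placeholders standing in for the TF-IDF matrix and encoded labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)
y = np.array([0, 1] * 5)

# test_size=0.2 keeps 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```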
- Three Naive Bayes classifiers are instantiated: `GaussianNB`, `MultinomialNB`, and `BernoulliNB`.
- Each of these classifiers is trained on the training data using `fit()`.
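A sketch of instantiating and fitting the three variants, with toy count-like features standing in for the TF-IDF matrix. One practical note: `GaussianNB` requires a dense array, so sparse `TfidfVectorizer` output would need `.toarray()` before fitting it.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy feature matrix and labels (stand-ins for the real TF-IDF data)
X_train = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
y_train = np.array([1, 0, 1, 0])

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)           # train each classifier
    print(name, model.predict(X_train))   # predictions on the toy data
```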
- After training, the program evaluates each classifier on the test data using:
  - Accuracy Score: the percentage of correct predictions.
  - Confusion Matrix: a matrix summarizing true positives, false positives, true negatives, and false negatives.
  - Precision Score: measures the precision of the classifier for spam prediction.
- These metrics help compare the performance of the classifiers.
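The evaluation step can be sketched with illustrative labels and predictions in place of a real model's output:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

y_test = [0, 1, 1, 0, 1]  # true labels (1 = spam)
y_pred = [0, 1, 0, 0, 1]  # hypothetical classifier output

print(accuracy_score(y_test, y_pred))    # fraction of correct predictions
print(confusion_matrix(y_test, y_pred))  # rows: true class; cols: predicted
print(precision_score(y_test, y_pred))   # TP / (TP + FP) for the spam class
```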
- After the evaluation, the program saves the trained `MultinomialNB` model and the `TfidfVectorizer` using the `pickle` library.
- These saved objects can be loaded later for real-time spam classification.
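A sketch of persisting and reloading both objects with `pickle`; the filenames and the tiny two-message training set are illustrative assumptions, not taken from the original script.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Small fitted stand-ins for the real model and vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["free prize win", "see you at noon"])
model = MultinomialNB().fit(X, [1, 0])  # 1 = spam, 0 = ham

# Save both objects (hypothetical filenames)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Later: load them back and classify a new message
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)

prediction = loaded_model.predict(loaded_vectorizer.transform(["win a free prize"]))
print(prediction)
```

Saving the vectorizer alongside the model matters: new messages must be transformed with the exact vocabulary the model was trained on.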
- Naive Bayes Classifiers: Naive Bayes is ideal for text classification problems due to its simplicity, effectiveness, and speed, especially when dealing with large text datasets.
- TF-IDF Vectorization: Converting text data into numerical form using TF-IDF is a common approach in Natural Language Processing (NLP). It helps capture the importance of words relative to their frequency in the text.
- Model Comparison: By training multiple Naive Bayes models, the program ensures the best-performing model is chosen based on accuracy and precision metrics.
This structure allows for effective detection of spam in text data with high accuracy and precision, particularly using the `MultinomialNB` classifier, which works well for discrete features like word counts.