SMS Spam Classification

This project aims to build a machine learning model to classify SMS messages as either spam or ham (not spam). The dataset used is the "spam.csv" dataset.

Table of Contents

  1. Data Cleaning
  2. Exploratory Data Analysis (EDA)
  3. Text Preprocessing
  4. Model Building
  5. Model Evaluation
  6. Model Improvement
  7. Voting Classifier
  8. Conclusion

Data Cleaning

The dataset contains SMS messages with columns 'target' and 'text'. The 'Unnamed: 2', 'Unnamed: 3', and 'Unnamed: 4' columns were dropped as they contained missing values. The 'target' column was encoded as 0 for ham and 1 for spam. Duplicate rows were also removed.
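The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's exact code: the rows are made up as a stand-in for spam.csv, and the use of `map` for label encoding is an assumption.

```python
import pandas as pd

# Tiny stand-in for spam.csv (hypothetical rows, for illustration only).
raw = pd.DataFrame({
    "target": ["ham", "spam", "ham", "ham"],
    "text": ["See you at 5", "WIN a free prize now", "See you at 5", "Call me later"],
    "Unnamed: 2": [None, None, None, None],
    "Unnamed: 3": [None, None, None, None],
    "Unnamed: 4": [None, None, None, None],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop the extra columns that contain missing values.
    df = df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
    # Encode the labels: ham -> 0, spam -> 1.
    df = df.assign(target=df["target"].map({"ham": 0, "spam": 1}))
    # Remove duplicate rows, keeping the first occurrence.
    return df.drop_duplicates(keep="first").reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```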

Exploratory Data Analysis (EDA)

EDA involved analyzing the distribution of text lengths, word counts, sentence counts, and creating word clouds to visualize the most common words in spam and ham messages.
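A rough sketch of the length and count features described above. The column names (`num_characters`, `num_words`, `num_sentences`), the sample rows, and the regex-based word and sentence splitting are all assumptions made for illustration; a tokenizer library would give slightly different counts. The word frequencies from `top_words` are the same numbers a word cloud would visualize.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical sample rows; target is 0 for ham and 1 for spam.
df = pd.DataFrame({
    "target": [0, 1, 0],
    "text": [
        "See you at 5. Bring the book.",
        "WIN a free prize now! Call now!",
        "Call me later",
    ],
})

# Length of each message in characters.
df["num_characters"] = df["text"].str.len()
# Word count: runs of word characters.
df["num_words"] = df["text"].apply(lambda t: len(re.findall(r"\w+", t)))
# Rough sentence count: split on runs of '.', '!' or '?' and ignore empty pieces.
df["num_sentences"] = df["text"].apply(
    lambda t: len([s for s in re.split(r"[.!?]+", t) if s.strip()])
)

# Compare the average lengths of ham and spam messages.
summary = df.groupby("target")[["num_characters", "num_words", "num_sentences"]].mean()
print(summary)


def top_words(texts, n=3):
    """Return the n most frequent lowercase words across the given texts."""
    counts = Counter(w for t in texts for w in re.findall(r"[a-z]+", t.lower()))
    return counts.most_common(n)


spam_top = top_words(df.loc[df["target"] == 1, "text"])
print(spam_top)
```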

Text Preprocessing

Text preprocessing steps included converting text to lowercase, tokenization, removing special characters, punctuation, and stop words. Porter stemming was also applied to reduce words to their root forms.
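The preprocessing pipeline above could look something like this. The function name `transform_text`, the regex tokenizer, and the small `STOP_WORDS` set are assumptions for the sake of a self-contained example; a full stop-word list (for instance the one shipped with NLTK) would normally be used instead. Only the Porter stemmer comes from NLTK here, and it needs no extra data download.

```python
import re

from nltk.stem.porter import PorterStemmer

# Small illustrative stop-word set (assumption); a real run would use a full list.
STOP_WORDS = {
    "i", "am", "a", "an", "the", "is", "are", "to", "you",
    "and", "of", "in", "at", "for", "me", "my", "it",
}

stemmer = PorterStemmer()


def transform_text(text: str) -> str:
    # Lowercase and tokenize; the pattern keeps only letters and digits,
    # so punctuation and special characters are discarded.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Reduce each remaining word to its Porter stem.
    return " ".join(stemmer.stem(t) for t in tokens)


print(transform_text("I am RUNNING to the shops!!!"))
```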

Model Building

Various classifiers were trained and evaluated, including Gaussian Naive Bayes, Multinomial Naive Bayes, Support Vector Classifier, Random Forest, and more. Accuracy (computed with cross_val_score) and precision were used as evaluation metrics. I was slightly biased toward precision, since it matters most here: low precision means legitimate (ham) messages are being flagged as spam.
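A minimal sketch of comparing several classifiers on cross-validated accuracy and precision with scikit-learn. The toy corpus, the fold count, and the hyperparameters are assumptions chosen so the example runs quickly; they are not the settings used in the notebook, and the scores on such a small sample mean nothing.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC

# Toy corpus (hypothetical messages): six spam, six ham.
spam = [
    "win a free prize now", "free entry call now", "claim your cash prize",
    "urgent call to win cash", "free ringtone text now", "you won a free voucher",
]
ham = [
    "see you at lunch", "call me when you are home", "are we still meeting today",
    "thanks for the lift home", "i will be late today", "lunch at noon then",
]
texts = spam + ham
y = np.array([1] * len(spam) + [0] * len(ham))

# Dense TF-IDF matrix; GaussianNB does not accept sparse input.
X = TfidfVectorizer().fit_transform(texts).toarray()

classifiers = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=2),
}

# Precision scorer that returns 0 instead of warning when no spam is predicted.
precision = make_scorer(precision_score, zero_division=0)

results = {}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()
    prec = cross_val_score(clf, X, y, cv=3, scoring=precision).mean()
    results[name] = {"accuracy": acc, "precision": prec}
    print(f"{name}: accuracy={acc:.2f}, precision={prec:.2f}")
```

Fitting the vectorizer on all messages before cross-validation keeps the sketch short; wrapping the vectorizer and classifier in a pipeline would avoid leaking vocabulary statistics across folds.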

Model Evaluation

The trained models were evaluated using accuracy and precision scores, and the results were visualized using bar plots.
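A sketch of the kind of bar plot described above, using pandas and matplotlib. The model names and the numbers are placeholders invented for illustration; they are not results from this project.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder scores (NOT project results), sorted by precision.
scores = pd.DataFrame({
    "Algorithm": ["Model A", "Model B", "Model C"],
    "Accuracy": [0.95, 0.96, 0.97],
    "Precision": [0.99, 0.95, 0.97],
}).sort_values("Precision", ascending=False)

# Grouped bar chart: one group per model, one bar per metric.
ax = scores.set_index("Algorithm").plot.bar(ylim=(0.5, 1.0), rot=0)
ax.set_ylabel("Score")
ax.set_title("Accuracy and precision by classifier")
plt.tight_layout()
```

In a notebook the figure renders inline; in a script, call `plt.show()` or `plt.savefig(...)` afterwards.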

Model Improvement

Several techniques were tried to improve model performance: changing max_features in the TF-IDF vectorizer, scaling the features, and adding the number of characters in each message as an extra feature.
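The three adjustments above can be combined roughly like this. The `max_features` value, the choice of `MinMaxScaler`, and the order of the steps are assumptions for illustration. A min-max scaler is a natural fit when Multinomial Naive Bayes is among the models, because that classifier expects non-negative feature values.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical messages, for illustration only.
texts = [
    "win a free prize now",
    "free entry call now",
    "see you at lunch",
    "call me when you are home",
]

# Keep only the most frequent terms; 5 is an arbitrary value for this toy corpus.
tfidf = TfidfVectorizer(max_features=5)
X = tfidf.fit_transform(texts).toarray()

# Append the message length (in characters) as one extra column.
num_chars = np.array([len(t) for t in texts]).reshape(-1, 1)
X_extended = np.hstack((X, num_chars))

# Scale every column to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_extended)

print(X_scaled.shape)
```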

Voting Classifier

A voting classifier and stacking classifier were implemented to combine the predictions of multiple models, resulting in improved accuracy and precision.
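A minimal sketch of both ensembles with scikit-learn. The base models, the hard voting mode, the final estimator, and the toy corpus are assumptions for illustration rather than the notebook's actual configuration.

```python
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus (hypothetical messages): six spam (1), six ham (0).
texts = [
    "win a free prize now", "free entry call now", "claim your cash prize",
    "urgent call to win cash", "free ringtone text now", "you won a free voucher",
    "see you at lunch", "call me when you are home", "are we still meeting today",
    "thanks for the lift home", "i will be late today", "lunch at noon then",
]
labels = [1] * 6 + [0] * 6


def base_estimators():
    # Fresh, unfitted base models for each ensemble.
    return [
        ("nb", MultinomialNB()),
        ("svc", SVC()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=2)),
    ]


# Voting: each base model casts one vote and the majority label wins.
voting = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(estimators=base_estimators(), voting="hard"),
)

# Stacking: a final estimator learns from the base models' cross-validated outputs.
stacking = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=base_estimators(),
        final_estimator=RandomForestClassifier(random_state=2),
        cv=3,
    ),
)

voting.fit(texts, labels)
stacking.fit(texts, labels)

message = ["free prize, call now"]
voting_pred = voting.predict(message)
stacking_pred = stacking.predict(message)
print(voting_pred, stacking_pred)
```

Soft voting (averaging predicted probabilities) is also possible, but it requires every base model to expose `predict_proba`, which for `SVC` means setting `probability=True`.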

Conclusion