SMS Spam Classification Model

Welcome to the SMS Spam Classification project! This repository contains the code and resources for building a machine learning model to classify SMS messages as either "spam" or "genuine." This project was developed as part of my internship with Bharath Intern in the domain of Data Science.


The goal of this project is to develop a model that can accurately classify SMS messages as spam or genuine. Spam messages can be a nuisance and even a security threat, so having an effective classification system is essential for filtering out unwanted messages.


The dataset used in this project is the SMS Spam Collection Data Set, which contains a collection of SMS messages labeled as either spam or ham.

  • Source: UCI Machine Learning Repository
  • Format: CSV file with columns label and message
    • label: Indicates whether the message is spam or genuine (ham)
    • message: The text of the SMS message
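As a minimal sketch of loading data in this format with pandas: the two rows below are made up for illustration, and the label_num column and the ham/spam to 0/1 mapping are assumptions rather than something this repository defines.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real file; these rows are made up.
sample_csv = io.StringIO(
    "label,message\n"
    "ham,See you at lunch tomorrow\n"
    "spam,WINNER! Claim your free prize now\n"
)

# With the real dataset, pass the path to the CSV file instead of sample_csv.
df = pd.read_csv(sample_csv)

# Assumed encoding: map the text labels to integers (ham -> 0, spam -> 1).
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

print(df.shape)
print(df["label"].value_counts())
```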


Preprocessing steps include:

  1. Removing Punctuation: Eliminates unnecessary punctuation marks.
  2. Converting to Lowercase: Standardizes the text to lowercase.
  3. Removing Digits: Strips out numeric characters.
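The three steps above could be sketched as a single helper. The function name clean_text is hypothetical, and collapsing leftover whitespace is an extra step assumed here for tidiness; only ASCII punctuation is removed.

```python
import string

# Translation table that deletes every ASCII punctuation character and digit.
_REMOVE = str.maketrans("", "", string.punctuation + string.digits)


def clean_text(text: str) -> str:
    """Lowercase the text, then strip punctuation and digits."""
    cleaned = text.lower().translate(_REMOVE)
    # Assumed extra step: collapse the spaces left behind by removed characters.
    return " ".join(cleaned.split())


print(clean_text("WIN 1000 cash NOW!!! Call 0800-123-456."))
# prints: win cash now call
```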


The following machine learning algorithms were used to build the classification models:

  • Naive Bayes
  • Decision Tree
  • Random Forest
  • K-Nearest Neighbors (KNN)
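One way these four algorithms might be instantiated with scikit-learn is shown below. The choice of MultinomialNB as the Naive Bayes variant and every hyperparameter value are assumptions for illustration, not settings taken from this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed variants and hyperparameters; adjust them to match the actual script.
models = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    print(name, "->", type(model).__name__)
```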


The models were trained and evaluated using the following steps:

  • Train-Test Split: Split the dataset into training, validation, and test sets.
  • TF-IDF Vectorization: Transformed text data into numerical features.
  • K-Fold Cross-Validation: Employed to ensure model reliability.
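The steps above could be wired together roughly as follows. This is a sketch on made-up toy data; the split ratios, the number of folds, and the use of a pipeline (so that TF-IDF is refit on each training fold) are assumptions, not details confirmed by this repository.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data, repeated to get enough rows; use the real messages in practice.
messages = ["free prize claim now", "win cash today",
            "see you at lunch", "call me later"] * 5
labels = ["spam", "spam", "ham", "ham"] * 5

# First hold out a test set, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(
    messages, labels, test_size=0.2, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

# Keeping TF-IDF inside the pipeline fits the vocabulary on training folds only.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# K-fold cross-validation on the training split.
cv = KFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="accuracy")
print("fold accuracies:", scores)

# Fit on the full training split and check the validation split.
pipeline.fit(X_train, y_train)
print("validation accuracy:", pipeline.score(X_val, y_val))
```

The test split is left untouched here so that it can be used once, at the end, on the model that is finally selected.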

Evaluation metrics include:

  • Accuracy
  • Precision
  • Recall
  • Classification Report
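These metrics are all available in sklearn.metrics. The sketch below uses made-up true and predicted labels and assumes "spam" is treated as the positive class.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

# Made-up true and predicted labels for eight messages.
y_true = ["spam", "ham", "ham", "spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

# Accuracy: share of all messages labeled correctly (6 of 8 here).
accuracy = accuracy_score(y_true, y_pred)
# Precision: share of predicted spam that really is spam (2 of 3 here).
precision = precision_score(y_true, y_pred, pos_label="spam")
# Recall: share of real spam that was caught (2 of 3 here).
recall = recall_score(y_true, y_pred, pos_label="spam")

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(classification_report(y_true, y_pred))
```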


Each model's performance was assessed based on accuracy, precision, and recall. The results of the cross-validation and validation steps guided the selection of the best-performing model.
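A hedged sketch of how cross-validation results might guide model selection: each candidate is scored with the same folds and the highest mean accuracy wins. The toy data, the fold count, the hyperparameters, and the use of accuracy alone as the selection criterion are all assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data; replace with the real training messages and labels.
messages = ["free prize claim now", "win cash today",
            "see you at lunch", "call me later"] * 5
labels = ["spam", "spam", "ham", "ham"] * 5

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Use the same folds for every model so the scores are comparable.
cv = KFold(n_splits=4, shuffle=True, random_state=0)
mean_scores = {
    name: cross_val_score(
        make_pipeline(TfidfVectorizer(), clf), messages, labels, cv=cv
    ).mean()
    for name, clf in candidates.items()
}

# Pick the candidate with the highest mean cross-validated accuracy.
best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("selected:", best_name)
```

For spam filtering, precision on the spam class may matter more than raw accuracy, since a genuine message flagged as spam is usually the costlier mistake; the same loop works with scoring set to a different metric.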


To use this project, follow these steps:

  1. Clone the repository
     git clone
  2. Install dependencies
     pip install pandas scikit-learn numpy
  3. Run the script
  4. Load and use the saved models
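Loading and using a saved model could look like the sketch below. It assumes the model was saved with joblib (installed alongside scikit-learn) as a pipeline that includes the TF-IDF step; the file name spam_model.joblib is hypothetical, and the stand-in model is trained on made-up data only so that the example runs on its own.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Train a tiny stand-in model on made-up data so the example is self-contained.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(
    ["free prize claim now", "win cash today",
     "see you at lunch", "call me later"],
    ["spam", "spam", "ham", "ham"],
)

# "spam_model.joblib" is a hypothetical file name, not one defined by this repository.
model_path = os.path.join(tempfile.mkdtemp(), "spam_model.joblib")
joblib.dump(model, model_path)

# Load the saved pipeline and classify a new message.
loaded = joblib.load(model_path)
prediction = loaded.predict(["claim your free prize"])
print(prediction[0])
```

Only load model files from sources you trust, since joblib files are pickle-based and can execute code when loaded.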


If the python command does not work on your system, try python3 instead.

Future Work

Potential improvements and future work include:

  • Expanding the dataset for better generalization.
  • Exploring more advanced NLP techniques and models.
  • Integrating the model into a real-time SMS filtering application.


Contributions are welcome! If you have suggestions for improvements or new features, feel free to submit a pull request or open an issue.


This project is licensed under the MIT License. See the LICENSE file for details.