/sms-spam-classifier-using-complement-naive-bayes

Machine learning model that predicts whether a message is spam or not.

In this project, I analyze the SMS Spam Collection dataset and build a machine learning model that predicts whether a message is spam or not.

💾 Project Files Description

This project contains an executable IPython notebook, a presentation, and source data, as follows:

Executable Files:

  • SMS_Spam_Classifier.ipynb - Google Colab notebook containing data summary, exploration, visualisations, text processing, modelling and performance evaluation.

Source Directory:

  • SMSSpamCollection - The SMS Spam Collection dataset used by the notebook.

📖 Problem Statement

Almost every person today owns a mobile phone with messaging and calling capabilities. Spam calls are notorious for ringing phones constantly in order to push promotional or fraudulent offers onto unsuspecting customers. With cheaper bulk messaging rates from wireless networks, much of this spam has quickly shifted to SMS. In this scenario, classifying messages becomes necessary. The objective of this project is to understand the SMS Spam Collection dataset and build a machine learning model to predict whether a message is spam or not.

📖 Approach

  1. Understanding the business task.
  2. Reading data from files given.
  3. Data pre-processing.
  4. Data visualization.
  5. Text processing.
  6. Modelling data.
  7. Conclusion.
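
Step 2 above can be sketched in a few lines of Python. This is a minimal sketch, not the notebook's actual code: it assumes the SMSSpamCollection file holds one message per line as a label (`ham` or `spam`), a tab, then the message text, and the function names `parse_sms_lines` and `load_sms` are hypothetical.

```python
from collections import Counter


def parse_sms_lines(lines):
    """Split 'label<TAB>message' lines into parallel label and message lists."""
    labels, messages = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Assumption: the first tab separates the label from the message text.
        label, _, text = line.partition("\t")
        labels.append(label)
        messages.append(text)
    return labels, messages


def load_sms(path="SMSSpamCollection"):
    """Read the dataset file; the default path is an assumption."""
    with open(path, encoding="utf-8") as handle:
        return parse_sms_lines(handle)


def class_counts(labels):
    """Count messages per class, a quick check for class imbalance."""
    return Counter(labels)
```

Calling `class_counts` on the loaded labels gives a first look at how many ham and spam messages there are, which matters later when choosing a classifier.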

📖 Text Processing

  • Stemming is used for text normalization, since reducing words to their base form matters more than preserving their exact meaning when determining whether a message is spam or not.
  • Bag-of-Words is used for feature extraction from text, since only the frequency of each word needs to be considered rather than a per-word importance weighting.
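
The two steps above can be illustrated with a small, dependency-free sketch. It is not the notebook's code: `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer (a library stemmer such as NLTK's `PorterStemmer` would normally take its place), and all function names here are hypothetical.

```python
import re
from collections import Counter

# Toy suffix rules; a real stemmer applies far more careful rules.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(token):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(message):
    """Lowercase the message and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", message.lower())


def bag_of_words(message):
    """Map each stemmed token to the number of times it occurs."""
    return Counter(crude_stem(tok) for tok in tokenize(message))


def vectorize(messages):
    """Build a sorted vocabulary and one count vector per message."""
    bags = [bag_of_words(m) for m in messages]
    vocab = sorted({tok for bag in bags for tok in bag})
    matrix = [[bag[tok] for tok in vocab] for bag in bags]
    return vocab, matrix
```

Each row of `matrix` holds raw word counts only, which is the property the Bag-of-Words choice relies on: no word is weighted as more important than another.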
📖 Modelling

  • A Complement Naive Bayes classifier was used for training because each feature is the frequency of a word in a message, and because it corrects the severe assumptions of standard Multinomial Naive Bayes that hurt performance on an imbalanced dataset.
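
To show why the complement formulation helps with imbalance, here is a minimal pure-Python sketch of the idea. It is not the notebook's implementation (a library class such as scikit-learn's `ComplementNB` would normally be used) and it omits the optional weight normalization. The function names and the smoothing default `alpha=1.0` are assumptions made for illustration.

```python
import math
from collections import Counter


def train_complement_nb(docs, labels, alpha=1.0):
    """docs: list of Counter (token -> count); labels: one class per doc.

    For each class c, word statistics are estimated from every message
    that is NOT in c, so the rare class still gets stable estimates.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    weights = {}
    for c in sorted(set(labels)):
        complement = Counter()
        for doc, y in zip(docs, labels):
            if y != c:
                complement.update(doc)
        # Additive smoothing keeps unseen words from producing log(0).
        total = sum(complement.values()) + alpha * len(vocab)
        weights[c] = {
            tok: math.log((complement[tok] + alpha) / total) for tok in vocab
        }
    return weights


def predict(weights, doc):
    """Return the class whose complement fits the message worst."""

    def complement_score(c):
        # Tokens outside the training vocabulary are ignored.
        return sum(n * weights[c][tok] for tok, n in doc.items() if tok in weights[c])

    return min(weights, key=complement_score)
```

A message is assigned to the class whose complement explains it least well, so a large ham majority does not dominate the estimates for the spam class.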
📘 Conclusion

[Image: Result]

📜 Credits

Midhun R | Avid Learner | Data Analyst | Data Scientist | Machine Learning Enthusiast

Contact me for Data Science Project Collaborations

📚 References

Image by upklyak on Freepik