/Spam-Classification-Project

Spam classifier built as the end-of-semester project for my Intro to AI class. We were a group of four people; I worked on all the Naive Bayes models.

Primary language: Jupyter Notebook

spam-classifier

Spam classifier built using CountVectorizer and Tf-idf Vectorizer. The dataset comes from Kaggle (https://www.kaggle.com/uciml/sms-spam-collection-dataset). We used upsampling and cross-validation in our project, and built the following models:

  • Naive Bayes model with imbalanced dataset, using CountVectorizer
  • Naive Bayes model with imbalanced dataset, using Tf-idf Vectorizer
  • Naive Bayes model with cross-validation, using CountVectorizer
  • Naive Bayes model with cross-validation, using Tf-idf Vectorizer
  • Decision Tree models with the imbalanced dataset, cross-validation, and upsampled data (6 models in total)
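
The setup described above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: it assumes scikit-learn, uses `MultinomialNB` as the Naive Bayes variant (the README does not say which one was used), and runs on a handful of made-up messages rather than the Kaggle dataset. The helper name `mean_cv_f1` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample

# Made-up stand-ins for the SMS data (assumption: not taken from the dataset).
ham = [
    "see you at lunch",
    "call me when you get home",
    "are we still on for tonight",
    "thanks for the notes",
    "running late be there soon",
    "happy birthday have a great day",
]
spam = [
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent call now to claim your free phone",
]

# Upsample the minority class (spam) with replacement until it matches ham.
spam_up = resample(spam, replace=True, n_samples=len(ham), random_state=0)
texts = ham + spam_up
labels = [0] * len(ham) + [1] * len(spam_up)  # 0 = ham, 1 = spam


def mean_cv_f1(vectorizer):
    """Mean F1 over 3 stratified folds for vectorizer + Naive Bayes."""
    model = make_pipeline(vectorizer, MultinomialNB())
    scores = cross_val_score(model, texts, labels, cv=3, scoring="f1")
    return scores.mean()


count_f1 = mean_cv_f1(CountVectorizer())
tfidf_f1 = mean_cv_f1(TfidfVectorizer())
print(f"CountVectorizer F1: {count_f1:.2f}")
print(f"TfidfVectorizer F1: {tfidf_f1:.2f}")
```

One caveat about this sketch: because the upsampling happens before the folds are split, copies of the same spam message can land in both the training and validation folds, which makes the scores optimistic. Resampling only the training portion of each fold avoids that.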

For EDA, we created the following:

  • Histogram of the most commonly occurring words in the ham and spam messages
  • Wordclouds of most commonly occurring words in the ham and spam messages
  • Bar chart showing the number of spam and ham messages
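
The counts behind charts like these can be computed with the standard library alone. The sketch below is an illustration under assumptions: the messages are made up, the tokenizer is a simple regular expression, and `top_words` is a hypothetical helper, not a function from the notebooks.

```python
import re
from collections import Counter

# Made-up (label, text) pairs standing in for the SMS dataset.
messages = [
    ("ham", "see you at lunch"),
    ("ham", "call me when you get home"),
    ("ham", "thanks see you tonight"),
    ("spam", "free prize free entry"),
    ("spam", "claim your free cash"),
]

# Number of messages per class (the data behind the bar chart).
class_counts = Counter(label for label, _ in messages)


def top_words(messages, label, n=3):
    """Return the n most frequent lowercase words in messages of one class."""
    counts = Counter()
    for msg_label, text in messages:
        if msg_label == label:
            counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)


print(class_counts)
print(top_words(messages, "spam"))
print(top_words(messages, "ham"))
```

The resulting `(word, count)` pairs can then be handed to whichever plotting or word cloud library is in use.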

We reported the F-measure and accuracy of each model as part of our findings in our PowerPoint presentation, which is also uploaded to this repository.
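
As a reminder of how the two metrics differ on imbalanced data, here is a small hand-rolled sketch. It assumes "F-measure" means the F1 score with spam as the positive class, which the README does not state; the labels and predictions are invented for illustration and are not results from the project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """F1 score: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example: 8 ham (0) and 2 spam (1) messages.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_ham = [0] * 10                      # a model that never predicts spam
some_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one hit, one miss, one false alarm

print(accuracy(y_true, always_ham), f_measure(y_true, always_ham))
print(accuracy(y_true, some_model), f_measure(y_true, some_model))
```

In this toy case both predictors have the same accuracy, but only the F-measure shows that the first one never detects any spam.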