Spam Classifier built using CountVectorizer and Tf-idf Vectorizer. Source of dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset We employed Upsampling and Cross-val in our project, and built the following models:
- Naive Bayes model with imbalanced dataset, using CountVectorizer
- Naive Bayes model with imbalanced dataset, using Tf-idf Vectorizer
- Naive Bayes model with cross-validation, using CountVectorizer
- Naive Bayes model with cross-validation, using Tf-idf Vectorizer
- Decision Tree models with imbalanced dataset, cross-val, and upsampled data. (6 models in total)
For EDA, we created the following:
- Histogram of most commonly occuring words in the ham and spam messages
- Wordclouds of most commonly occurring words in the ham and spam messages
- Bar chart showing the number of spam and ham messages
We reported the f-measure and accuracy scores of each model as part of our findings in our powerpoint presentation, which is uploaded as well.