- Classify messages as spam or ham, using data from a machine learning repository and scikit-learn's TfidfVectorizer.
- Load the data and preprocess it by removing stopwords and punctuation and lowercasing all characters.
- Check the actual spam and ham counts in the data, and get the top words associated with each class.
- Vectorize the text with TfidfVectorizer, since it performed better than CountVectorizer on this data.
- Fit RandomForestClassifier and MultinomialNB on the vectorized matrix and compare the results.
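
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `messages` list is hypothetical toy data standing in for the real dataset, `preprocess` is a helper name invented here, and the choice of 3-fold cross-validated accuracy as the comparison metric is an assumption.

```python
import string
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data (label, text); the real project loads a dataset instead.
messages = [
    ("spam", "WIN a FREE prize now!!! Call now to claim."),
    ("spam", "Free entry: text WIN to claim your cash prize!"),
    ("spam", "URGENT! You won free cash, call now."),
    ("ham", "Are we still meeting for lunch today?"),
    ("ham", "I will call you when I get home."),
    ("ham", "Can you send me the notes from class?"),
    ("ham", "See you at the meeting tomorrow morning."),
    ("ham", "Lunch was great, thanks for today."),
]


def preprocess(text):
    """Lowercase, strip punctuation, and drop English stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)


labels = [label for label, _ in messages]
texts = [preprocess(text) for _, text in messages]

# Count messages per class and collect the most frequent words in each class.
class_counts = Counter(labels)
top_words = {}
for cls in class_counts:
    words = Counter()
    for label, text in zip(labels, texts):
        if label == cls:
            words.update(text.split())
    top_words[cls] = words.most_common(3)

# Turn the cleaned text into a sparse TF-IDF matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Compare the two classifiers by mean cross-validated accuracy
# (assumed metric; cv=3 only because the toy data is tiny).
models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "MultinomialNB": MultinomialNB(),
}
scores = {}
for name, model in models.items():
    scores[name] = float(cross_val_score(model, X, labels, cv=3).mean())

print(class_counts)
print(top_words)
print(scores)
```

On a real dataset, fitting the vectorizer inside a `Pipeline` keeps the TF-IDF statistics from leaking across cross-validation folds; the sketch skips that to mirror the steps as listed.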