Use Bayesian classification to analyze SMS messages.
After completing this assignment, you should understand:
- The uses of classification and clustering
- Good feature extraction
After completing this assignment, you should be able to:
- Parse text into features
- Classify textual data
- Cluster textual data
- A Git repo called sms-spam containing at least:
  - a README.md file explaining how to run your project
  - a requirements.txt file
- No PEP8 or Pyflakes warnings or errors
Download the SMS Spam collection from the UCI Machine Learning Repository.
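As a starting point, here is a minimal sketch of reading the downloaded file. It assumes each line holds a label (ham or spam), a tab, and then the message text; the function name `load_sms` and the two sample lines are made up for illustration and are not part of the dataset.

```python
def load_sms(lines):
    """Split 'label<TAB>message' lines into two parallel lists."""
    labels, messages = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside a message survive.
        label, _, text = line.partition("\t")
        labels.append(label)
        messages.append(text)
    return labels, messages


# Made-up sample lines in the assumed format.
sample = [
    "ham\tSee you at lunch today\n",
    "spam\tWINNER!! Claim your prize now\n",
]
labels, messages = load_sms(sample)

# With the real file (path is an assumption):
# with open("SMSSpamCollection", encoding="utf-8") as f:
#     labels, messages = load_sms(f)
```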
Choose a set of features to use in order to separate SMS ham from spam.
Write a program to extract the features you want from each SMS message and then classify each SMS as ham or spam. Iterate on your feature extraction until you reach a classification accuracy you are comfortable with (above 75% at minimum).
You may want to start by using CountVectorizer and then possibly build your own feature extractor.
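One possible baseline looks like the sketch below. It pairs CountVectorizer with scikit-learn's MultinomialNB, which is one choice of Bayesian classifier and not the only one. The training and held-out messages are invented toy data; in the real project you would split the loaded dataset instead, for example with `train_test_split`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Invented toy data standing in for the real collection.
train_texts = [
    "WINNER! Claim your free prize now",
    "Free cash prize, call now to claim",
    "Are we still meeting for lunch today",
    "I will call you when I get home",
]
train_labels = ["spam", "spam", "ham", "ham"]

test_texts = ["free prize, claim now", "see you at lunch"]
test_labels = ["spam", "ham"]

# Turn each message into a vector of word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Fit a naive Bayes classifier on the word counts.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Reuse the fitted vocabulary on unseen messages, then score.
predictions = clf.predict(vectorizer.transform(test_texts))
accuracy = accuracy_score(test_labels, predictions)
print(f"accuracy: {accuracy:.2f}")
```

On a dataset this small the score is not meaningful; the point is only the fit, transform, predict flow.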
In addition to the requirements from Normal Mode:
Write your own feature extractor using the same interface as CountVectorizer and other scikit-learn tools so you can include it in a Pipeline. Use a Pipeline for your classifier. Bonus points if you can start from the raw text of the SMSSpamCollection file without any processing outside of a pipeline.
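A custom extractor that fits into a Pipeline might be sketched as follows. The class name `FunctionFeaturizer`, the two helper functions, and the toy messages are all hypothetical; BernoulliNB is used here only because the features are yes/no flags. Subclassing `BaseEstimator` and `TransformerMixin` supplies `get_params` and `fit_transform`, so the class only needs `fit` and `transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline


def has_money_symbol(text):
    """1 if the message contains a currency symbol (symbol set is an assumption)."""
    return int(any(ch in "$£€" for ch in text))


def has_digit(text):
    """1 if the message contains any digit."""
    return int(any(ch.isdigit() for ch in text))


class FunctionFeaturizer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column per feature function."""

    def __init__(self, featurizers=()):
        # Store constructor arguments unchanged, as scikit-learn expects.
        self.featurizers = featurizers

    def fit(self, X, y=None):
        # Nothing to learn; the features are fixed functions.
        return self

    def transform(self, X):
        rows = [[f(text) for f in self.featurizers] for text in X]
        return np.array(rows, dtype=float)


pipeline = Pipeline([
    ("features", FunctionFeaturizer((has_money_symbol, has_digit))),
    ("classifier", BernoulliNB()),
])

# Invented toy messages.
texts = ["Claim £500 now", "You won $1000, call 555",
         "See you at lunch", "Running late, sorry"]
labels = ["spam", "spam", "ham", "ham"]

pipeline.fit(texts, labels)
predicted = pipeline.predict(["Win $50 today", "ok see you soon"])
```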
Some features you might want to try:
- Presence of the following words: claim, winner
- Presence of money symbols
- Presence of numbers
- Presence of first-person words
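The features listed above could be computed with plain Python along these lines. This is only a sketch: the function names, the set of money symbols, and the list of first-person words are assumptions you should adjust.

```python
import re

SPAM_WORDS = {"claim", "winner"}
MONEY_SYMBOLS = set("$£€¥")  # assumed set of currency symbols
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}  # assumed word list


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def extract_features(text):
    """Return one boolean per feature for a single message."""
    words = set(tokenize(text))
    return {
        "has_spam_word": bool(words & SPAM_WORDS),
        "has_money_symbol": any(ch in MONEY_SYMBOLS for ch in text),
        "has_number": any(ch.isdigit() for ch in text),
        "has_first_person": bool(words & FIRST_PERSON),
    }


spam_like = extract_features("WINNER! Claim your £500 prize")
ham_like = extract_features("I will meet you at noon")
```

A dictionary per message like this can be handed to scikit-learn's DictVectorizer, or converted to a list of numbers by hand.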