Classify and Cluster SMS Messages

Description

Use Bayesian classification to analyze SMS messages.

Objectives

Learning Objectives

After completing this assignment, you should understand:

The uses of classification and clustering
Good feature extraction

Performance Objectives

After completing this assignment, you should be able to:

Parse text into features
Classify textual data
Cluster textual data

Details

Deliverables

A Git repo called sms-spam containing at least:
- a README.md file explaining how to run your project
- a requirements.txt file

Requirements

No PEP8 or Pyflakes warnings or errors

Normal Mode

Download the SMS Spam collection from the UCI Machine Learning Repository.
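As a starting point, here is a minimal sketch of reading the data, assuming each line of the downloaded file holds a label (ham or spam), a tab, and then the message text. The function name `parse_sms_lines` and the sample lines are made up for illustration.

```python
def parse_sms_lines(lines):
    """Split 'label<TAB>text' lines into parallel lists."""
    labels, texts = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, in case the message contains tabs.
        label, text = line.split("\t", 1)
        labels.append(label)
        texts.append(text)
    return labels, texts


# Two made-up lines in the assumed layout, used instead of the real file.
sample = [
    "ham\tSee you at lunch?\n",
    "spam\tWINNER! Claim your prize now\n",
]
labels, texts = parse_sms_lines(sample)
```

With the real data you would pass an open file object instead of `sample`, since iterating over a file yields its lines.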

Choose a set of features to use in order to separate ham (legitimate messages) from spam.

Write a program that extracts your chosen features from each SMS message and then classifies each message as ham or spam. Iterate on your feature extraction until you reach a classification success rate you are comfortable with (more than 75% at minimum).

You may want to start by using scikit-learn's CountVectorizer and then possibly build your own feature extractor.
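One possible first pass is sketched below: word counts from CountVectorizer fed into a multinomial naive Bayes classifier. The choice of MultinomialNB and the tiny hand-written messages are assumptions made for illustration; they are not the real dataset and the resulting score says nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up messages standing in for the real training data.
train_texts = [
    "WINNER claim your free prize now",
    "free cash prize call now",
    "are we still meeting for lunch",
    "see you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Made-up held-out messages for checking the classifier.
test_texts = ["claim your free prize", "lunch at home"]
test_labels = ["spam", "ham"]

# Learn the vocabulary and turn each message into word counts.
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

# Reuse the fitted vocabulary for unseen messages (transform, not fit).
predicted = classifier.predict(vectorizer.transform(test_texts))
accuracy = accuracy_score(test_labels, predicted)
print(f"accuracy: {accuracy:.0%}")
```

On the real collection, you would hold back part of the messages for testing, for example with `train_test_split` from `sklearn.model_selection`, and compare the accuracy against the 75% threshold.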

Hard Mode

In addition to the requirements from Normal Mode:

Write your own feature extractor using the same interface as CountVectorizer and other scikit-learn tools so you can include it in a Pipeline. Use a Pipeline for your classifier. Bonus points if you can start from the raw text of the SMSSpamCollection file without any processing outside of a pipeline.
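A minimal sketch of what such an extractor might look like, assuming the usual scikit-learn convention of a `fit` method that returns `self` and a `transform` method that returns one row per input. The class name `SimpleSmsFeatures`, the two features, the choice of BernoulliNB, and the sample messages are all hypothetical.

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline


class SimpleSmsFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical extractor: one row of 0/1 flags per message."""

    def fit(self, X, y=None):
        # Nothing to learn; fit exists so the class works in a Pipeline.
        return self

    def transform(self, X):
        rows = [
            [
                int(bool(re.search(r"[$£€]", text))),  # money symbol
                int(bool(re.search(r"\d", text))),     # any digit
            ]
            for text in X
        ]
        return np.array(rows)


pipeline = Pipeline([
    ("features", SimpleSmsFeatures()),
    ("classifier", BernoulliNB()),
])

# Made-up messages, not taken from the real collection.
train_texts = [
    "Win £1000 cash now",
    "Claim $500 today",
    "See you at lunch",
    "Call me later",
]
train_labels = ["spam", "spam", "ham", "ham"]

pipeline.fit(train_texts, train_labels)
predicted = pipeline.predict(["You won $200", "Running late sorry"])
```

Because the extractor follows the same interface as CountVectorizer, the Pipeline accepts raw message strings directly, and either step can be swapped out without touching the other.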

Notes

Some features you might want to try:

Presence of the following words: claim, winner
Presence of money symbols
Presence of numbers
Presence of first-person words
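The features listed above could be computed along these lines. This is only a sketch: the function name `extract_features`, the list of first-person words, and the set of money symbols are assumptions you should adjust to your own data.

```python
import re

SPAM_WORDS = {"claim", "winner"}  # the words named in the list above
# Assumed list of first-person words; extend as needed.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
MONEY_SYMBOLS = "$£€"  # assumed set of money symbols


def extract_features(text):
    """Return a dict of boolean features for one SMS message."""
    # Lowercase, then keep runs of letters as words.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        "has_spam_word": bool(words & SPAM_WORDS),
        "has_money_symbol": any(s in text for s in MONEY_SYMBOLS),
        "has_number": any(char.isdigit() for char in text),
        "has_first_person": bool(words & FIRST_PERSON),
    }


spam_like = extract_features("WINNER! Claim your £500 prize")
ham_like = extract_features("I will call you later")
```

Dicts like these can be turned into a numeric matrix with scikit-learn's `DictVectorizer` before being passed to a classifier.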

IHautaI/sms-spam