Spam classifier using Support Vector Machines

The training data

The training data is in the files train and train-small, and the unlabeled data is in the file test.

Usage

The classifier is implemented with the CVX package for convex optimization, and its results were then compared with classification using LibSVM for MATLAB.
Each MATLAB script is documented and explains what it does.
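
This README does not spell out the optimization problem handed to CVX. Assuming the standard soft-margin linear SVM (an assumption, not confirmed by the scripts), the convex program would look like the following, where x_i are the feature vectors, y_i in {-1, +1} are the spam labels, and C is a hypothetical regularization weight:

```latex
\min_{w,\,b,\,\xi}\quad \tfrac{1}{2}\lVert w\rVert_2^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
```

A new message x would then be labeled by the sign of w^T x + b.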

Theory & Performance

Possible improvements

Use capitalization data - right now we are using lowercased data, but anecdotally spam seems more likely to be in all caps [ shouting, spurious offers, etc. ].
Use punctuation - the classifier doesn't really use punctuation. This is most likely a mistake, because spam seems to contain a lot of weird punctuation and ASCII art.
Search for keywords - just tokenizing the comment isn't enough, because a lot of spam comments look like "pleasecheckoutmyfacebookpageatwwwfacebookcom/blah" and contain no word boundaries to split on.
Most of the features used in twitter-sentiment-analysis can be used here as well.
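
The first three ideas above could be prototyped roughly as follows. This is a minimal sketch in Python rather than the MATLAB used by the scripts in this repository; the function name `extra_features`, the feature names, and the `KEYWORDS` list are all hypothetical and chosen only for illustration. The features have to be computed on the raw comment, before any lowercasing.

```python
import re
import string

# Hypothetical keyword list; not taken from this repository.
KEYWORDS = ("facebook", "www", "checkout", "free", "click")


def extra_features(comment, keywords=KEYWORDS):
    """Return capitalization, punctuation, and keyword features for one comment."""
    letters = [c for c in comment if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    punct = sum(1 for c in comment if c in string.punctuation)
    words = re.findall(r"[A-Za-z]+", comment)
    lowered = comment.lower()
    return {
        # Fraction of letters that are uppercase (shouting).
        "upper_ratio": upper / len(letters) if letters else 0.0,
        # Number of words written entirely in capitals.
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        # Fraction of characters that are ASCII punctuation.
        "punct_ratio": punct / len(comment) if comment else 0.0,
        # Substring search, so keywords glued into one long token still match.
        "keyword_hits": sum(1 for k in keywords if k in lowered),
    }
```

Because the keyword check is a substring search rather than a token match, it would still fire on run-together text such as the example comment above. The resulting values could be appended to the existing feature vector before training.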

Contributing

Fork it!
Create your branch: `git checkout -b my-new-feature`
Commit your changes: `git commit -m 'Added some features'`
Push to the branch: `git push origin my-new-feature`
Submit a pull request :)

Credits

Devansh Dalal (@devanshdalal)