The training data is in the files `train` and `train-small`, while the unlabeled data is in the file `test`.
The classifier is implemented with the CVX package for convex optimization, and its results are compared with classification done using LibSVM for MATLAB.
Each MATLAB script is documented and explains what it does.
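
For orientation, one common way to pose a linear classifier as a convex program that CVX can solve is the soft-margin SVM primal. This is only a sketch of a standard formulation, not a statement of what the scripts actually solve: they may use the dual or a kernel instead. The symbols are assumptions introduced here: $x_i$ is the feature vector of comment $i$, $y_i \in \{-1, +1\}$ is its label, $\xi_i$ is a slack variable, and $C > 0$ is the regularization trade-off.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert_2^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n.
```

A new comment $x$ would then be labeled by the sign of $w^\top x + b$.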
- Use capitalization data: right now we are using lowercased data, but anecdotally spam seems more likely to be in all caps (shouting, spurious offers, etc.).
- Use punctuation: the classifier doesn't really use punctuation. This is most likely a mistake, because spam seems to contain a lot of weird punctuation and ASCII art.
- Search for keywords: just tokenizing the comment isn't enough, because a lot of spam comments look like "pleasecheckoutmyfacebookpageatwwwfacebookcom/blah".
- Most of the features used in twitter-sentiment-analysis can be reused.
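
The first three ideas above could be prototyped along the following lines. This is a minimal sketch in Python rather than the repository's MATLAB code; the function name `extra_features`, the feature names, and the `KEYWORDS` list are all hypothetical and chosen only for illustration.

```python
import string

# Hypothetical keyword list; a real one would be chosen from the training data.
KEYWORDS = ("facebook", "checkout", "www", "subscribe")


def extra_features(comment: str) -> dict:
    """Compute capitalization, punctuation, and keyword features for one comment."""
    letters = [c for c in comment if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    punct = sum(1 for c in comment if c in string.punctuation)
    lowered = comment.lower()
    return {
        # Share of letters that are uppercase (must be computed before lowercasing).
        "caps_ratio": upper / len(letters) if letters else 0.0,
        # Share of characters that are ASCII punctuation.
        "punct_ratio": punct / len(comment) if comment else 0.0,
        # Substring search, so keywords are found even when words are run together.
        "keyword_hits": sum(1 for k in KEYWORDS if k in lowered),
    }
```

Because the keyword check is a substring search rather than a token match, it still fires on run-together text such as the example comment above. The resulting values could be appended as extra columns to the existing feature matrix.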
- Fork it!
- Create your branch: `git checkout -b my-new-feature`
- Commit your changes: `git commit -m 'Added some features'`
- Push to the branch: `git push origin my-new-feature`
- Submit a pull request :)
- Devansh Dalal (@devanshdalal)