Spam Filter

Part 1 (40%):

Your program classifies the testing set with an accuracy significantly higher than random within 30 minutes.

- Use very simple data preprocessing so that the emails can be read into the Naive Bayes classifier (remove everything other than words from the emails).
- Write a simple multinomial Naive Bayes classifier, or use an implementation from a library of your choice.
- Classify the data.
- Report your results with a metric (e.g. accuracy) and a method (e.g. cross-validation) of your choice.
- Choose a baseline and compare your classifier against it.
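The Part 1 steps can be sketched in plain Python as follows. This is a minimal illustration, not the repository's implementation: the `tokenize` helper, the `MultinomialNB` class, the Laplace smoothing parameter `alpha`, the majority-class baseline, and the toy emails are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very simple preprocessing: keep lower-cased alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class MultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.word_counts.values())
        class_counts = Counter(labels)
        self.log_prior = {
            c: math.log(class_counts[c] / len(labels)) for c in self.classes
        }
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def log_scores(self, doc):
        """Unnormalised log P(class) + sum of log P(word | class)."""
        words = [w for w in tokenize(doc) if w in self.vocab]
        scores = {}
        for c in self.classes:
            denom = self.total[c] + self.alpha * len(self.vocab)
            scores[c] = self.log_prior[c] + sum(
                math.log((self.word_counts[c][w] + self.alpha) / denom)
                for w in words
            )
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data, invented purely for illustration.
train_docs = ["win money now", "cheap money offer", "meeting at noon",
              "project meeting notes", "lunch at noon"]
train_labels = ["spam", "spam", "ham", "ham", "ham"]
test_docs = ["cheap money", "noon meeting"]
test_labels = ["spam", "ham"]

model = MultinomialNB().fit(train_docs, train_labels)
nb_predictions = [model.predict(d) for d in test_docs]

# Baseline: always predict the most frequent training class.
majority_class = Counter(train_labels).most_common(1)[0][0]
baseline_predictions = [majority_class] * len(test_docs)

nb_accuracy = accuracy(test_labels, nb_predictions)
baseline_accuracy = accuracy(test_labels, baseline_predictions)
```

On a real corpus the same `fit` and `predict` calls would be wrapped in a cross-validation loop, and the reported figure would be the mean accuracy over the folds rather than a single split.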

Part 2 (30%):

- Use some smart feature processing techniques to improve the classification results.
- Compare the classification results with and without these techniques.
- Analyse how the classification results depend on the parameters (if available) of the chosen techniques.
- Compare (statistically) your results against any other algorithm of your choice (you can use any library); compare and contrast the results, and ensure a fair comparison.
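As a rough sketch of what Part 2 might involve, the snippet below shows one possible feature-processing step (stop-word removal plus word n-grams, both controlled by parameters) and one possible statistical comparison (an exact McNemar test on the predictions of two classifiers over the same test set). The brief does not prescribe either technique; the function names, the tiny stop-word list, and the example predictions are assumptions made for illustration.

```python
import math
import re

# Deliberately tiny, illustrative stop-word list (an assumption, not a standard one).
STOP_WORDS = frozenset({"the", "a", "an", "and", "to", "of", "in", "is", "at"})


def extract_features(text, remove_stop_words=True, ngram_max=1):
    """Tokenise, optionally drop stop words, and append n-grams up to ngram_max."""
    words = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    features = list(words)
    for n in range(2, ngram_max + 1):
        features += ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return features


def mcnemar_exact_p(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test.

    Only examples where exactly one classifier is correct are informative;
    under the null hypothesis they split 50/50 between the two classifiers.
    """
    only_a = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    only_b = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


features = extract_features("Win the money at noon", ngram_max=2)

# Invented predictions: classifier A is right on all 10 examples, B on only 4.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_value = mcnemar_exact_p(y_true, pred_a, pred_b)
```

Because both classifiers are scored on exactly the same examples, a paired test of this kind is one way to keep the comparison fair. Sweeping `ngram_max` and `remove_stop_words` and recording the accuracy for each setting gives the parameter analysis the brief asks for.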

Part 3 (30%):

- Calibration (15%): calibrate the Naive Bayes probabilities so that they result in a low mean squared error.
- Naive Bayes extension (15%): modify the algorithm in some interesting way (e.g. weighted Naive Bayes).
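For the calibration item, one simple option is histogram binning: group the predicted spam probabilities into equal-width bins and replace each prediction with the observed fraction of spam in its bin. The sketch below assumes that approach and measures mean squared error as the Brier score; the brief does not name a calibration method, so the helper names, the bin count, and the example probabilities are all hypothetical. The extension item is not covered here.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def bin_index(p, n_bins):
    """Index of the equal-width bin containing p (p == 1.0 goes in the last bin)."""
    return min(int(p * n_bins), n_bins - 1)


def fit_binning(probs, labels, n_bins=5):
    """Learn one calibrated probability per bin: the fraction of positives in it."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        i = bin_index(p, n_bins)
        sums[i] += y
        counts[i] += 1
    # Empty bins fall back to the bin midpoint.
    return [
        sums[i] / counts[i] if counts[i] else (i + 0.5) / n_bins
        for i in range(n_bins)
    ]


def calibrate(p, bin_values):
    return bin_values[bin_index(p, len(bin_values))]


# Invented, over-confident scores of the kind Naive Bayes tends to produce.
raw_probs = [0.99, 0.98, 0.97, 0.96, 0.01, 0.02, 0.03, 0.04]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

bin_values = fit_binning(raw_probs, labels, n_bins=5)
calibrated_probs = [calibrate(p, bin_values) for p in raw_probs]

mse_before = brier_score(raw_probs, labels)
mse_after = brier_score(calibrated_probs, labels)
```

The example fits and evaluates on the same eight points only to stay short. In practice the bins would be fitted on a held-out calibration set, otherwise the improvement in mean squared error is optimistic.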

cg14823/SpamFilter
