NV_SVM_SpamFilter

A quick spam filter that uses naive Bayes and SVM classifiers


A Bayesian and SVM Spam Filter by Nicholas Shelly
Date: 3 Jan 2012
This is a quick, rudimentary spam filter that uses a few techniques, including:
- Naive Bayes classification with Laplace (additive) smoothing
- Auto-learning: retraining on a received email when its spamicity is very high or very low
- A reliable but slower Support Vector Machine (SVM) classifier that uses the most common, significant words as features
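
As a rough illustration of the first technique, here is a minimal sketch of naive Bayes scoring with Laplace smoothing. It is not the code in filter.py: the names train, token_prob and spam_probability, the alpha parameter, and the pre-tokenised input format are all assumptions made for this example, and it is written for Python 3.

```python
import math
from collections import Counter


def train(spam_docs, ham_docs, alpha=1.0):
    """Count tokens per class and return a scoring function.

    Both lists must be non-empty; each document is a list of tokens.
    alpha is the additive smoothing constant (hypothetical parameter name).
    """
    spam_counts = Counter(tok for doc in spam_docs for tok in doc)
    ham_counts = Counter(tok for doc in ham_docs for tok in doc)
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    prior_spam = len(spam_docs) / (len(spam_docs) + len(ham_docs))

    def token_prob(tok, counts, total):
        # Laplace (additive) smoothing: a token unseen in one class
        # still gets a small non-zero probability there.
        return (counts[tok] + alpha) / (total + alpha * len(vocab))

    def spam_probability(doc):
        """Return P(spam | doc) under the naive independence assumption."""
        log_spam = math.log(prior_spam)
        log_ham = math.log(1.0 - prior_spam)
        for tok in doc:
            if tok not in vocab:
                continue  # skip tokens never seen during training
            log_spam += math.log(token_prob(tok, spam_counts, spam_total))
            log_ham += math.log(token_prob(tok, ham_counts, ham_total))
        # Normalise the two log scores into a probability without overflow.
        top = max(log_spam, log_ham)
        e_spam = math.exp(log_spam - top)
        e_ham = math.exp(log_ham - top)
        return e_spam / (e_spam + e_ham)

    return spam_probability
```

Working in log space avoids underflow when an email contains many tokens, which is why the sketch sums logarithms instead of multiplying probabilities.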

For the SVM classifier, download the package from http://bit.ly/2rwxl,
add it to your Python path, and type 'make' to build the binaries.
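
To give an idea of how "the most common, significant words" could become SVM features, here is a small sketch. The selection rule (a minimum count plus a spam ratio far from 0.5), the names select_features and to_sparse_vector, and the margin parameter are assumptions for illustration; the actual rule used by filter.py is not documented here. It is written for Python 3 and does not call the SVM package itself.

```python
from collections import Counter


def select_features(spam_docs, ham_docs, min_count=20, margin=0.4):
    """Pick frequent words whose spam ratio is far from 0.5.

    Returns a dict mapping each chosen word to a 1-based feature index.
    min_count and margin are hypothetical tuning knobs.
    """
    spam_counts = Counter(tok for doc in spam_docs for tok in doc)
    ham_counts = Counter(tok for doc in ham_docs for tok in doc)
    features = []
    for tok in sorted(set(spam_counts) | set(ham_counts)):
        total = spam_counts[tok] + ham_counts[tok]
        if total < min_count:
            continue  # too rare to be a stable feature
        ratio = spam_counts[tok] / total
        if abs(ratio - 0.5) >= margin:
            features.append(tok)  # strongly associated with one class
    return {tok: i + 1 for i, tok in enumerate(features)}


def to_sparse_vector(doc, feature_index):
    """Map a tokenised email to {feature index: occurrence count}."""
    counts = Counter(tok for tok in doc if tok in feature_index)
    return {feature_index[tok]: float(n) for tok, n in counts.items()}
```

Keeping only a few thousand discriminative words keeps each vector sparse and the SVM training set small, at the cost of ignoring rare tokens.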

Sample output:
SpamFilter$ python filter.py 
Reading tarfile  data/20030228_hard_ham.tar.bz2 ... done, read 252 emails.
Reading tarfile  data/20030228_spam_2.tar.bz2   ... done, read 1399 emails.
Reading tarfile  data/20050311_spam_2.tar.bz2   ... done, read 1398 emails.
Read 3046 emails, 91.76% spam 

Top words most likely to be spam:
minder = 0.997942, 1316 occurrences
cpunks = 0.997835, 1251 occurrences
mandark = 0.997732, 1194 occurrences
einstein = 0.997664, 1159 occurrences
cypherpunks = 0.996160, 1409 occurrences
sourceforge = 0.996034, 682 occurrences
2ffont = 0.995890, 658 occurrences
3cfont = 0.995890, 658 occurrences
sightings = 0.995768, 639 occurrences

Top words least likely to be spam:
clickthru = 0.000189, 1950 occurences
lockergnome = 0.000351, 1051 occurences
comics = 0.000527, 700 occurences
dilbert = 0.000801, 461 occurences
anchordesk = 0.001288, 287 occurences
techupdate = 0.001456, 254 occurences
clear_dot = 0.001745, 212 occurences
theregister = 0.002033, 182 occurences
83a3cb = 0.002045, 181 occurences
unitedmedia = 0.002140, 173 occurences

Variables:
tao- = 0.050000
tao+ = 0.980000
Prob(Spam) = 0.450000
Spam cutoff = 0.750000
Count minimum = 20

########################################################
Naive Bayes:
67303 unique tokens
Training error:  0.0164149704531
Development error:  0.0223243598162
Auto-learning on 1425 spam and 87 ham
False positives = 0.019%
False negatives = 0.020%

########################################################
SVM (2323 features):
Building training samples...
*
optimization finished, #iter = 693
nu = 0.004494
obj = -4.422320, rho = 0.306859
nSV = 180, nBSV = 2
Total nSV = 180
Training error: 0.000656598818122
Building development samples...
Development error: 0.00919238345371
False positives = 0.005%
False negatives = 0.000%
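
The auto-learning step mentioned in the feature list can be sketched as follows. This assumes that the tao- and tao+ values in the sample output are the low and high thresholds for auto-learning, which the README does not state explicitly. The function name auto_learn and the spam_probability and retrain callbacks are hypothetical, not part of filter.py.

```python
def auto_learn(emails, spam_probability, retrain, tao_minus=0.05, tao_plus=0.98):
    """Retrain on emails the classifier is already very sure about.

    emails           -- iterable of emails in whatever form the callbacks accept
    spam_probability -- callable returning P(spam) for one email
    retrain          -- callable (email, is_spam) that updates the model
    Returns (number learned as spam, number learned as ham).
    """
    learned_spam = 0
    learned_ham = 0
    for email in emails:
        p = spam_probability(email)
        if p >= tao_plus:
            retrain(email, True)  # confidently spam
            learned_spam += 1
        elif p <= tao_minus:
            retrain(email, False)  # confidently ham
            learned_ham += 1
        # Scores between the thresholds are too uncertain to learn from.
    return learned_spam, learned_ham
```

Learning only from confident predictions limits the risk of reinforcing a wrong label, although a confident mistake would still be learned.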