Analysis and Prediction of the Spambase Dataset with Support Vector Machines, Naive Bayes, and K-Nearest Neighbors

Spam Filter

Spam filter using Support Vector Machines, k-NN, and Naive Bayes

Assignment

Write a spam filter using discriminative and generative classifiers. Use the Spambase dataset, which already represents spam/ham messages as a bag-of-words representation over a dictionary of 48 highly discriminative words and 6 characters. The first 54 features correspond to word/symbol frequencies; ignore features 55-57; feature 58 is the class label (1 = spam, 0 = ham).
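As a minimal sketch of the column layout described above, the following splits comma-separated, Spambase-style rows into the 54 frequency features and the class label. The helper name `split_features_and_label` and the in-memory sample rows are illustrative assumptions, not part of the project; the only thing assumed about the data is 58 numeric columns per row.

```python
import csv
import io

N_FEATURES = 54   # word/symbol frequency columns (features 1-54)
LABEL_INDEX = 57  # feature 58 (0-based index 57) is the class label


def split_features_and_label(rows):
    """Split parsed rows into (X, y), dropping features 55-57.

    `rows` is an iterable of sequences with 58 numeric fields each.
    This helper is hypothetical and only illustrates the column layout.
    """
    X, y = [], []
    for row in rows:
        if len(row) != 58:
            raise ValueError(f"expected 58 columns, got {len(row)}")
        X.append([float(v) for v in row[:N_FEATURES]])
        y.append(int(float(row[LABEL_INDEX])))
    return X, y


# Tiny synthetic example: two rows with made-up values, not real data.
sample = "\n".join(
    ",".join(str(v) for v in [0.1 * i] * 54 + [1.0, 2.0, 3.0, i % 2])
    for i in range(2)
)
X, y = split_features_and_label(csv.reader(io.StringIO(sample)))
```

With the real dataset, the same function could be fed a `csv.reader` opened on the data file instead of the synthetic string.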

  1. Perform SVM classification using linear, polynomial of degree 2, and RBF kernels over the TF/IDF representation. Can you transform the kernels to make use of angular information only (i.e., no length)? Are they still positive definite kernels?

  2. Classify the same data also through a Naive Bayes classifier for continuous inputs, modelling each feature with a Gaussian distribution, resulting in the following model:
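    $$p(y = k) = \alpha_k$$
    $$p(x \mid y = k) = \prod_{i} \frac{1}{\sqrt{2\pi\sigma^2_{ki}}} \exp\left(-\frac{(x_i - \mu_{ki})^2}{2\sigma^2_{ki}}\right)$$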
    where α_k is the frequency of class k, and μ_ki, σ^2_ki are the means and variances of feature i given that the data is in class k.

  3. Perform k-NN classification with k=5.
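For the question in item 1 about using angular information only, one common construction is to rescale every input to unit length before evaluating the kernel. This is offered as a sketch and is not necessarily the approach taken in `solve.py`. On unit vectors the linear kernel becomes the cosine of the angle, the degree-2 polynomial kernel becomes a function of that cosine, and the RBF kernel depends only on ‖x − y‖² = 2 − 2·cos θ. Because each wrapped kernel is just the original kernel evaluated on transformed inputs, it remains positive (semi)definite. The function names and the `coef0` and `gamma` defaults below are illustrative assumptions.

```python
import math


def unit_normalize(x):
    """Scale x to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm == 0.0 else [v / norm for v in x]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def linear_kernel(a, b):
    return dot(a, b)


def poly2_kernel(a, b, coef0=1.0):
    # Degree-2 polynomial kernel; coef0 is an assumed offset parameter.
    return (dot(a, b) + coef0) ** 2


def rbf_kernel(a, b, gamma=1.0):
    # gamma is an assumed bandwidth parameter.
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))


def angular(kernel):
    """Wrap a kernel so that it only ever sees unit-length inputs."""
    def wrapped(a, b, **params):
        return kernel(unit_normalize(a), unit_normalize(b), **params)
    return wrapped


x = [3.0, 4.0]
x_scaled = [30.0, 40.0]  # same direction, ten times the length
z = [1.0, 0.0]
for k in (linear_kernel, poly2_kernel, rbf_kernel):
    ak = angular(k)
    # Rescaling an input leaves the angular kernel unchanged (up to rounding).
    assert math.isclose(ak(x, z), ak(x_scaled, z))
```

One caveat: the angle is undefined for an all-zero feature vector, so such rows need a convention of their own; here they are simply left unchanged.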

Provide the code, the models fitted on the training set, and their respective performance under 10-fold cross-validation. Explain the differences between the three models.
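The evaluation protocol can be sketched without any dependencies as follows, using the Gaussian Naive Bayes model from item 2 as the classifier. This is a hedged illustration, not the code in `solve.py`: the function names, the unstratified shuffle-and-deal fold assignment, the fixed seeds, and the small `var_floor` added to each variance are all assumptions, and the toy data is synthetic.

```python
import math
import random


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate class frequency alpha_k and per-feature mean/variance per class."""
    model = {}
    for k in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == k]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # var_floor (an assumption) avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(zip(*rows), means)
        ]
        model[k] = (n / len(y), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the largest log joint probability."""
    def log_joint(params):
        alpha, means, variances = params
        return math.log(alpha) + sum(
            -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
            for v, m, s2 in zip(x, means, variances)
        )
    return max(model, key=lambda k: log_joint(model[k]))


def k_fold_indices(n, k=10, seed=0):
    """Shuffle range(n) and deal the indices into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Mean accuracy over k folds, training on the other k-1 folds each time."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)


# Synthetic, well-separated toy data (not Spambase) to exercise the code.
rng = random.Random(1)
X_toy = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)] + \
        [[rng.gauss(8, 1), rng.gauss(8, 1)] for _ in range(50)]
y_toy = [0] * 50 + [1] * 50
accuracy = cross_val_accuracy(X_toy, y_toy, fit_gaussian_nb, predict_gaussian_nb)
```

Any other `fit`/`predict` pair, such as a 5-nearest-neighbour classifier, could be passed to `cross_val_accuracy` in the same way, which keeps the fold assignment identical across the three models being compared.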

How to start the Application

  • Install the required packages:
    pip3 install numpy
    pip3 install pandas
    pip3 install matplotlib
    pip3 install scikit-learn
  • Change into the main directory of the project
  • Run python3 solve.py