SVM for spamassassin statistical classifier plugin

Spamassassin-GSoC

I am currently a GSoC student under the ASF, working on the SpamAssassin project. Apache SINGA will be used in this project to develop neural nets for spam detection.

Repo description

Dataset

This directory contains separate folders of sample spam and ham mails in mbox format, which the user can use to train the SVM and neural network models.
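As a rough illustration of how the mbox samples could be turned into labelled training text, here is a minimal sketch that uses only Python's standard `mailbox` module. The function names `message_text` and `load_mbox`, and the convention of 1 for spam and 0 for ham, are assumptions made for this example; they are not taken from the repository's scripts.

```python
import mailbox
from email.message import Message


def message_text(msg: Message) -> str:
    """Return the concatenated text/plain parts of one message."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # The declared charset is unknown to Python; fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def load_mbox(path: str, label: int) -> list[tuple[str, int]]:
    """Read every message in an mbox file and pair its text with a label."""
    box = mailbox.mbox(path, create=False)  # raise instead of creating a new file
    try:
        return [(message_text(msg), label) for msg in box]
    finally:
        box.close()
```

Calling `load_mbox` once per spam file with label 1 and once per ham file with label 0, then concatenating the results, would give a single list of `(text, label)` pairs to train on.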

Jupyter_Notebooks

It contains the Jupyter notebooks for the SVM and Keras models. For better visualisation and parameter tweaking, users are advised to run the Jupyter notebooks.

Pickled_models

Sample pickled models that can be used directly for classification.

Spamassassin_files

This is the heart of the project. It contains the following files:

  1. svm.cf - The configuration file needed by the plugin. Add it to the /etc/mail/spamassassin directory.

  2. svm.pre - This file is read before the .cf files and is used to load the plugin. Place it in the /etc/mail/spamassassin directory.

  3. svm.pm - The Perl plugin code. Add it to the /usr/local/share/perl5 directory.

  4. svm_learn.py - The Python script that takes the path of the dataset as an argument and dumps the pickled models that the plugin uses for classification.

  5. svm_python_call.py - This script is called by the .pm file. It takes the mail as an argument and returns the spamminess of the mail.
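The classification step performed on the Python side can be sketched roughly as follows. This is not the code of svm_python_call.py. It assumes the pickled object behaves like a scikit-learn pipeline, meaning its `decision_function` accepts a list of raw texts and returns one score per text, with higher scores meaning "more spam-like". The names `load_model`, `body_text` and `spamminess` are hypothetical.

```python
import email
import pickle
from email import policy


def load_model(path):
    """Unpickle a trained model. Only unpickle files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def body_text(raw_mail: str) -> str:
    """Extract the preferred text body (plain first, then HTML) from a raw mail."""
    msg = email.message_from_string(raw_mail, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return body.get_content() if body is not None else ""


def spamminess(model, raw_mail: str) -> float:
    """Score one mail.

    Assumption: `model` exposes a scikit-learn-style decision_function
    that takes a list of texts and returns one score per text.
    """
    return float(model.decision_function([body_text(raw_mail)])[0])
```

A command-line wrapper would load the model once with `load_model`, read the mail passed in by the Perl plugin, and print the result of `spamminess` so that the plugin can parse it from standard output.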

Project status

Original Goals

  1. Development of an effective SA plugin with various statistical classifiers for spam classification.

  2. Integration of the plugins in SA.

  3. Proper documentation and relevant tests for the plugin.

Achieved Goals

  1. A basic plugin with two classifiers (SVM and neural net) has been developed.

  2. The plugin was successfully integrated with SA locally.

  3. The code has been documented.

Future goals

  1. Extend the scope of the classifiers to other sections of MIME-format mail, namely attachments and relevant headers.

  2. Add dynamic functionality that lets the plugin learn the correct classification of incorrectly classified mails.

  3. Extend the functionality so that the plugin classifies incoming mail in shades of spamminess.

  4. Write an effective test file that covers the “Perl calling Python” aspect of the plugin.

  5. Decide on the best score range the plugin should assign once the rule is hit.

  6. Add functionality that lets the user test the plugin on a given dataset to measure the model’s effectiveness.

  7. Make the neural net compatible with CPU-only machines.

  8. Hopefully, merge the code into the next major release of SA.
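Future goal 6 (testing the plugin on a given dataset) could start from something like the sketch below. It is only an illustration of the standard confusion-matrix metrics. The `evaluate` function and its arguments are hypothetical and not part of the plugin. `classify` stands for any callable that returns True when a text is judged to be spam.

```python
from typing import Callable, Iterable


def evaluate(classify: Callable[[str], bool],
             samples: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute basic spam-filter metrics over (text, is_spam) pairs."""
    tp = fp = tn = fn = 0
    for text, is_spam in samples:
        predicted = classify(text)
        if predicted and is_spam:
            tp += 1  # spam correctly flagged
        elif predicted and not is_spam:
            fp += 1  # ham wrongly flagged as spam
        elif not predicted and is_spam:
            fn += 1  # spam that slipped through
        else:
            tn += 1  # ham correctly passed
    total = tp + fp + tn + fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```

For a mail filter the false positive rate is usually the figure to watch most closely, since a legitimate mail marked as spam tends to cost the user more than a spam mail that gets through.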