Email Classification with Naive Bayes

This project classifies emails from the Enron Email Dataset as spam or not-spam (ham) using a Naive Bayes model.

File Contents and Purpose

  1. data
    • ham: contains all the ham emails for training.
    • spam: contains all the spam emails for training.
    • testing: contains a mixture of ham and spam emails for testing.
  2. naivebayes.py: contains functions that learn the model's parameters from the training data and measure the model's performance on the testing data.
  3. util.py: contains functions that parse the data files.
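
The learning and evaluation steps described for naivebayes.py can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes a multinomial Naive Bayes model with Laplace smoothing, emails already split into lists of word tokens, and at least one training email per class. The names train, classify, accuracy and alpha are hypothetical and do not come from naivebayes.py or util.py.

```python
import math
from collections import Counter


def train(ham_docs, spam_docs, alpha=1.0):
    """Estimate log-priors and smoothed per-word log-likelihoods.

    ham_docs and spam_docs are lists of token lists (hypothetical input format).
    alpha is the Laplace smoothing constant (an assumption of this sketch).
    """
    counts = {"ham": Counter(), "spam": Counter()}
    for doc in ham_docs:
        counts["ham"].update(doc)
    for doc in spam_docs:
        counts["spam"].update(doc)

    vocab = set(counts["ham"]) | set(counts["spam"])
    n_docs = len(ham_docs) + len(spam_docs)
    log_prior = {
        "ham": math.log(len(ham_docs) / n_docs),
        "spam": math.log(len(spam_docs) / n_docs),
    }

    log_likelihood = {}
    for label, counter in counts.items():
        # Smoothed denominator: total word count plus alpha per vocabulary word.
        total = sum(counter.values()) + alpha * len(vocab)
        log_likelihood[label] = {
            word: math.log((counter[word] + alpha) / total) for word in vocab
        }
    return log_prior, log_likelihood


def classify(doc, log_prior, log_likelihood):
    """Return the label ("ham" or "spam") with the highest log-posterior score."""
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for word in doc:
            # Words never seen during training are skipped in this sketch.
            if word in log_likelihood[label]:
                score += log_likelihood[label][word]
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(docs, labels, log_prior, log_likelihood):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(
        classify(doc, log_prior, log_likelihood) == label
        for doc, label in zip(docs, labels)
    )
    return correct / len(docs)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, and the smoothing term keeps a word seen in only one class from driving the other class's probability to zero.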

Dataset

The data is a preprocessed version of the Enron email database. See V. Metsis, I. Androutsopoulos and G. Paliouras, “Spam Filtering with Naive Bayes – Which Naive Bayes?” Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.