SpamVsHam-classifier-using-Logistic-Regresion

A spam vs. ham classifier that labels a message as spam or ham using logistic regression, implemented from scratch.

SETTING UP THE ENVIRONMENT

To run the classifier, I recommend installing Anaconda for Python 3.7, which provides the basic dependencies. You can also create a conda environment from the environment.yml file in the repository. The command to create the environment from environment.yml is:

conda env create -f environment.yml

The first line of the .yml file sets the name of the environment. Activate the environment with:

conda activate {env_name}

This .yml file installs the main dependencies required for machine learning and deep learning.

To install any additional package, replace package_name with the name of the package:

pip install package_name

STEPS TO CREATE THE FILTER

1. DATA PREPROCESSING:

In this step, we imported the Python libraries needed for linear algebra, data preprocessing, and natural language processing.

The libraries imported are:

numpy
pandas
string
scikit-learn
Natural Language Toolkit (NLTK)
matplotlib.pyplot
scipy

The dataset was imported and processed in the following sequence:

Cleaning: removing punctuation and stopwords
Stemming: reducing words to their roots
TF-IDF vectorization: building a message-by-word matrix in which each word is weighted by how often it appears in a message and how rare it is across all messages
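
The three preprocessing steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code in this repository: the stopword set is a tiny hand-picked stand-in for the NLTK English list, crude_stem is a naive suffix stripper standing in for a real stemmer, and tfidf_matrix uses the textbook weight tf * log(N / df), which differs from the smoothed formula scikit-learn applies by default. All function names and the sample messages are hypothetical.

```python
import math
import string
from collections import Counter

# Tiny illustrative stopword set (assumption); NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "is", "to", "you", "your", "for", "and", "of", "in", "at"}


def clean(message):
    """Lowercase the message, strip punctuation, and drop stopwords."""
    no_punct = message.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOPWORDS]


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tfidf_matrix(token_lists):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of term j in message i."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    n_docs = len(token_lists)
    # Document frequency: the number of messages that contain each term.
    doc_freq = Counter(token for tokens in token_lists for token in set(tokens))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty messages
        rows.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, rows


# Made-up sample messages, for illustration only.
messages = [
    "Free prizes!!! Claim your prize now",
    "Are you coming to the meeting?",
    "Claim the free meeting room",
]
tokens = [[crude_stem(word) for word in clean(message)] for message in messages]
vocabulary, matrix = tfidf_matrix(tokens)
```

Each row of matrix corresponds to one message and each column to one entry of vocabulary. A word that occurs in every message gets weight 0 under this formula, which is why smoothed variants are often preferred in practice.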

2. LOGISTIC REGRESSION CLASSIFIER

The steps to code the logistic regression classifier from scratch are:

Define the logistic function, more popularly known as the sigmoid function
Define the logistic cost function
Define the gradient descent algorithm to find the optimal parameters
Initialise the matrix of parameters, the learning rate, and the number of iterations
Split the data into a training set and a test set, and make predictions on the test set
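
The steps above can be sketched as follows. This is a minimal illustration under assumptions, not the repository's implementation: it uses plain Python lists instead of NumPy arrays so that it runs without extra packages, the function names, learning rate, iteration count, and split fraction are all hypothetical choices, and the data is a made-up one-feature toy set rather than the TF-IDF matrix.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_proba(theta, row):
    """Probability of the positive class; theta[0] is the bias term."""
    return sigmoid(theta[0] + sum(t * x for t, x in zip(theta[1:], row)))


def cost(theta, X, y, eps=1e-12):
    """Mean cross-entropy (log loss) over the data set."""
    total = 0.0
    for row, label in zip(X, y):
        # Clip the probability so log() never receives exactly 0.
        p = min(max(predict_proba(theta, row), eps), 1 - eps)
        total += -label * math.log(p) - (1 - label) * math.log(1 - p)
    return total / len(y)


def gradient_descent(X, y, learning_rate=0.5, n_iterations=500):
    """Batch gradient descent starting from all-zero parameters."""
    theta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iterations):
        grad = [0.0] * len(theta)
        for row, label in zip(X, y):
            error = predict_proba(theta, row) - label
            grad[0] += error
            for j, x in enumerate(row):
                grad[j + 1] += error * x
        theta = [t - learning_rate * g / len(y) for t, g in zip(theta, grad)]
    return theta


def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the row indices and hold out a fraction of the rows for testing."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(y) * test_fraction))
    test, train = indices[:n_test], indices[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


# Toy data (assumption): label 1 ("spam") when the single feature is positive.
X = [[i / 10] for i in range(-20, 21) if i != 0]
y = [1 if row[0] > 0 else 0 for row in X]

X_train, y_train, X_test, y_test = train_test_split(X, y)
theta = gradient_descent(X_train, y_train)
predictions = [1 if predict_proba(theta, row) >= 0.5 else 0 for row in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
```

With a real TF-IDF matrix the same loops apply, but a vectorized NumPy version would be considerably faster, since each gradient step reduces to a couple of matrix products.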