A Spam vs Ham classifier that labels a message as spam or ham using Logistic Regression, implemented from scratch.
To run the classifier, I recommend installing Anaconda for Python 3.7, which provides all the basic dependencies. You can also create a virtual environment from the environment.yml file in the repository. The command to create the virtual environment from environment.yml is:
conda env create -f environment.yml
The first line of the .yml file sets the name of the environment. Activate the environment with:
conda activate {env_name}
The .yml file takes care of all the important dependencies required for Machine Learning and Deep Learning.
To install any additional package package_name:
pip install package_name
This step imports the main Python libraries for linear algebra, data preprocessing, and Natural Language Processing:
- numpy
- pandas
- string
- scikit-learn
- Natural Language Toolkit (NLTK)
- matplotlib.pyplot
- scipy
The dataset was imported and processed in the following sequence:
- Cleaning: removing punctuation and stopwords
- Stemming: reducing words to their roots
- TF-IDF vectorization: building a message-by-word matrix in which each word is weighted by how often it appears in a message and how rare it is across all messages
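The three preprocessing steps above could look roughly like the sketch below. It is not the repository's actual code: the `clean_message` helper, the tiny `STOPWORDS` set, and the two toy messages are assumptions made for illustration, and the choice of NLTK's `PorterStemmer` and scikit-learn's `TfidfVectorizer` is a guess based on the libraries listed earlier.

```python
import string

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: a tiny illustrative stopword list; the project may use a
# fuller one (for example NLTK's English stopwords corpus).
STOPWORDS = {"a", "an", "the", "you", "are", "is", "to", "for", "at", "have"}

stemmer = PorterStemmer()


def clean_message(text):
    """Lowercase, drop punctuation and stopwords, then stem each word."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in no_punct.split() if w not in STOPWORDS]
    return " ".join(stemmer.stem(w) for w in words)


# Hypothetical toy messages, not taken from the project's dataset.
messages = [
    "Winning!!! You are winning a FREE prize, call now.",
    "Are you coming to the meeting at noon?",
]
cleaned = [clean_message(m) for m in messages]

# Each row is a message; each column is a stemmed word weighted by TF-IDF.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)

print(cleaned)
print(X.shape)  # (number of messages, vocabulary size)
```

The result `X` is a sparse matrix; calling `X.toarray()` gives a dense NumPy array if the from-scratch classifier expects one.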
The steps to code the logistic regression classifier from scratch are:
- Define the logistic function, more popularly known as the sigmoid function
- Define the logistic cost function
- Define the gradient descent algorithm to get optimum parameters
- Initialise the matrix of parameters, the learning rate, and the number of iterations
- Split the data into a training set and a test set, and make predictions on the test set.
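As a rough illustration of the steps listed above, here is a minimal NumPy sketch. It is not the repository's implementation: the function names, the synthetic two-feature data, the 160/40 split, and the 0.5 decision threshold are all assumptions chosen to keep the example self-contained. In the real project, the TF-IDF matrix and the spam/ham labels would take the place of the synthetic data.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def cost(X, y, theta):
    """Mean logistic (cross-entropy) cost for parameters theta."""
    # Clip probabilities so log(0) can never occur.
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


def gradient_descent(X, y, theta, learning_rate, n_iterations):
    """Batch gradient descent on the logistic cost."""
    m = len(y)
    for _ in range(n_iterations):
        gradient = X.T @ (sigmoid(X @ theta) - y) / m
        theta = theta - learning_rate * gradient
    return theta


# Assumption: synthetic data standing in for the TF-IDF features and labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
labels = (features[:, 0] + features[:, 1] > 0).astype(float)
X = np.hstack([np.ones((200, 1)), features])  # first column is the bias term

# Simple hold-out split; the rows are already in random order.
split = 160
X_train, X_test = X[:split], X[split:]
y_train, y_test = labels[:split], labels[split:]

# Initialise the parameters, then train.
theta0 = np.zeros(X.shape[1])
theta = gradient_descent(X_train, y_train, theta0,
                         learning_rate=0.01, n_iterations=10_000)

# Predict class 1 when the estimated probability is at least 0.5.
predictions = (sigmoid(X_test @ theta) >= 0.5).astype(float)
accuracy = np.mean(predictions == y_test)
print(f"training cost: {cost(X_train, y_train, theta):.4f}")
print(f"test accuracy: {accuracy:.2%}")
```

Because `gradient_descent` takes the learning rate and iteration count as arguments, trying other values only requires changing the call.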
In this classifier, I used a learning rate of 0.01 and 10,000 iterations, which resulted in an accuracy of 86.54%. Feel free to experiment with these two hyperparameters to get better results.