Requirements:
- pip install numpy
- pip install pandas
- pip install matplotlib
- pip install scipy
- python 3.6.3
As before, the stripping of single quotes, conversion to lowercase, and all other preprocessing steps from the Assignment 2 preprocess.py are applied. Only the first 6500 entries are considered. In addition, one-hot encoding is performed after sorting the category values, using the pandas framework. Note that the last category of each feature is dropped as the reference level; it is implied when all of the remaining dummy columns are 0.
Can be run using the command -
python preprocess-assg3.py
Outputs the mapped vectors for each of the classes.
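For illustration, a minimal sketch of how the sorted one-hot encoding with a dropped reference level might look in pandas (the CSV names and the categorical column list are hypothetical, not the exact preprocess-assg3.py code):

    import pandas as pd

    # Minimal sketch; file names and column names are assumptions.
    df = pd.read_csv('dating.csv').head(6500)              # only the first 6500 entries
    for col in ['gender', 'race', 'race_o', 'field']:      # hypothetical categorical columns
        dummies = pd.get_dummies(df[col], prefix=col)      # columns come out in sorted order
        # Drop the last category: it is the reference level, implied when the rest are all 0.
        df = pd.concat([df.drop(columns=col), dummies.iloc[:, :-1]], axis=1)
    df.to_csv('trainingSet.csv', index=False)              # output file name is assumed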
There are two major functions here: one that invokes the LogisticRegression class and one that invokes the SupportVectorMachines class. They perform the operations stated in the question using the required hyperparameters, namely the regularization factor lambda and the learning rate. A tolerance is used as well, and the number of epochs is capped at 500, so multiple stopping conditions are in place. All operations are performed with L2 regularization.
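As a rough sketch of what the LogisticRegression training loop might look like under these hyperparameters (the function name, default values, and label encoding are assumptions, not the actual lr_svm.py code):

    import numpy as np

    def train_logistic(X, y, lam=0.01, lr=0.01, tol=1e-6, max_epochs=500):
        """Gradient descent for L2-regularized logistic regression.
        X: (n, d) feature matrix, y: (n,) labels in {0, 1}."""
        w = np.zeros(X.shape[1])
        for epoch in range(max_epochs):                    # hard cap of 500 epochs
            p = 1.0 / (1.0 + np.exp(-X.dot(w)))            # sigmoid predictions
            grad = X.T.dot(p - y) + lam * w                # log-loss gradient plus L2 penalty
            w_new = w - lr * grad
            if np.linalg.norm(w_new - w) < tol:            # tolerance-based stopping condition
                return w_new
            w = w_new
        return w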
Can be run using the command
python lr_svm.py trainingSet.csv testSet.csv 1
for Logistic Regression and
python lr_svm.py trainingSet.csv testSet.csv 2
for Linear SVM
The output provided is the Train and Test accuracy.
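For the linear SVM option (mode 2), a similar sketch with a hinge-loss sub-gradient update, under the same caveats (names and defaults are illustrative, not the actual lr_svm.py code):

    import numpy as np

    def train_svm(X, y, lam=0.01, lr=0.5, tol=1e-6, max_epochs=500):
        """Sub-gradient descent for an L2-regularized linear SVM (hinge loss).
        X: (n, d) feature matrix, y: (n,) labels in {-1, +1}."""
        n = X.shape[0]
        w = np.zeros(X.shape[1])
        for epoch in range(max_epochs):
            violating = (y * X.dot(w)) < 1                 # points inside or past the margin
            grad = lam * w - X[violating].T.dot(y[violating]) / n
            w_new = w - lr * grad
            if np.linalg.norm(w_new - w) < tol:
                return w_new
            w = w_new
        return w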
This part brings back the NBC classifier from Assignment 2 and preprocesses the data using the same approach as before. Again, only the first 6500 rows are considered. A bin size of 5 is used for binning, and discretization is applied instead of one-hot encoding. 20% of the data is used as the test set and the rest as the training set. Note that Naive Bayes uses trainingSet_NBC.csv, while LR and SVM continue to use trainingSet.csv.
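A minimal sketch of this NBC preprocessing, assuming "bin size of 5" means five equal-width bins per continuous column (the CSV names, column list, and random seed are hypothetical):

    import pandas as pd

    df = pd.read_csv('dating.csv').head(6500)                 # first 6500 rows only
    for col in ['age', 'age_o', 'importance_same_race']:      # hypothetical continuous columns
        df[col] = pd.cut(df[col], bins=5, labels=False)       # discretize into bins 0..4
    test = df.sample(frac=0.2, random_state=47)               # 20% held out as the test set
    train = df.drop(test.index)
    train.to_csv('trainingSet_NBC.csv', index=False)
    test.to_csv('testSet_NBC.csv', index=False)               # test file name is assumed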
Can be run using the command
python cv.py
Note that this operation takes time, so print statements are included to show progress.
The output includes a graph, learningCurves.png, containing the three learning curves. Each curve plots training-set size against accuracy, computed using 10-fold cross validation. Further, for hypothesis testing, hypothesisTests.py uses the accuracy arrays produced by cv.py.
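A rough sketch of how such a 10-fold cross-validation learning curve could be computed (the fold count matches the description above; the helper names, seed, and training fractions are hypothetical and not taken from cv.py):

    import numpy as np

    def learning_curve(train_df, fractions, train_fn, score_fn, k=10, seed=18):
        """Average k-fold accuracy at each training fraction.
        train_fn / score_fn are hypothetical callbacks wrapping NBC, LR, or SVM."""
        idx = np.random.RandomState(seed).permutation(len(train_df))
        folds = np.array_split(idx, k)                        # 10 folds by default
        curve = []
        for frac in fractions:
            accs = []
            for i in range(k):
                held_out = folds[i]                           # one fold kept for evaluation
                rest = np.concatenate([folds[j] for j in range(k) if j != i])
                subset = rest[:int(frac * len(rest))]         # train on only a fraction
                model = train_fn(train_df.iloc[subset])
                accs.append(score_fn(model, train_df.iloc[held_out]))
            curve.append(np.mean(accs))
        return curve

    # Plotting one curve per model (NBC, LR, SVM) with matplotlib and calling
    # plt.savefig('learningCurves.png') produces the graph described above.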
With a significance level of 0.01 (99% confidence) applied to the p-value, hypothesisTests.py prints "Null Hypothesis Rejected; Alternative Hypothesis accepted."
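A paired t-test over the per-fold accuracies is one common way to implement such a test; the sketch below assumes that choice (the function and argument names are hypothetical, and the actual hypothesisTests.py may use a different test):

    from scipy import stats

    def compare_models(acc_a, acc_b, alpha=0.01):
        """acc_a, acc_b: per-fold accuracy arrays produced by cv.py (hypothetical names)."""
        t_stat, p_value = stats.ttest_rel(acc_a, acc_b)       # paired t-test across the folds
        if p_value < alpha:                                    # significance level 0.01 (99%)
            print('Null Hypothesis Rejected; Alternative Hypothesis accepted.')
        else:
            print('Failed to reject the Null Hypothesis.')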
Occasionally, warnings about division by zero are thrown. These are explicitly suppressed in the code, but the suppression can be commented out if required.
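The suppression is presumably done with something along these lines (the exact mechanism in the code may differ, e.g. it might use the warnings module instead):

    import numpy as np
    np.seterr(divide='ignore', invalid='ignore')   # silences the division-by-zero warnings
    # Comment out the line above to see the warnings again.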