K-NN, Naive-Bayes & Baseline classifiers for the LingSpam dataset, written in C++.

Primary language: C++. License: MIT.

MLClassifiersInCPP

Three classifiers for the LingSpam dataset, using tf-idf features, written in C++:

  1. k-NN classifier
  2. Naive Bayes classifier
  3. Baseline classifier
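The README does not say which tf-idf variant is used, so the following is only a minimal sketch under stated assumptions: term frequency is the raw count of a term in a message, and inverse document frequency is ln(N / df). The names `Document`, `SparseVector`, and `tfidf` are hypothetical and do not come from the repository.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types: a document is a list of tokens,
// a feature vector maps each term to its tf-idf weight.
using Document = std::vector<std::string>;
using SparseVector = std::map<std::string, double>;

// Assumed weighting: tf = raw count of the term in the document,
// idf = ln(N / df), where df = number of documents containing the term.
std::vector<SparseVector> tfidf(const std::vector<Document>& docs) {
    // Document frequency: count each term once per document.
    std::map<std::string, std::size_t> df;
    for (const Document& doc : docs) {
        std::set<std::string> unique(doc.begin(), doc.end());
        for (const std::string& term : unique) ++df[term];
    }

    const double n = static_cast<double>(docs.size());
    std::vector<SparseVector> vectors;
    vectors.reserve(docs.size());
    for (const Document& doc : docs) {
        SparseVector v;
        for (const std::string& term : doc) v[term] += 1.0;  // term frequency
        for (auto& entry : v) {
            const double docFreq = static_cast<double>(df.at(entry.first));
            entry.second *= std::log(n / docFreq);            // scale by idf
        }
        vectors.push_back(std::move(v));
    }
    return vectors;
}
```

With this weighting, a term that occurs in every message gets weight 0, so it does not influence the distance between two messages.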

The k-NN classifier uses either Euclidean distance or cosine similarity as its metric.
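As an illustration of the two metrics, here is a small sketch of a k-NN majority vote over dense feature vectors. It is not the repository's code: the function names, the label encoding (1 = spam, 0 = legitimate), the tie-breaking rule, and the handling of all-zero vectors are assumptions made for the example. All vectors are assumed to have the same length.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;

// Euclidean distance: smaller means more similar.
double euclideanDistance(const Vec& a, const Vec& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Cosine similarity: larger means more similar.
double cosineSimilarity(const Vec& a, const Vec& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;  // assumed convention for zero vectors
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Majority vote among the k nearest training vectors.
// Assumed encoding: 1 = spam, 0 = legitimate; ties go to legitimate.
int knnPredict(const std::vector<Vec>& train, const std::vector<int>& labels,
               const Vec& query, std::size_t k, bool useCosine) {
    std::vector<std::pair<double, int>> scored;
    scored.reserve(train.size());
    for (std::size_t i = 0; i < train.size(); ++i) {
        // Negate the similarity so that "smaller is closer" holds for both metrics.
        const double score = useCosine ? -cosineSimilarity(train[i], query)
                                       : euclideanDistance(train[i], query);
        scored.emplace_back(score, labels[i]);
    }
    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end());
    std::size_t spamVotes = 0;
    for (std::size_t i = 0; i < k; ++i) {
        if (scored[i].second == 1) ++spamVotes;
    }
    return 2 * spamVotes > k ? 1 : 0;
}
```

Negating the cosine similarity lets the same sorting step serve both metrics, since a distance ranks neighbours in ascending order while a similarity ranks them in descending order.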

The Baseline classifier is a dummy classifier that either assigns every sample the most frequent label in the training set or assigns labels at random.
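The two baseline strategies can be sketched as follows. This is a hedged illustration, not the repository's implementation: the function names and the label encoding (1 = spam, 0 = legitimate) are hypothetical, and the random strategy is assumed to draw each label with probability 0.5, which the README does not specify.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Assumed encoding: 1 = spam, 0 = legitimate.
// Returns the label that occurs most often in the training set (ties go to 0).
int mostFrequentLabel(const std::vector<int>& trainLabels) {
    std::size_t spam = 0;
    for (int label : trainLabels) {
        if (label == 1) ++spam;
    }
    return 2 * spam > trainLabels.size() ? 1 : 0;
}

// Strategy 1: every test sample receives the most frequent training label.
std::vector<int> predictMostFrequent(const std::vector<int>& trainLabels,
                                     std::size_t testSize) {
    return std::vector<int>(testSize, mostFrequentLabel(trainLabels));
}

// Strategy 2: every test sample receives a random label.
// Assumption: labels are drawn uniformly (p = 0.5), seeded for reproducibility.
std::vector<int> predictRandom(std::size_t testSize, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution coin(0.5);
    std::vector<int> predictions(testSize);
    for (int& label : predictions) label = coin(rng) ? 1 : 0;
    return predictions;
}
```

Neither strategy looks at the features, so any real classifier should beat both.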

The program has been tested on a Linux machine.

How to compile

Run the script "compile.sh". Type:

./compile.sh

How to run

  1. First, construct the dataset. Run:
./bin/construct_dataset.o
  2. Then classify the dataset with the three classifiers. Run:
./bin/Main.o

Experiment results

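| Classifier | Accuracy | Precision | Recall | Test samples | Misclassified | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|
| 10-NN (Euclidean distance) | 91.35 % | 65.75 % | 100 % | 289 | 25 | 48 | 25 | 216 | 0 |
| 1-NN (Euclidean distance) | 88.93 % | 60 % | 100 % | 289 | 32 | 48 | 32 | 209 | 0 |
| 10-NN (cosine similarity) | 93.08 % | 71.88 % | 95.83 % | 289 | 20 | 46 | 18 | 223 | 2 |
| 1-NN (cosine similarity) | 78.2 % | 37.29 % | 45.83 % | 289 | 63 | 22 | 37 | 204 | 26 |
| Naive Bayes | 96.19 % | 93.02 % | 83.33 % | 289 | 11 | 40 | 3 | 238 | 8 |
| Baseline (most frequent label strategy) | 83.4 % | 65.75 % | 0 % | 289 | 48 | 0 | 0 | 241 | 48 |
| Baseline (random labels strategy) | 46.71 % | 13.7 % | 41.67 % | 289 | 154 | 20 | 126 | 115 | 28 |

Accuracy = (TP + TN) / test samples, precision = TP / (TP + FP), recall = TP / (TP + FN), and misclassified = FP + FN. For the most-frequent-label baseline, TP = FP = 0, so precision is mathematically undefined (0/0) and the 65.75 % listed for it should not be read as a meaningful score.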