A simple C++11 header-only library implementing the Isolation Forest algorithm. The goal of this project was to help me gain a more intuitive understanding of the algorithm, while learning C++ in the process.
For use of the library, please refer to the following example.
#include "isolationforest.hh"
#include <vector>
#include <memory>
using namespace isolationforest;
int main() {
// Populate samples
SamplePtrVector samples;
SamplePtr s1 = std::make_shared<Sample>();
SamplePtr s2 = std::make_shared<Sample>();
SamplePtr s3 = std::make_shared<Sample>();
std::vector<double> data1 = {0.17, 0.29, 0.42, 0.2};
std::vector<double> data2 = {0.3, 0.2, 0.12, 0.32};
std::vector<double> data3 = {0.8, 0.1, 0.2, 0.98};
s1->index = 1;
s2->index = 2;
s3->index = 3;
s1->data = data1;
s2->data = data2;
s3->data = data3;
samples.push_back(s1);
samples.push_back(s2);
samples.push_back(s3);
// initialize parameters
int depth = std::ceil(std::log2(samples.size()));
int nfeatures = (unsigned int) (*samples.begin())->data.size();
int nestimators = 100;
float subsampling_rate = 0.9;
// intialize the isolation forest
Forest f = Forest(depth, nfeatures, nestimators, subsampling_rate);
// Fit the isolation forest
f.createForest(samples);
// Score the samples
std::map<int, double> scores = f.scoreSamples(samples);
}
For the analysis, I will be comparing the implementation in this project to that of Sci-kit Learn [1]. The algorithm in this project is implemented as proposed in the original paper [2]. The dataset used is the Credit Card Fraud Detection
dataset available on Kaggle [3].
When analyzing the training execution time of the two libraries, the C++11 library outperforms Scikit Learn when the number of subsamples is relatively low, but is slower as the number of subsamples nears 50% of the total sample group. When training the models using 256 samples as reccomended in the original article, the C++11 library ran in an average of 0.5563
seconds while Scikit Learn ran in an average of 2.3577
seconds after 100
runs. Overall, Scikit Learn is still significantly more scalable. It should be noted that these results are not consistent across machines. My 2013 Macbook shows significantly worse results for the C++ code.
The image above shows the distributions of the anomaly scores for both libraries. It quite clear that the scores created by the scikit-learn library boasts more variety.
Although I have no immediate future plans for this project, as my familiarity with C++ develops, I may improve on the design and performance of the library.
[1] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
[2] https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest
[3] https://www.kaggle.com/mlg-ulb/creditcardfraud