by Alejandro Suárez
This is my implementation of Random Forest for classification in C++11. It accepts several parameters and is able to learn and classify data sets in CSV format. Our Random Forests can deal with both numerical and categorical attributes. Moreover, trees can be stored in a JSON file and/or visualized as .dot graphs. The implementation does very basic handling of missing values: in the training set, it substitutes them by the class median (for numeric attributes) and the per-class mode (for categorical attributes). In the test set, the missing values are substittued by the global median/mode.
This implementation has been coded mainly for experimentation purposes and it does not aim at outperforming any other algorithm (although it performs quite well).
This project has been tested in Ubuntu 16.04 with g++5.4.0 and make 4.1. It has mainly two external single header dependencies:
- Richberger's args for parsing command line arguments (all the forest parameters are specified like this).
- Lohmann's json for serializing and deserializing forests in JSON format so they can be loaded afterwards.
The headers of these dependencies are already shipped with my code, so no action is required by the user to install them.
In order to build the project, simply run make
from this (README.md
's
location) folder. This will create a folder called build and compile all
the object and binaries inside. The most important binary is
train_and_test
(the other ones are modular tests).
The program's help (shown invoking ./build/train_and_test --help
) reads as follows:
./build/train_and_test [datasetname] {OPTIONS}
Train and/or test a random forest with an arbitrary data set
OPTIONS:
-h, --help Display this help menu
-v[verbose_level],
--verbose=[verbose_level] 0: no info, 1: elapsed train time,
accuracy and feature weights, if
applicable (default); 2: stats; 3+:
options
-v[seed], --verbose=[seed] RNG seed
-l[filename], --load=[filename] Load forest from JSON, instead of
training from scratch
Train parameters
-M[ntrees], --ntrees=[ntrees] Number of trees in the ensemble
(default 10)
-F[f], --feature-bag=[f] Number of features evaluated randomly
at each split (sqrt of the number of
attributes if not specified)
-N[n], --min-split=[n] Minimum number of instances to split
(default 2)
Metric
-g, --gini Gini impurity (default)
-i, --entropy Entropy (information gain)
-e, --error Error (1 - pmax)
--cv=[cv] Cross validation (by default, no cross
validation is performed)
-j[filename], --json=[filename] Store forest in JSON format
-d[prefix], --dot=[prefix] Create dot files
datasetname Name of the data set (default iris).
"--" can be used to terminate flag options and force all following
arguments to be treated as positional options
There are mainly three use cases:
In this use case, the program is invoked with the training parameters and (optionally) with
the -j
and -d
options. It will learn an ensemble of trees using one
of the data sets inside the Data
folder (more can be added following the
same convention as the ones already included). It will not perform any
training/test split. The learned model will be stored in a JSON file (if
the -j
option is given) and as several DOT files ready for being
processed and visualized. At the end of the training process, the ranking of features
and their score (based on how often they have been chosen as splitting criterion in
a stump) is displayed. In order to use the program like this, the
--cv
and -l
flags must be ommited.
Example usage:
$ ./build/train_and_test -M3 -F4 -jhepatitis_3_4.json -dhepatitis -v3 hepatitis
Verbose: 3
Train: true
ntrees: 3
f (<= 0 means sqrt of #attributes): 4
n: 2
Metric: gini
cv: 0
save to json: hepatitis_3_4.json
dot prefix: hepatitis
rng seed: 42
data set: hepatitis
----------------------------------
#Columns: 19 (+ 1 target)
#Records: 155
Missing values: 5.67063%
Target column: Class
View? no
Attributes:
Class (categoric): pdf: {1:20.6452%,2:79.3548%}
AGE (numeric): min: 7, max: 78, mean: 41.2, stdev: 12.5253
SEX (categoric): pdf: {1:89.6774%,2:10.3226%}
STEROID (categoric): pdf: {1:49.0323%,2:50.3226%,?:0.645161%}
ANTIVIRALS (categoric): pdf: {1:15.4839%,2:84.5161%}
FATIGUE (categoric): pdf: {1:64.5161%,2:34.8387%,?:0.645161%}
MALAISE (categoric): pdf: {1:39.3548%,2:60%,?:0.645161%}
ANOREXIA (categoric): pdf: {1:20.6452%,2:78.7097%,?:0.645161%}
LIVER-BIG (categoric): pdf: {1:16.129%,2:77.4194%,?:6.45161%}
LIVER-FIRM (categoric): pdf: {1:38.7097%,2:54.1935%,?:7.09677%}
SPLEEN-PALPABLE (categoric): pdf: {1:19.3548%,2:77.4194%,?:3.22581%}
SPIDERS (categoric): pdf: {1:32.9032%,2:63.871%,?:3.22581%}
ASCITES (categoric): pdf: {1:12.9032%,2:83.871%,?:3.22581%}
VARICES (categoric): pdf: {1:11.6129%,2:85.1613%,?:3.22581%}
BILIRUBIN (numeric): min: 0.3, max: 8, mean: 1.37226, stdev: 1.21605
ALK-PHOSPHATE (numeric): min: 26, max: 295, mean: 85.6194, stdev: 61.8612
SGOT (numeric): min: 14, max: 648, mean: 83.6774, stdev: 89.2384
ALBUMIN (numeric): min: 2.1, max: 6.4, mean: 3.42323, stdev: 1.31408
PROTIME (numeric): min: 0, max: 100, mean: 35.1161, stdev: 35.1081
HISTOLOGY (categoric): pdf: {1:54.8387%,2:45.1613%}
1: [2,30.000000,2,1,2,2,2,2,1,2,2,2,2,2,1.000000,85.000000,18.000000,4.000000,?,1]
2: [2,50.000000,1,1,2,1,2,2,1,2,2,2,2,2,0.900000,135.000000,42.000000,3.500000,?,1]
...
155: [1,43.000000,1,2,2,1,2,2,2,2,1,1,1,2,1.200000,100.000000,19.000000,3.100000,42.000000,2]
----------------------------------
Data set has missing values. Using per-class median/mode imputation...
Feature ranking:
ALK-PHOSPHATE (0.164384)
PROTIME (0.136986)
ALBUMIN (0.109589)
BILIRUBIN (0.0958904)
SGOT (0.0821918)
SPIDERS (0.0684932)
AGE (0.0684932)
VARICES (0.0410959)
LIVER-BIG (0.0410959)
HISTOLOGY (0.0410959)
STEROID (0.0273973)
MALAISE (0.0273973)
ASCITES (0.0273973)
ANOREXIA (0.0273973)
SEX (0.0136986)
LIVER-FIRM (0.0136986)
FATIGUE (0.0136986)
$ ls
build Data dot forests
hepatitis1.dot hepatitis2.dot hepatitis_3_4.json hepatitis3.dot
LICENSE Makefile README.md src
In this example we train a forest with 3 trees (-M3
). At each splitting point
the learner chooses the best split candidate from 4 randomly chosen features
(-F4
). The resulting forest is stored in the hepatitis_3_4.json
file and
a dot file (with the hepatitis prefix) is created for each of the trees in the
forest, ready for being processed with Graphviz. Example of how to render a dot
file:
dot -Tpdf hepatitis1.dot -o hepatitis1.pdf
The result is a PDF with rendered graph similar to the following one (this one has been actually rendered with -Tsvg
instead of -Tpdf
):
The program evaluates a certain set of Random Forest parameters via Cross-Validation.
That is, the data is splitted into several folds. It trains as many models as folds,
each time using one of the folds as test set and the remaining ones as training data.
The accuracy and time results are averaged and printed at the end of
the evaluation. To use the program like this, the -l
flag must be ommited, and --cv
must be set to an integer value greater than 1. The -j
and -d
flags are ignored.
Let us see how it is done in a practical example:
$ ./build/train_and_test -M20 --min-split=6 --cv=5 hepatitis
Data set has missing value. Using per-class median/mode imputation in
train set and global median/mode imputation in test set.
Fold 1: accuracy = 80.6452%; elapsed(CPU) = 0.022932s
Fold 2: accuracy = 87.0968%; elapsed(CPU) = 0.010616s
Fold 3: accuracy = 77.4194%; elapsed(CPU) = 0.008732s
Fold 4: accuracy = 83.871%; elapsed(CPU) = 0.00788s
Fold 5: accuracy = 87.0968%; elapsed(CPU) = 0.00792s
Accuracy: 83.2258+-3.7619%
Elapsed: 0.011616+-0.00574434s
This trains and evaluates several random forests of size 20 via 5-fold cross validation.
Loads a previously stored forest (see first use case) and evaluates it against a
different data set. All the training parameters are ignored. It is enough to
indicate the JSON file (-l
option) and the target data set.
Next we show an example of this use case:
$ ./build/train_and_test -lhepatitis_3_4.json hepatitis
Accuracy: 94.1935%
This is the same model we trained in the first use case. Notice that we are evaluating the same data the model has been trained with, so it is only natural that the accuracy is that high. It goes without saying that the reported accuracy is meaningful only when the learner has not seen the test data (this example is just for illustration purposes).
The source code has not been fully documented.