/martin-forest

Entropy-reducing Random Forest in C++

Martin Forest 1.1

by Martin Kolar, 12.2013

Train a random forest of entropy-reducing trees on a CSV input file. The forest supports multiclass classification, and the trained forest is stored in a .forest file.
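
As a rough illustration of what "entropy-reducing" can mean for a single split, the sketch below computes the Shannon entropy of a set of labels and the reduction in entropy obtained by splitting on a threshold. The function names, the integer label type, and the "value < threshold" rule are assumptions made for this example; they are not taken from the Martin Forest sources.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Shannon entropy (in bits) of a collection of class labels.
double entropy(const std::vector<int>& labels) {
    if (labels.empty()) return 0.0;
    std::map<int, std::size_t> counts;
    for (std::size_t i = 0; i < labels.size(); ++i) ++counts[labels[i]];
    double h = 0.0;
    for (std::map<int, std::size_t>::const_iterator it = counts.begin();
         it != counts.end(); ++it) {
        const double p = static_cast<double>(it->second) / labels.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Entropy reduction from sending rows with value < threshold to the left
// child and all other rows to the right child (assumed split rule).
double entropy_reduction(const std::vector<double>& values,
                         const std::vector<int>& labels,
                         double threshold) {
    assert(values.size() == labels.size());
    if (labels.empty()) return 0.0;
    std::vector<int> left, right;
    for (std::size_t i = 0; i < values.size(); ++i) {
        (values[i] < threshold ? left : right).push_back(labels[i]);
    }
    const double n = static_cast<double>(labels.size());
    const double weighted = (left.size() / n) * entropy(left) +
                            (right.size() / n) * entropy(right);
    return entropy(labels) - weighted;
}
```

A tree builder in this style would try candidate thresholds for each feature and keep the split with the largest reduction.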

Evaluate a set of unlabeled data from a CSV file, in which '?' marks an unknown value. The output .classification file is a CSV of votes: for each datapoint, the votes for each label are summed over all trees, which suits selection or ranking applications. When a node tests a feature whose value is '?', the tree is uncertain and follows both branches. Optionally, a usefulness score can be calculated for each unknown entry in each datapoint, to help decide which missing data would be most worth gathering. This score sums the importance of the feature over all trees.
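
The vote counting described above can be sketched as follows. This is a minimal illustration, not the actual forest_eval code: the Node layout, the use of NaN to represent '?', the "value < threshold" test, and the choice to add one full vote for every leaf reached are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical in-memory tree node; the real .forest layout is not shown here.
struct Node {
    bool is_leaf;
    std::size_t label;      // index of the label, used by leaves
    std::size_t feature;    // column tested, used by inner nodes
    double threshold;       // go left when value < threshold (assumed rule)
    std::size_t left;       // index of the left child in the tree vector
    std::size_t right;      // index of the right child in the tree vector
};
typedef std::vector<Node> Tree;  // node 0 is the root

// Assumed encoding of a '?' entry.
const double kUnknown = std::numeric_limits<double>::quiet_NaN();

Node make_leaf(std::size_t label) {
    Node n = {true, label, 0, 0.0, 0, 0};
    return n;
}

Node make_split(std::size_t feature, double threshold,
                std::size_t left, std::size_t right) {
    Node n = {false, 0, feature, threshold, left, right};
    return n;
}

// Walk one tree. An unknown value sends the datapoint down both branches,
// so a single tree may vote for more than one label.
void collect_votes(const Tree& tree, std::size_t index,
                   const std::vector<double>& row, std::vector<int>& votes) {
    const Node& node = tree[index];
    if (node.is_leaf) {
        assert(node.label < votes.size());
        ++votes[node.label];
        return;
    }
    const double value = row[node.feature];
    if (std::isnan(value)) {
        collect_votes(tree, node.left, row, votes);
        collect_votes(tree, node.right, row, votes);
    } else if (value < node.threshold) {
        collect_votes(tree, node.left, row, votes);
    } else {
        collect_votes(tree, node.right, row, votes);
    }
}

// Sum the votes for each label over all trees in the forest.
std::vector<int> forest_votes(const std::vector<Tree>& forest,
                              const std::vector<double>& row,
                              std::size_t num_labels) {
    std::vector<int> votes(num_labels, 0);
    for (std::size_t t = 0; t < forest.size(); ++t) {
        if (!forest[t].empty()) collect_votes(forest[t], 0, row, votes);
    }
    return votes;
}
```

With this counting rule, a datapoint whose tested value is unknown collects votes from both subtrees, so its vote total can exceed the number of trees.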

Example data is the Breast Cancer Wisconsin (Original) Data Set, available at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)

COMPILING:
make

INSTALLING:
make install

TESTING:
make test

FILE FORMAT:
Training file:
label, value1, value2, value3, ..., valueN
label, value1, value2, value3, ..., valueN

Testing file:
value1, value2, value3, ..., valueN
value1, value2, value3, ..., valueN
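
For readers writing their own tools around this format, one possible way to read a testing row is sketched below. The function name parse_row and the choice of NaN to stand for '?' are assumptions for illustration only, and the sketch does no validation of malformed fields.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Parse one comma-separated row of feature values.
// A field that is exactly '?' (after trimming) becomes NaN.
std::vector<double> parse_row(const std::string& line) {
    std::vector<double> values;
    std::stringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ',')) {
        // Trim surrounding whitespace, since fields may be written as ", value".
        const std::string::size_type first = field.find_first_not_of(" \t\r\n");
        const std::string::size_type last = field.find_last_not_of(" \t\r\n");
        const std::string token = (first == std::string::npos)
            ? std::string()
            : field.substr(first, last - first + 1);
        if (token == "?") {
            values.push_back(std::numeric_limits<double>::quiet_NaN());
        } else {
            // strtod yields 0.0 for unparseable text; real code should check this.
            values.push_back(std::strtod(token.c_str(), NULL));
        }
    }
    return values;
}
```

A training row could be handled the same way after splitting off the first field as the label.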

Output .classification file:
label1, label2, label3, ..., labelL (a list of all labels, so that the order is clear)
votes_label1, votes_label2, votes_label3, ..., votes_labelL
votes_label1, votes_label2, votes_label3, ..., votes_labelL
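
A downstream program might turn these vote rows into one predicted label per datapoint. The sketch below is one way to do that, assuming the first line holds only the label names and every later line holds one numeric vote count per label; the function names are hypothetical and ties go to the label listed first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split a line on commas and trim whitespace around each field.
std::vector<std::string> split_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ',')) {
        const std::string::size_type first = field.find_first_not_of(" \t\r\n");
        const std::string::size_type last = field.find_last_not_of(" \t\r\n");
        fields.push_back(first == std::string::npos
                             ? std::string()
                             : field.substr(first, last - first + 1));
    }
    return fields;
}

// Return, for each vote row, the label with the most votes.
std::vector<std::string> top_labels(std::istream& in) {
    std::vector<std::string> predictions;
    std::string line;
    if (!std::getline(in, line)) return predictions;  // empty input
    const std::vector<std::string> labels = split_csv_line(line);
    while (std::getline(in, line)) {
        const std::vector<std::string> fields = split_csv_line(line);
        if (fields.empty()) continue;  // skip blank lines
        assert(fields.size() == labels.size());
        std::size_t best = 0;
        for (std::size_t i = 1; i < fields.size(); ++i) {
            if (std::strtod(fields[i].c_str(), NULL) >
                std::strtod(fields[best].c_str(), NULL)) {
                best = i;
            }
        }
        predictions.push_back(labels[best]);
    }
    return predictions;
}
```

For ranking applications, the same parsed rows can instead be sorted by the vote count of a chosen label.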

For real examples, see the real-world files:
Training file: Wisconsin_Breast_Cancer_train.csv
Testing file: Wisconsin_Breast_Cancer_test_nolabels.csv
Output .classification file: breast.classification


USAGE:
training: (as of version 1.0, rows with '?' entries are ignored)
./forest_train *training_csv_file* *number_of_trees* *.forest_output_file*
./forest_train ../Wisconsin_Breast_Cancer_train.csv 5 breast.forest

evaluation:
./forest_eval *testing_csv_file* *.forest_input_file* *.classification_output_file*
./forest_eval ../Wisconsin_Breast_Cancer_test_nolabels.csv breast.forest breast.classification
or, to also write the usefulness of unknown entries:
./forest_eval *testing_csv_file* *.forest_input_file* *.classification_output_file* *usefulness_output_file*
./forest_eval ../Wisconsin_Breast_Cancer_test_nolabels.csv breast.forest breast.classification unknowns_usefulness.entropy