/InferDT

The code of AAAI20 paper "Efficient Inference of Optimal Decision Trees"

Primary LanguageC++GNU General Public License v3.0GPL-3.0

Efficient Inference of Optimal Decision Trees

Paper link: http://florent.avellaneda.free.fr/dl/AAAI20.pdf

The source code is designed to be compiled and executed on GNU/Linux.

Dependencies

Example of installing dependencies for Ubuntu

The following instructions have been tested on Ubuntu 19.04

sudo apt install g++ cmake minisat2 xdot libzip-dev libboost-dev

Build

cmake .
make

Running

Infer a decision tree with the algorithm DT_depth for the dataset "mouse":

./InferDT -d data/mouse.csv infer

Infer a decision tree with the algorithm DT_size for the dataset "car":

./InferDT data/car.csv infer

Run a 10-cross-validation on the dataset "mouse" with the algorithm DT_size:

./InferDT data/mouse.csv bench

Infer a decision tree with the algorithm DT_size for the training set balance-scale.csv.train1 and evaluate the model with the testing set balance-scale.csv.test1:

./InferDT data/balance-scale.csv.train1.csv infer -t data/balance-scale.csv.test1.csv

Print a help message:

./InferDT --help

Benchmarks

We ran experiments on Ubuntu with Intel Core CPU i7-2600K @ 3.40 GHz.

Verwer and Zhang Datasets

The datasets we used are extracted from the paper of Verwer and Zhang and are available at (https://github.com/SiccoVerwer/binoct).

Dataset S B Time DT_depth Accuracy DT_depth k n for DT_size Time DT_size Accuracy DT_size Accuracy BinOCT*
iris 150 114 18 ms 92.9 % 3 10.6 30 ms 93.2 % 98.4 %
monks1 124 17 24 ms 90.3 % 4.4 17 80 ms 95.5 % 87.1 %
monks2 169 17 190 ms 70.2 % 5.8 47.8 9.1 sec 74.0 % 63.3 %
monks3 122 17 30 ms 78.1 % 4.8 23.4 210 ms 82.6 % 93.5 %
wine 178 1276 600 ms 89.3 % 3 7.8 1.2 sec 92.0 % 88.9 %
balance-scale 625 20 50 sec 93.0 % 8 268 183 sec 92.6 % 77.5 %

S: Number of examples in the dataset

B: Number of Boolean features in the dataset

Time DT_depth: Time used by our algorithm with parameter "-d"

Accuracy DT_depth: Accuracy of our algorithm with parameter "-d"

k: Depth of inferred decision trees

n: Number of nodes in inferred decision trees

Time DT_size: Time used by our algorithm without parameter "-d"

Accuracy DT_size: Accuracy of our algorithm without parameter "-d"

Accuracy BinOCT: Accuracy of algorithm from https://github.com/SiccoVerwer/binoct

Mouse

We used dataset Mouse that the authors Bessiere, Hebrard and O'Sullivan shared with us. Each entry in rows DT_size and DT_depth corresponds to the average over 100 runs. The first columns correspond to the name of each algorithm used. The next three columns correspond to inferring a decision tree from the whole dataset. The last column corresponds the 10-fold cross-validations.

Algorithm Time k n Accuracy
DT2 from paper Minimising Decision Tree Size as Combinatorial Optimisation 577 sec 4 15 83.8 %
DT1 from paper Learning Optimal Decision Trees with SAT 12.9 sec 4 15 83.8 %
DT_size: our algorithm without parameter "-d" 70 ms 4 15 83.5 %
DT_depth: our algorithm with parameter "-d" 20 ms 4 31 85.8 %

Other

In this section we perform x 10-cross-validations made randomly and record the average.

Dataset S B Time DT_depth Accuracy DT_depth k n for DT_size Time DT_size Accuracy DT_size x
zoo 101 136 47 ms 91.7 % 4 20 200 ms 91 % 200
BodyMassIndex 500 172 53 sec 85 % 6.6 109 6.4 h 85.4 % 1
lungCancerDataset 59 72 3.8 ms 89.6 % 2.6 7 7.5 ms 90.6 % 200
iris 113 114 30 ms 93.6 % 3.6 14 120 ms 94.0 % 200
monks1 124 17 30 ms 97.0 % 4.9 15 140 ms 99.5 % 200
monks2 169 17 330 ms 88.0 % 6 64 9.3 sec 88.7 % 100
monks3 92 17 125 ms 79.6 % 5.8 35.5 2.5 sec 81.5 % 800
balance-scale 625 20 200 sec 71.8 % 9 276 11 h 73.6 % 1
wine 178 1276 4.5 sec 91.5 % 3 12.2 14.5 sec 91.2 % 200

S: Number of examples in the dataset

B: Number of Boolean features in the dataset

k: Average depth of inferred decision trees

n: Average number of nodes in inferred decision trees

x: Number of 10-cross-validation performed