Ş. İlker Birbil, Doğanay Özese, Mustafa Baydoğan
When the targets are correlated, one needs a prediction method that can handle vector-valued output. Multi-target learning addresses exactly this setting and is widely used in various applications. This paper introduces new variants of decision trees that can handle not only multi-target output but also constraints among the targets. We focus on customizing conventional decision trees by adjusting the splitting criteria to handle the constraints and obtain feasible predictions. We present both an optimization-based exact approach and several heuristics, along with a discussion of their respective advantages and disadvantages. To support our findings, we conduct a computational study that demonstrates and compares the results of the proposed approaches.
The details of this work are available in our paper.
The repository contains a YAML file (named `ocdt.yml`) to create the environment necessary to train the OCDT model. To create the environment, you can use the following command:
$ conda env create --file=ocdt.yml
This will create a virtual environment named `ocdt`. You should be able to see this environment if you run the following command:
$ conda env list
If `ocdt` is listed among the virtual environments, the environment has been installed successfully. You can activate it using the command below:
$ conda activate ocdt
Currently, the repository supports training the OCDT model.
To start with the runs, we need data. There are three datasets used in the repository, all of which are available in the `data` folder. Two datasets come from the literature: `scores` (file `constrained_exams.csv`) and `cars` (file `insurance_evaluation_data.csv`). Additionally, there are synthetic datasets (generated by running the `generate_class_data()` function in `library/data_generation.py`), which follow the naming format `class_df_size_<DATASET_SIZE>_targets_<NUMBER_OF_TARGETS>` according to the dataset size and the number of targets.
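As a quick sanity check that the data is in place, the sketch below loads one of the provided CSV files with pandas. The call to `generate_class_data()` is only indicated in a comment because its exact signature should be confirmed in `library/data_generation.py`.

```python
import pandas as pd

# Load one of the datasets shipped in the data folder.
scores_df = pd.read_csv("data/constrained_exams.csv")
print(scores_df.shape)
print(scores_df.head())

# Synthetic data is generated with generate_class_data(); its exact
# signature may differ, so check library/data_generation.py first.
# from library.data_generation import generate_class_data
# class_df = generate_class_data(...)
```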
The following parameters control the OCDT training runs:

- `ocdt_min_samples_split`: Minimum number of instances that a decision node should have in order to perform a split.
- `ocdt_min_samples_leaf`: Minimum number of instances that a node should have in order to become a leaf node.
- `ocdt_depth`: Maximum depth of the OCDT.
- `evaluation_method`: Evaluation metric used to calculate the gains of the split candidates. Available values are `mse` (Mean Squared Error), `mad` (Mean Absolute Deviation), and `poisson` (Poisson Deviation).
- `prediction_method`: Prediction approach used in splitting. Available values are `mean` (return the mean of the target values as the prediction), `medoid` (return the median of the target values as the prediction), and `optimal` (return the optimal solution of the optimization problem that minimizes the MSE objective).
- `prediction_method_leaf`: Prediction approach used in the leaves. Available values are `medoid` (return the median of the target values as the prediction) and `optimal` (return the optimal solution of the optimization problem that minimizes the MSE objective).
- `lagrangian_multiplier`: Relaxation penalty multiplier that multiplies the penalty term. If `0`, the relaxation heuristic is not used and the original optimization problem is solved.
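To make these parameters concrete, here is a hypothetical configuration sketch. The values are illustrative, and the exact way the parameters are set (for example, as variables in the run script) should be checked in the repository.

```python
# Hypothetical parameter settings for an OCDT run (illustrative values only).
ocdt_params = {
    "ocdt_min_samples_split": 10,        # minimum instances required to split a node
    "ocdt_min_samples_leaf": 5,          # minimum instances required in a leaf
    "ocdt_depth": 5,                     # maximum tree depth
    "evaluation_method": "mse",          # one of: "mse", "mad", "poisson"
    "prediction_method": "medoid",       # one of: "mean", "medoid", "optimal"
    "prediction_method_leaf": "optimal", # one of: "medoid", "optimal"
    "lagrangian_multiplier": 0,          # 0 disables the relaxation heuristic
}
```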
After the environment is activated, you can replicate the runs whose results are presented in the paper. For each dataset presented in the paper, the `dataset` parameter can be set to `class`, `cars`, or `scores`. To retrieve multiple results at once, some of the parameters mentioned above are collected into variables with the suffix `_list` and iterated over, as sketched below.
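The following is a minimal sketch of how such `_list` variables might be swept, assuming hypothetical list names; the actual loop and training entry point are defined in the repository's run script.

```python
from itertools import product

# Hypothetical parameter lists to sweep over (illustrative values only).
dataset_list = ["class", "cars", "scores"]
ocdt_depth_list = [3, 5, 7]
evaluation_method_list = ["mse", "mad", "poisson"]

for dataset, depth, evaluation_method in product(
    dataset_list, ocdt_depth_list, evaluation_method_list
):
    # Replace this print with the repository's training entry point.
    print(f"OCDT run: dataset={dataset}, depth={depth}, "
          f"evaluation_method={evaluation_method}")
```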
Contributions are always welcome.
If you are reporting a bug, please include:
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.