Ş. İlker Birbil, Doğanay Özese, Mustafa Baydoğan
When the targets are correlated, one needs a prediction method that can handle vector-valued output. Multi-target learning addresses exactly this setting and is widely used in various applications. This paper introduces new variants of decision trees that can handle not only multi-target output but also constraints among the targets. We focus on customizing conventional decision trees by adjusting the splitting criteria to handle the constraints and obtain feasible predictions. We present both an optimization-based exact approach and several heuristics, along with a discussion of their respective advantages and disadvantages. To support our findings, we conduct a computational study that demonstrates and compares the results of the proposed approaches.
The details of this work are available in our paper.
The repository contains a YAML file (named `ocdt.yml`) to create the environment necessary to train the OCDT model. To create the environment, you can use the following command:
$ conda env create --file=ocdt.yml
This will create a virtual environment named `ocdt`. You should be able to see this environment if you run the following command:
$ conda env list
If `ocdt` is listed among the virtual environments, the environment has been installed successfully. You can activate it using the command below:
$ conda activate ocdt
Currently, the repository supports training the OCDT model.
To start with the runs, we need data. There are three datasets used in the repository, all of which are available in the `data` folder. Two datasets come from the literature: `scores` (file `constrained_exams.csv`) and `cars` (file `insurance_evaluation_data.csv`). Additionally, there are synthetic datasets (generated by running the `generate_class_data()` function in `library/data_generation.py`), which follow the naming format `class_df_size_<DATASET_SIZE>_targets_<NUMBER_OF_TARGETS>` according to the dataset size and the number of targets.
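As a quick sanity check that the data is in place, the sketch below loads one of the provided CSV files with pandas. The call to `generate_class_data()` is only indicated in a comment because its exact signature should be confirmed in `library/data_generation.py`.

```python
import pandas as pd

# Load one of the datasets shipped in the data folder.
scores_df = pd.read_csv("data/constrained_exams.csv")
print(scores_df.shape)
print(scores_df.head())

# Synthetic data is generated with generate_class_data(); its exact
# signature may differ, so check library/data_generation.py first.
# from library.data_generation import generate_class_data
# class_df = generate_class_data(...)
```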
The following parameters control the OCDT training runs:

- `ocdt_min_samples_split`: Minimum number of instances that a decision node should have in order to perform a split.
- `ocdt_min_samples_leaf`: Minimum number of instances that a node should have in order to become a leaf node.
- `ocdt_depth`: Maximum depth of the OCDT.
- `evaluation_method`: Evaluation metric used to calculate the gains of the split candidates. Available values are `mse` (Mean Squared Error), `mad` (Mean Absolute Deviation), and `poisson` (Poisson Deviation).
- `prediction_method`: Prediction approach used in splitting. Available values are `mean` (return the mean of the target values as the prediction), `medoid` (return the median of the target values as the prediction), and `optimal` (return the optimal solution of the optimization problem that minimizes the MSE objective).
- `prediction_method_leaf`: Prediction approach used in the leaves. Available values are `medoid` (return the median of the target values as the prediction) and `optimal` (return the optimal solution of the optimization problem that minimizes the MSE objective).
- `lagrangian_multiplier`: Relaxation penalty multiplier that multiplies the penalty term. If `0`, the relaxation heuristic is not used and the original optimization problem is solved.
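To make these parameters concrete, here is a hypothetical configuration sketch. The values are illustrative, and the exact way the parameters are set (for example, as variables in the run script) should be checked in the repository.

```python
# Hypothetical parameter settings for an OCDT run (illustrative values only).
ocdt_params = {
    "ocdt_min_samples_split": 10,        # minimum instances required to split a node
    "ocdt_min_samples_leaf": 5,          # minimum instances required in a leaf
    "ocdt_depth": 5,                     # maximum tree depth
    "evaluation_method": "mse",          # one of: "mse", "mad", "poisson"
    "prediction_method": "medoid",       # one of: "mean", "medoid", "optimal"
    "prediction_method_leaf": "optimal", # one of: "medoid", "optimal"
    "lagrangian_multiplier": 0,          # 0 disables the relaxation heuristic
}
```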
After the environment is activated, you can replicate the runs whose results are presented in the paper. For each dataset presented in the paper, the `dataset` parameter can be set to `class`, `cars`, or `scores`. To retrieve multiple results at once, some of the parameters mentioned above are collected into variables with the suffix `_list` and iterated over, as sketched below.
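The following is a minimal sketch of how such `_list` variables might be swept, assuming hypothetical list names; the actual loop and training entry point are defined in the repository's run script.

```python
from itertools import product

# Hypothetical parameter lists to sweep over (illustrative values only).
dataset_list = ["class", "cars", "scores"]
ocdt_depth_list = [3, 5, 7]
evaluation_method_list = ["mse", "mad", "poisson"]

for dataset, depth, evaluation_method in product(
    dataset_list, ocdt_depth_list, evaluation_method_list
):
    # Replace this print with the repository's training entry point.
    print(f"OCDT run: dataset={dataset}, depth={depth}, "
          f"evaluation_method={evaluation_method}")
```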
Contributions are always welcome.
If you are reporting a bug, please include:
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.