
scikit-obliquetree-classifier

Oblique Decision Tree in Python

Introduction

Decision trees are a popular machine learning method that is highly interpretable. The vast majority of publicly available decision tree libraries implement an axis-parallel version of decision trees, where the split at each decision node involves only a single feature variable. This leaves room for performance improvement, because some patterns in a dataset are learned better with oblique decision boundaries.
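As a quick illustration of this limitation (a minimal sketch using scikit-learn's axis-parallel DecisionTreeClassifier, not this package): on data whose true class boundary is a diagonal line, a single axis-parallel split fits poorly, and a staircase of many splits is needed to approximate the boundary that one oblique split would capture exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # oblique boundary: x0 + x1 = 1

# An oblique tree could separate these classes with a single hyperplane
# split; an axis-parallel tree must approximate the diagonal with a
# staircase of single-feature splits.
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X, y)
print(shallow.score(X, y))  # one axis-parallel split fits poorly
print(deep.score(X, y))     # many splits are needed to fit well
```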

In contrast to axis-parallel trees, oblique decision trees partition the feature space with half-spaces that can involve all feature variables. However, despite much research showing the strong performance of oblique decision trees, there is a lack of open-source packages that implement an oblique decision tree classification algorithm.
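Concretely, an axis-parallel node tests a single feature against a threshold (x_j <= t), while an oblique node tests which side of a hyperplane w·x + b = 0 a sample falls on, involving all features at once. A minimal sketch in plain NumPy, with a hand-picked hyperplane rather than the HHCART splitting rule:

```python
import numpy as np

X = np.array([[0.2, 0.3],    # below the diagonal x0 + x1 = 1
              [0.9, 0.8]])   # above the diagonal

# Axis-parallel split: compare one feature with a threshold.
goes_left_axis = X[:, 0] <= 0.5               # uses feature 0 only

# Oblique split: compare a weighted sum of ALL features with a threshold,
# i.e. test which side of the hyperplane w.x + b = 0 a sample falls on.
w, b = np.array([1.0, 1.0]), -1.0
goes_left_oblique = X @ w + b <= 0

print(goes_left_axis)     # [ True False]
print(goes_left_oblique)  # [ True False]
```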

This gap in technical infrastructure motivates us to write and publish a Python implementation of the HHCART algorithm (Wickramarachchi et al. 2016) for classification tasks. We hope this repository will be a handy tool for researchers and data scientists who want to leverage the greater representational power of oblique decision trees.

🚀 Features

  • A simple scikit-learn interface for oblique decision tree classifiers
  • Provides a wrapper class in HHCART_vis.py to allow for convenient tree visualization

Example

You can find a more detailed example in the Jupyter notebook example.ipynb.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from HHCART import HouseHolderCART  # oblique tree classifier
from segmentor import Gini, TotalSegmentor  # modules that determine splits

# Load training data - we use the Iris dataset as an example
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Initialize an HHCART classifier object
sgmtr = TotalSegmentor()
HHTree = HouseHolderCART(impurity=Gini(), segmentor=sgmtr, max_depth=5,
                         min_samples=4)
# max_depth: maximum depth of the decision tree
# min_samples: minimum allowed number of samples in a terminal node

# Train the classifier
HHTree.fit(x_train, y_train)

# Evaluate the classifier performance
train_score = accuracy_score(y_train, HHTree.predict(x_train))
test_score = accuracy_score(y_test, HHTree.predict(x_test))
print(f"train accuracy: {train_score:.2%}")
print(f"test accuracy: {test_score:.2%}")

🛡 License

This project is licensed under the terms of the Apache License 2.0. See LICENSE for more details.

📃 Bibliography

  1. Wickramarachchi et al., HHCART: An oblique decision tree, Computational Statistics & Data Analysis, Volume 96, 2016, Pages 12-23. (Link)
  2. Hengzhe Zhang, Oblique Decision Tree in Python, GitHub repository, 2021. (Link)

Acknowledgements

I would like to thank GitHub user hengzhe-zhang (citation 2) for generously sharing the algorithm he developed for growing oblique decision trees for regression tasks. My work generalizes his code to grow trees that can also perform classification tasks. You can refer to hengzhe-zhang's repo from here.

This project was generated with python-package-template.