/bonsai-dt

Programmable Decision Tree Framework

Primary LanguagePython

Bonsai-DT (or Bonsai in short form) is a programmable decision tree framework written in Python (and a little bit of Cython). Using Bonsai, you can quickly design/build/customize new decision tree algorithms simply by writing these two functions:

  • find_split()
  • is_leaf()

The intent of this project is to provide a quick testing bed for various decision tree ideas. Although the speed of Bonsai is comparable to other implementations, we note that the speed and scalability of the framework are not the primary focus.

Many decision trees, such as C4.5 and CART, are already implemented and available in the Bonsai templates. Even ensemble models, such as Gradient Boosting, Random Forests and PaloBoost, are readily available. For the full list of the Bonsai templates, please see the Model Templates section.

Contents

  1. Installing
  2. Background
  3. Design Functions
  4. Code Examples
  5. Model Templates
  6. Glossaries
  7. Authors
  8. License

Installing

  • Required: numpy and cython
  • Run python setup.py develop
  • NOTE: Bonsai is still under rapid development. Please use this at your own discretion.

Background

Decision tree is a recursive data partitioning algorithm. Some well-known algorithms include CART (Classification and Regression Trees), C4.5, ID3, and CHAID. Although these decision trees may look very differerent from outside, without loss of generality, the inner-workings of the most trees can be written as follows:

def decision_tree(X, y):

    # Stopping criteria
    if is_leaf(X, y):     
        return np.mean(y)

    split_variable, split_value = find_split(X, y)

    # Partition data into two sets
    X_right, X_left, y_right, y_left = partition_data(X, y, 
        split_variable, split_value)  

    # Recursive call of decision_tree()
    return {"split_variable": split_variable,
        "split_value": split_value,
        "right": decision_tree(X_right, y_right), 
        "left": decision_tree(X_left, y_left)}   

In the code above, X and y represent feature matrix and target vector, respectively. The algorithm first looks at if it should stop the recursion - is_leaf(). If not, then it searches a splitting variable and value pair to branch out further - find_split(). Based on the best pair of splitting variable and value, the dataset (X, y) is partitioned into two disjoint sets - one with smaller value than the splitting value, and the other with greather than or equal to the splitting value. Finally, the algorithm recursively repeats the above process on each partitioned dataset.

As can be seen, the behavior of a decision tree is primarily governed by the two functions:

  • find_split() decides which variable/value to split on. For example, C4.5 and ID3 use Information Gain using Shannon Entropy, and CART uses the Gini impurity measure (for classification) and the minimum variance criterion (for regression) to choose a splitting variable.
  • is_leaf() controls when to stop growing the tree. For example, a tree can stop growing if its depth is greater than 4 or its node size is smaller than 30.

Many decision tree algorithms implement these functions internally, and expose only a set of parameters through their interfaces. This approach is definitely better for users who want to try the off-the-shelf decision trees. However, if we want to design a new tree that does not exist anywhere, we need to edit the underlying source code, which may be quite complex and time-consuming. Is there a better way? That's why we developed Bonsai.

Design Functions

Rather than a set of parameters, Bonsai takes two functions as its arguments. In this section, we illustrate the details of these functions, such as arguments to pass on, output formats, and some example implementations.

find_split(avc)

The find_split function has only one argument avc. The avc variable is a numpy array that has 10 columns and variable rows. The name avc stands for Attribute-Value Classlabel Group (or AVC-group), which was first introduced in the Rainforest framework.

Here is how Bonsai works in a nutshell. Whenever we call find_split, Bonsai scans the data, and updating a couple of online statistics such as mean, second moment, count, etc. We call this process as Bonsai "sketches" the stats on canvas, where canvas represents just an empty numpy array. When the sketch is done, the empty numpy array, canvas, is filled with all relevant statistics for defining splitting criteria. Just to be clear, we call the canvas after sketch as avc. The column order of avc is as follows:

  • avc[:,0]: AVC indices
  • avc[:,1]: variable indices i.e. column indices starting from 0
  • avc[:,2]: split values i.e. samples below this value goes to the left node, and vice versa
  • avc[:,3]: number of samples at (hypothetical) left node
  • avc[:,4]: sum y at left node i.e. \sum_{left} y
  • avc[:,5]: sum of y^2 at left node i.e. \sum_{left} y^2
  • avc[:,6]: number of samples at (hypothetical) right node
  • avc[:,7]: sum of y at right node i.e. \sum_{right} y
  • avc[:,8]: sum of y^2 at right node i.e. \sum_{right} y^2
  • avc[:,9]: missing value inclusion (0: left node, 1: right node)

Essentially, the find_split function is given with the joint distribution information in the form of AVC-group. With this information, the function needs to define what properties of the distributions should paly roles in finding the best splitting variable and value.

The return of the function is a Python dictionary with the following required key value pairs:

  • selected: the selected row of the avc array
  • <additional var>: any additional variables you want to pass on

For example, if you want to design a splitting criteria that minimizes the weighted sum of variances after the split (e.g. regression tree in CART), you can write find_split as follows:

def find_split(avc):
    var_left = avc[:,5]/avc[:,3] - np.square(avc[:,4]/avc[:,3])
    var_right = avc[:,8]/avc[:,6] - np.square(avc[:,7]/avc[:,6])
    varsum = avc[:,3] * var_left  + avc[:,6] * var_right
    min_idx = np.argsort(varsum)[0]
    return {"selected": avc[min_idx,:]}

is_leaf(branch, branch_parent)

The is_leaf function takes two arguments: one for the current branch, and the other for the parent branch. Both variables are Python dictionaries. By default, these branch variables have the following key value pairs, but can have additional key value pairs if users defined <additional var> in find_split:

  • depth: the depth of the branch in the tree
  • n_samples: the number of samples in the branch

The return of this function is boolean. If the branch is a leaf, then returns True. Otherwise, returns False.

For example, if you want to design a stopping criteria that stops at depth 3, you can write is_leaf as follows:

def is_leaf(branch, branch_parent):
    if branch["depth"] > 2:
        return True
    else:
        return False

Putting Things Together

With the find_split and is_leaf written, now you can put these together to make your own decision tree as follows:

from ..core.bonsai import Bonsai
import numpy as np

class MyTree(Bonsai):

    def __init__(self):

        def find_split(avc):
            var_left = avc[:,5]/avc[:,3] - np.square(avc[:,4]/avc[:,3])
            var_right = avc[:,8]/avc[:,6] - np.square(avc[:,7]/avc[:,6])
            varsum = avc[:,3] * var_left  + avc[:,6] * var_right
            min_idx = np.argsort(varsum)[0]
            return {"selected": avc[min_idx,:]}

        def is_leaf(branch, branch_parent):
            if branch["depth"] > 2:
                return True
            else:
                return False

        Bonsai.__init__(self, find_split, is_leaf)

Now, you can use MyTree just by importing the class in your project.

Code Examples

We provide several examples of using Bonsai.

Using a Model Template

Probably the easiest and fastest way to use Bonsai is using the model templeates in Bonsai. The model templates are decision trees that are already built in Bonsai, which include regression tree, C4.5, Alpha-Tree, etc. For example, if you want to use a regression tree example:

# Pre-built regression tree using Bonsai
from bonsai.regtree import RegTree 

from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
import numpy as np

X, y = make_friedman1(n_samples=100000) 
n, m = X.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize
model = RegTree(max_depth=5),

# Fit
model.fit(X_train, y_train)

# Predict
y_hat = model.predict(X_test)

rmse = np.sqrt(np.mean((y_test - y_hat)**2))

Available model templates are listed in the Model Templates section. Also, take a look at the test scripts under the tests folder to see how to use these templates.

Running a Test Script

The best to way to get used to Bonsai is try running the test scripts. To run the test scripts, if you just go to the tests folder, and run a Python script in the folder.

$ cd tests
$ python paloboost.py 

# Test Regression
-----------------------------------------------------
 model_name     train_time     predict_time   rmse   
-----------------------------------------------------
 baseline       -              -              6.99054
 gbm            3.24681 sec    0.32504 sec    5.76867
 palobst        4.15213 sec    0.28674 sec    5.14846
 sklearn        6.36740 sec    0.07248 sec    5.68470
-----------------------------------------------------

This test script tests the PaloBoost algorithm using the Friedman's simulated data (regression task). As can be seen, the test script prints the training time, predictio tiem, and Root Mean Squared Errors (RMSE) measured on hold-out data.

Interpreting a Trained Bonsai Tree

In Bonsai, you can easily check the details of a trained tree:

from __future__ import print_function
from bonsai.base.regtree import RegTree
from sklearn.datasets import make_friedman1
import json
X, y = make_friedman1(n_samples=10000) 
model = RegTree(max_depth=1)
model.fit(X, y)
print(json.dumps(model.dump(), indent=2))

This script will output the trained tree in the form of an array of leaf nodes. Here is an example of the output:

[
  {
    "is_leaf": true, 
    "eqs": [
      {
        "svar": 3, 
        "missing": 1, 
        "sval": 0.49608099474494854, 
        "op": "<"
      }
    ], 
    "depth": 1, 
    "n_samples": 5031.0, 
    "y": 11.798433612486887, 
  }, 
  {
    "is_leaf": true, 
    "eqs": [
      {
        "svar": 3, 
        "missing": 0, 
        "sval": 0.49608099474494854, 
        "op": ">="
      }
    ], 
    "depth": 1, 
    "n_samples": 4969.0, 
    "y": 16.969889465743922, 
  }
]

As this is a depth-1 tree, we only see two leaves in the array. In each leaf, you see eqs that stores the logical rules for the leaf, and y that indicates the node value or predicted value for the leaf. All Bonsai-derived trees would have this form of output.

Model Templates

Here is the list of Bonsai templates:

Glossaries

Variable/function names that are used in the source code:

  • avc: AVC stands for Attribute-Value Classlabel Group.
  • canvas: Canvas is a numpy array that stores AVC.
  • canvas_dim: The dimension of Canvas. #rows x #columns
  • setup_canvas: Setting up a canvas for AVC.
  • sketch: Sketching refers to a process of filling AVC values to Canvas. We scan a dataset and construct statistics relevant to AVC, and fill those values to a canvas one by one. We thought the process resembles skteching in paiting - rapid drawing for setting up a blueprint.
  • erase_canvas: Erase the values in a canvas.

Authors

  • Yubin Park, PhD
  • If you want to contribute, please contact Yubin

License

Apache License 2.0