BitBoost

Fast Gradient Boosting Decision Trees with Bit-Level Data Structures

BitBoost is a gradient boosting decision tree model similar to XGBoost, LightGBM, and CatBoost. Unlike these systems, BitBoost uses bitslices to represent discretized gradients and bitsets to represent the data vectors and the instance lists, with the goal of improving learning speed.
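To make the idea concrete, here is a rough NumPy sketch of a 2-bit bitslice and a node bitset. This is a simplified illustration of the concept, not BitBoost's actual Rust internals:

import numpy as np

# Discretize gradients to 2-bit values in {0, 1, 2, 3} (simplified; the
# real system derives the discretization from the objective's gradients).
ninstances = 640
grad = np.random.randint(0, 4, size=ninstances)

# A bitslice stores bit i of every value in its own packed bit array.
plane0 = np.packbits(grad & 1)         # least-significant bits
plane1 = np.packbits((grad >> 1) & 1)  # most-significant bits

# Summing all discretized gradients reduces to population counts:
#   sum = popcount(plane0) + 2 * popcount(plane1)
total = int(np.unpackbits(plane0).sum()) + 2 * int(np.unpackbits(plane1).sum())
assert total == grad.sum()

# A tree node's instance list is a bitset; restricting the sum to the
# node's instances is a bitwise AND with each bit plane.
in_node = np.random.randint(0, 2, size=ninstances)
mask = np.packbits(in_node)
node_total = (int(np.unpackbits(plane0 & mask).sum())
              + 2 * int(np.unpackbits(plane1 & mask).sum()))
assert node_total == grad[in_node == 1].sum()

In the Rust implementation, the population counts map onto hardware popcount and AVX2 instructions, which is where much of the speed comes from.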

BitBoost outperforms the other boosting systems in terms of training time when a significant number of the input features are categorical and have only a few possible values (i.e., low cardinality). Here are some numbers:

Time (seconds):

                   Allstate  Covtype1  Covtype2  Bin-MNIST  YouTube
BitBoost Accurate       4.8      17.1      10.7        4.5     14.3
BitBoost Fast           1.0       5.4       7.2        1.9      2.5
LightGBM               12.3      24.1      21.0       24.8     35.0
XGBoost                11.5      37.0      35.3       24.7     24.9
CatBoost               82.6      58.1      52.9       16.5     33.6

Accuracy (lower is better):

                   Allstate  Covtype1  Covtype2  Bin-MNIST  YouTube
                      (MAE)  (error%)  (error%)   (error%)    (MAE)
BitBoost Accurate      1159      12.0      0.79       2.78     0.07
BitBoost Fast          1194      14.9      1.02       3.52     0.12
LightGBM               1156      11.9      0.71       2.86     0.07
XGBoost                1157      10.8      0.63       2.66     0.07
CatBoost               1167      13.1      0.91       3.23     0.11

Read the paper for more information about these experiments.

Note: this is an experimental system, and

  • BitBoost does not (yet) support multi-class classification,
  • BitBoost does not (yet) support proper multi-threading,
  • BitBoost does not (yet) effectively handle sparse features,
  • BitBoost works best for low-cardinality categorical features,
  • BitBoost can handle high-cardinality categorical and numerical features efficiently, provided that (1) there are not too many of them and (2) only coarse-grained splits are required on those features, i.e., a high sample_freq and a low max_nbins parameter value suffice (see the sketch after this note).

Specifically, BitBoost will most likely perform worse on fully numerical datasets; in that case, use LightGBM, XGBoost, or CatBoost instead.
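For such features, a configuration along the following lines keeps splits coarse. This is a hedged sketch: sample_freq and max_nbins are shown as command-line parameters further below, and whether the Python wrapper accepts them as keyword arguments is an assumption (see src/config.rs for the authoritative list):

from bitboost import BitBoostRegressor

# Hedged sketch: coarse split handling for numerical and
# high-cardinality categorical features.
bit = BitBoostRegressor(
    objective="l2",
    discr_nbits=4,     # gradients discretized to 4 bits
    max_nbins=16,      # few candidate split bins per feature -> coarse splits
    sample_freq=10,    # see src/config.rs for this parameter's exact semantics
    niterations=50,
    learning_rate=0.1)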

License

© DTAI Research Group - KU Leuven. Licensed under the Apache License 2.0.

Citing

Please cite this paper:

Devos, L., Meert, W., & Davis, J. (2019). Fast Gradient Boosting Decision Trees with Bit-Level Data Structures. In Proceedings of ECML PKDD. Springer.

Compiling

BitBoost is implemented in stable Rust and uses the standard Rust tools, cargo and rustc.

  • Make sure you have Rust 2018 edition installed, that is, Rust 1.31 or higher.
  • Clone this repository.
  • Tell rustc to generate efficient AVX2 instructions (ensure you have an AVX2-capable CPU):
    export RUSTFLAGS="-C target-cpu=native"
    
  • Compile the code:
    cargo build --release
    

Using BitBoost from Python

BitBoost is not available on PyPI just yet, but you can build and install the Python package locally on a Linux system as follows.

First, ensure you have Rust installed. Activate the Python3 environment of your liking, and run:

cd <bitboost-repo>/python
python setup.py install [--user]

Use --user if you don't have write access to your site-packages directory. Test your installation with the following code snippet:

import numpy as np
import sklearn.metrics

from bitboost import BitBoostRegressor

# Generate some categorical data
nfeatures = 5
nexamples = 10000
data = np.random.choice(np.array([0.0, 1.0, 2.0], dtype=BitBoostRegressor.numt),
                        size=(nexamples * 2, nfeatures))
target = (1.22 * (data[:, 0] > 1.0)
        + 0.65 * (data[:, 1] > 1.0)
        + 0.94 * (data[:, 2] != 2.0)
        + 0.13 * (data[:, 3] == 1.0)).astype(BitBoostRegressor.numt)

# Split the generated examples into a train and a test half
dtrain, ytrain = data[:nexamples], target[:nexamples]
dtest, ytest = data[nexamples:], target[nexamples:]

# Run BitBoost
bit = BitBoostRegressor(
    objective="l2", discr_nbits=4, max_tree_depth=5, learning_rate=0.5,
    niterations=20, categorical_features=list(range(nfeatures)))
bit.fit(dtrain, ytrain)

train_mae = sklearn.metrics.mean_absolute_error(ytrain, bit.predict(dtrain))
test_mae = sklearn.metrics.mean_absolute_error(ytest, bit.predict(dtest))

BitBoost has a scikit-learn interface. A number of examples are provided in the examples folder.

Running from the Command Line

Use the run_bitboost binary to run BitBoost from the command line:

./target/release/run_bitboost boost \
    train=/path/to/train.csv \
    test=/path/to/test.csv \
    objective=binary \
    niterations=10 \
    learning_rate=0.5 \
    metrics=binary_error,binary_loss \
    categorical_features=0,1,2,3,4 \
    sample_freq=10 \
    discr_nbits=8 \
    max_nbins=16

The command-line interface only supports CSV input files.
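To try the command above without a real dataset, you could generate toy CSV files like this. The expected CSV layout is an assumption here (feature columns first, target in the last column, with a header row); check BitBoost's data-loading code if your files are not picked up correctly:

import numpy as np

def write_csv(path, nexamples=1000, nfeatures=5, seed=0):
    # Low-cardinality categorical features in {0, 1, 2} plus a binary target.
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 3, size=(nexamples, nfeatures)).astype(np.float32)
    y = (x[:, 0] > 1.0).astype(np.float32)
    header = ",".join("f%d" % i for i in range(nfeatures)) + ",target"
    np.savetxt(path, np.column_stack([x, y]), delimiter=",",
               header=header, comments="", fmt="%g")

write_csv("/tmp/train.csv")
write_csv("/tmp/test.csv", seed=1)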

Parameters

All the parameters can be found in src/config.rs. The supported objectives are in src/objective.rs.

In Python, you can refer to the parameter documentation as follows:

import bitboost

help(bitboost.BitBoost)

Paper: Fast Gradient Boosting Decision Trees with Bit-Level Data Structures

Check out the experiments branch to see the experimental setup and the per-dataset results.