BitBoost is a gradient boosting decision tree model similar to XGBoost, LightGBM, and CatBoost. Unlike these systems, BitBoost uses bitslices to represent discretized gradients and bitsets to represent the data vectors and the instance lists, with the goal of improving learning speed.
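To make the core idea concrete, here is a minimal Python sketch of bitslicing; it illustrates the technique, not BitBoost's actual Rust internals, and the helper names `to_bitslices` and `masked_sum` are ours. Discretized gradients are stored one bit-plane at a time, so summing gradients over an instance list represented as a bitset reduces to bitwise ANDs and population counts:

```python
import numpy as np

def to_bitslices(values, nbits):
    """Store bit j of every value in bit-plane j (an array of uint64 words)."""
    n = len(values)
    nwords = (n + 63) // 64
    planes = np.zeros((nbits, nwords), dtype=np.uint64)
    for i, v in enumerate(values):
        for j in range(nbits):
            if (int(v) >> j) & 1:
                planes[j, i // 64] |= np.uint64(1) << np.uint64(i % 64)
    return planes

def masked_sum(planes, instance_bitset):
    """Sum the sliced values over the instances selected by the bitset."""
    total = 0
    for j, plane in enumerate(planes):
        selected = plane & instance_bitset
        total += (1 << j) * sum(bin(int(w)).count("1") for w in selected)
    return total

# 4-bit discretized "gradients" for 100 instances, all instances selected.
values = np.random.randint(0, 16, size=100)
planes = to_bitslices(values, nbits=4)
all_instances = np.full((100 + 63) // 64, ~np.uint64(0))
assert masked_sum(planes, all_instances) == values.sum()
```

The payoff is that one AND-plus-popcount over a 64-bit word processes 64 instances at once, instead of accumulating histogram statistics one instance at a time.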
BitBoost outperforms these systems in terms of training time when a significant number of the input features are categorical and have only a few possible values (i.e., low cardinality). The tables below compare training time and accuracy on five datasets:
Time (seconds):
| | Allstate | Covtype1 | Covtype2 | Bin-MNIST | YouTube |
|---|---|---|---|---|---|
| BitBoost Accurate | 4.8 | 17.1 | 10.7 | 4.5 | 14.3 |
| BitBoost Fast | 1.0 | 5.4 | 7.2 | 1.9 | 2.5 |
| LightGBM | 12.3 | 24.1 | 21.0 | 24.8 | 35.0 |
| XGBoost | 11.5 | 37.0 | 35.3 | 24.7 | 24.9 |
| CatBoost | 82.6 | 58.1 | 52.9 | 16.5 | 33.6 |
Accuracy:

| | Allstate (MAE) | Covtype1 (error %) | Covtype2 (error %) | Bin-MNIST (error %) | YouTube (MAE) |
|---|---|---|---|---|---|
| BitBoost Accurate | 1159 | 12.0 | 0.79 | 2.78 | 0.07 |
| BitBoost Fast | 1194 | 14.9 | 1.02 | 3.52 | 0.12 |
| LightGBM | 1156 | 11.9 | 0.71 | 2.86 | 0.07 |
| XGBoost | 1157 | 10.8 | 0.63 | 2.66 | 0.07 |
| CatBoost | 1167 | 13.1 | 0.91 | 3.23 | 0.11 |
See the paper for more information about these experiments.
Note: this is an experimental system, and
- BitBoost does not (yet) support multi-class classification,
- BitBoost does not (yet) support proper multi-threading,
- BitBoost does not (yet) effectively handle sparse features,
- BitBoost works best for low-cardinality categorical features,
- BitBoost can handle high-cardinality categorical and numerical features efficiently, provided that (1) there are not too many of them and (2) only coarse-grained splits are required on those features, i.e., we can use a high `sample_freq` and a low `max_nbins` parameter value.
Specifically, BitBoost will most likely perform worse on fully numerical datasets; in that case, use LightGBM, XGBoost, or CatBoost instead. The sketch below shows how the two parameters above fit into the Python interface.
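This is a hypothetical configuration rather than a tuned recommendation, and it assumes the Python wrapper accepts the same `sample_freq` and `max_nbins` parameters that the command-line interface exposes (see `src/config.rs` for the authoritative list):

```python
from bitboost import BitBoostRegressor

# Hypothetical settings for a dataset with a few high-cardinality or
# numerical columns: a high sample_freq and a low max_nbins keep splits
# on those columns coarse-grained, where BitBoost remains efficient.
# (Assumes the wrapper forwards these parameters; see src/config.rs.)
bit = BitBoostRegressor(
    objective="l2",
    discr_nbits=4,
    niterations=20,
    learning_rate=0.5,
    sample_freq=10,
    max_nbins=16)
```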
© DTAI Research Group - KU Leuven. Licensed under the Apache License 2.0.
Please cite this paper:
Devos, L., Meert, W., & Davis, J. (2019). Fast Gradient Boosting Decision Trees with Bit-Level Data Structures. In Proceedings of ECML PKDD. Springer.
BitBoost is implemented in stable Rust and uses the standard Rust tools, `cargo` and `rustc`.
- Make sure you have Rust 2018 edition installed, that is, Rust 1.31 or higher.
- Clone this repository.
- Tell `rustc` to generate efficient AVX2 instructions (ensure you have an AVX2-capable CPU): `export RUSTFLAGS="-C target-cpu=native"`
- Compile the code: `cargo build --release`
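Putting these steps together, a typical shell session looks as follows; the repository location is written as a placeholder, matching the `<bitboost-repo>` convention used below:

```bash
git clone <bitboost-repo-url>
cd <bitboost-repo>
export RUSTFLAGS="-C target-cpu=native"  # generate AVX2 instructions
cargo build --release                    # produces ./target/release/run_bitboost
```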
BitBoost is not yet available on PyPI, but you can build and install a pip package locally on Linux as follows. First, ensure you have Rust installed. Then activate the Python 3 environment of your liking and run:
```bash
cd <bitboost-repo>/python
python setup.py install [--user]
```
Use `--user` if you don't have write access to your site-packages directory. Test your installation with the following code snippet:
```python
import numpy as np
import sklearn.metrics
from bitboost import BitBoostRegressor

# Generate some low-cardinality categorical data
nfeatures = 5
nexamples = 10000
data = np.random.choice(np.array([0.0, 1.0, 2.0], dtype=BitBoostRegressor.numt),
                        size=(nexamples * 2, nfeatures))
target = (1.22 * (data[:, 0] > 1.0)
          + 0.65 * (data[:, 1] > 1.0)
          + 0.94 * (data[:, 2] != 2.0)
          + 0.13 * (data[:, 3] == 1.0)).astype(BitBoostRegressor.numt)

# Run BitBoost
bit = BitBoostRegressor(
    objective="l2", discr_nbits=4, max_tree_depth=5, learning_rate=0.5,
    niterations=20, categorical_features=list(range(nfeatures)))
bit.fit(data, target)
train_mae = sklearn.metrics.mean_absolute_error(target, bit.predict(data))
```
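Note that the snippet generates `nexamples * 2` rows; a natural follow-up, as a sketch that is not part of the original snippet, is to hold out half of them as a test set:

```python
# Split the generated rows into a training and a testing half (sketch).
Xtrain, Xtest = data[:nexamples], data[nexamples:]
ytrain, ytest = target[:nexamples], target[nexamples:]
bit.fit(Xtrain, ytrain)
test_mae = sklearn.metrics.mean_absolute_error(ytest, bit.predict(Xtest))
print(f"test MAE: {test_mae:.4f}")
```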
BitBoost has a scikit-learn interface. A number of examples are provided in the `examples` folder.
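Because of that interface, BitBoost estimators can in principle be dropped into scikit-learn utilities. A hedged sketch, assuming `BitBoostRegressor` implements the standard estimator API (`get_params`/`set_params`) that helpers like `cross_val_score` require:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from bitboost import BitBoostRegressor

# Hypothetical: 3-fold cross-validation via scikit-learn's utilities,
# assuming BitBoostRegressor is a fully compliant scikit-learn estimator.
X = np.random.choice(np.array([0.0, 1.0, 2.0], dtype=BitBoostRegressor.numt),
                     size=(1000, 5))
y = (X[:, 0] > 1.0).astype(BitBoostRegressor.numt)
model = BitBoostRegressor(objective="l2", niterations=10, learning_rate=0.5,
                          categorical_features=list(range(5)))
scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
print(scores.mean())
```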
Use the `run_bitboost` binary to run BitBoost from the command line:
```bash
./target/release/run_bitboost boost \
    train=/path/to/train.csv \
    test=/path/to/test.csv \
    objective=binary \
    niterations=10 \
    learning_rate=0.5 \
    metrics=binary_error,binary_loss \
    categorical_features=0,1,2,3,4 \
    sample_freq=10 \
    discr_nbits=8 \
    max_nbins=16
```
The command-line interface only supports CSV input files. All the parameters can be found in `src/config.rs`; the supported objectives are in `src/objective.rs`.
In Python, you can refer to the parameter documentation as follows:
```python
import bitboost
help(bitboost.BitBoost)
```
Check out the `experiments` branch for the experimental setup and the per-dataset results.