laudv/bitboost

installation on google colab, self.numt assert

Opened this issue · 2 comments

Hi @laudv

  1. Thank you for your paper and for sharing your project; it presents a very interesting approach.

  2. I've successfully built and installed it on Google Colab. You can find it here:
    https://colab.research.google.com/drive/1lz6ps34TWMsVm07cnW3EC--pIPTY4PgT?usp=sharing
    I'm relatively new to Rust, so I'm not entirely sure if I've followed all the necessary steps correctly. During the build process, I encountered several warnings like the one below:

warning: unused import: `BitsliceLayout`
  --> src/count_and_sum.rs:11:23
   |
11 | use crate::bitslice::{BitsliceLayout, BitsliceWithLayout};
   |                       ^^^^^^^^^^^^^^
   |
   = note: `#[warn(unused_imports)]` on by default

but your example seems to run smoothly.

  3. I've attempted to test BitBoost with the Numerai dataset, as it appears to be a perfect fit (features take only the values 0, 1, 2, 3, 4). The training phase seems fine, but during prediction I encountered an assertion error:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-20-7aef979a8edf> in <cell line: 1>()
----> 1 validation[f"prediction_{target}"] = model2.predict(validation[features].to_numpy())

/content/bitboost/python/bitboost/sklearn.py in predict(self, X)
     68         check_is_fitted(self, "_is_fitted")
     69 
---> 70         self._bitboost.set_data(X)
     71         return self._bitboost.predict()
     72 

/content/bitboost/python/bitboost/bitboost.py in set_data(self, data, cat_features)
    126         self._check()
    127         assert isinstance(data, np.ndarray)
--> 128         assert data.dtype == self.numt
    129         assert data.shape[1] == self._nfeatures
    130 

AssertionError:

I'm not sure what's causing it or how to resolve the issue.

  4. In LightGBM, I'm using the following parameters:
model = lgb.LGBMRegressor(
    n_estimators=100,  # If you want to use a larger model we've found 20_000 trees to be better
    learning_rate=0.01, # and a learning rate of 0.001
    max_depth=5, # and max_depth=6
    num_leaves=2**5-1, # and num_leaves of 2**6-1
    colsample_bytree=0.1
)

To ensure an "apples-to-apples" comparison, how should I configure BitBoost?
It would be great if BitBoost could reach similar accuracy in a fraction of the time :)
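To make the question concrete, here is my rough guess at the mapping I have in mind. The BitBoost-side parameter names below are placeholders I made up, since I don't know the wrapper's actual API; only the values are meant to carry over:

```python
# LightGBM settings from above, and the values I would want to mirror in
# BitBoost. The BitBoost-side keys are placeholders, not confirmed API names.
lightgbm_params = {
    "n_estimators": 100,
    "learning_rate": 0.01,
    "max_depth": 5,
    "num_leaves": 2**5 - 1,   # 31
    "colsample_bytree": 0.1,
}

bitboost_params = {
    "niterations": lightgbm_params["n_estimators"],           # placeholder name
    "learning_rate": lightgbm_params["learning_rate"],
    "max_tree_depth": lightgbm_params["max_depth"],           # placeholder name
    "feature_fraction": lightgbm_params["colsample_bytree"],  # placeholder name
}
```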

Best regards,
Marek

laudv commented

Hi Marek,

Thanks for the clear overview and thanks for taking interest in BitBoost.

First a disclaimer. I made BitBoost about 4-5 years ago now. As is often the case with research code, unfortunately, BitBoost is not a nicely packaged, finished product. How exactly do you want to use it? Is it for research purposes, or do you want to use it in another way? I would not recommend using BitBoost in a production environment.

If you are trying BitBoost to experiment with the bit-level optimizations, then I'm very happy to provide help where necessary.

  1. The unused import warnings are resolved by removing the unused imports.
  2. If that dataset contains many low-cardinality categorical features, then yes, it is a good fit. The error indicates that your evaluation data has the wrong dtype. Try .astype(BitBoostRegressor.numt) on your numpy array.
  3. You should also use feature subsampling in BitBoost (use the feature_fraction parameter). This corresponds to LightGBM's colsample_bytree parameter.
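Regarding point 2: the asserts you hit are just input checks. Simplified, set_data does something like the sketch below (not the actual wrapper code; np.float32 stands in for whatever BitBoostRegressor.numt resolves to):

```python
import numpy as np

def check_data(data, numt, nfeatures):
    # Simplified version of the checks in bitboost/python/bitboost/bitboost.py's
    # set_data: correct container type, element dtype, and number of features.
    assert isinstance(data, np.ndarray)
    assert data.dtype == numt          # fails for a plain pandas .to_numpy() result (float64)
    assert data.shape[1] == nfeatures

# Passes: dtype matches what BitBoost expects.
check_data(np.zeros((4, 2), dtype=np.float32), np.float32, 2)

# Fails: np.zeros defaults to float64.
try:
    check_data(np.zeros((4, 2)), np.float32, 2)
except AssertionError:
    print("wrong dtype: cast with .astype(...) first")  # prints this line
```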

Best, Laurens

Hi Laurens,

Indeed, .astype(BitBoostRegressor.numt) solves the issue :)
My main goal is to have a faster and more memory-efficient library than LightGBM for datasets like Numerai, and BitBoost seems to be a good starting point for experiments.
The first idea that comes to my mind is to replace ctypes float with bfloat16 (https://crates.io/keywords/bfloat16). I'll try to investigate it.
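For anyone else landing here, the fix is a single cast before calling predict. Sketch below with np.float32 as a stand-in; in the real call, use BitBoostRegressor.numt itself:

```python
import numpy as np

numt = np.float32            # stand-in for BitBoostRegressor.numt

X = np.random.rand(5, 3)     # float64, like a typical pandas .to_numpy() result
X = X.astype(numt)           # match the dtype that BitBoost's set_data asserts on
assert X.dtype == numt
# predictions = model2.predict(X)   # set_data's dtype assert now passes
```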

Best regards,
Marek