pyensemble/wildwood

We can predict without binning: threshold instead of bin_threshold

Opened this issue · 4 comments

For now, we need to apply the Binner to the feature matrix for which we want to produce predictions, but we can avoid this. In node_dtype we already save both threshold and bin_threshold, although threshold is unused for now. We can recover the value of threshold from bin_threshold without having to apply the Binner. This is easy for continuous features; we need to check how to proceed for categorical ones.
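For continuous features the mapping is direct, since each bin boundary is itself a raw-feature value. A minimal sketch (the `bin_thresholds` array here is hypothetical and just mimics the Binner's per-feature split points, where bin b contains values in (bin_thresholds[b-1], bin_thresholds[b]]):

```python
import numpy as np

# Hypothetical per-feature split points produced by the Binner
bin_thresholds = np.array([0.5, 1.3, 2.7, 4.1], dtype=np.float32)

def real_threshold(bin_threshold_idx, bin_thresholds):
    """Raw-feature threshold equivalent to a split 'bin <= bin_threshold_idx'."""
    return bin_thresholds[bin_threshold_idx]

# A split "bin <= 2" on binned data is the same as "x <= 2.7" on raw data
assert real_threshold(2, bin_thresholds) == np.float32(2.7)
```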

In grow, we will need to store the real threshold for numerical splits, which means we need this information for all bin_thresholds while we make splits.

  • Should we add a bin_thresholds attribute in tree_context?
  • Actually, bin_thresholds from the binner is a list of variable-length arrays (each of length <= 254). To store this, shall we make tree_context.bin_thresholds a matrix of shape (num_features, 254)?
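Packing the variable-length list into one fixed-size matrix could look like this (a sketch with hypothetical names; padding unused slots with +inf is just one choice that guarantees padded entries never win a `<=` comparison):

```python
import numpy as np

def pack_bin_thresholds(bin_thresholds_list, n_features, max_bins=254):
    """Pack a list of variable-length threshold arrays into a dense
    (n_features, max_bins) float32 matrix, padding with +inf."""
    out = np.full((n_features, max_bins), np.inf, dtype=np.float32)
    for f, thresholds in enumerate(bin_thresholds_list):
        out[f, : thresholds.size] = thresholds
    return out

bin_thresholds_list = [
    np.array([0.5, 1.3], dtype=np.float32),  # feature 0: 2 thresholds
    np.array([-1.0], dtype=np.float32),      # feature 1: 1 threshold
]
packed = pack_bin_thresholds(bin_thresholds_list, n_features=2)
```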

Yes this information should be stored in the tree_context, but you need to check how to proceed for categorical features

OK, so

  • we add a bin_thresholds attribute in tree_context, an ndarray of float64 (or float32?) of shape (num_features, 254)
  • for categorical features, if we agree that the input data is already preprocessed so that the categories are 0., 1., ..., and the Binner changes nothing apart from the type (float to uint8), then I think we can just take the float values 0., 1., ..., convert them manually to uint8, and follow what we did for bin numbers (following what is in tree.bin_partitions)
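Putting the two bullets together, the decision at a node on raw (unbinned) data could be sketched as below. All names are hypothetical (`bin_partition` stands in for one entry of tree.bin_partitions, represented here as a set of category codes):

```python
import numpy as np

def child_index(x_raw, feature, threshold, is_categorical, bin_partition):
    """Return 0 (left child) or 1 (right child) for a raw sample x_raw.

    Numerical split: compare the raw value to the saved float threshold.
    Categorical split: the raw data is assumed to already be 0., 1., ...,
    so casting to uint8 reproduces the binned comparison exactly.
    """
    if is_categorical:
        category = np.uint8(x_raw[feature])
        return 0 if category in bin_partition else 1
    return 0 if x_raw[feature] <= threshold else 1
```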

I just made a very first version of this, and only on numerical features for now.
My feeling is that this might make prediction a bit faster, but what really takes time in prediction is aggregation.

I think float32 for bin_thresholds since everything is float32
Please add tests on several small datasets checking that prediction with and without binning leads to exactly the same results
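The core invariant such a test should check, on a toy single-split example (not WildWood's actual API; `binize` here mimics quantile binning with numpy.searchsorted, so that "bin <= b" and "x <= bin_thresholds[b]" are provably the same predicate):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)

# Quantile-based split points, stored as float32 like the Binner's output
bin_thresholds = np.quantile(X, np.linspace(0.1, 0.9, 9)).astype(np.float32)

def binize(x):
    """Bin index of each value: searchsorted(left) gives bin(x) <= b
    exactly when x <= bin_thresholds[b]."""
    return np.searchsorted(bin_thresholds, x, side="left").astype(np.uint8)

b = 4  # a split on "bin <= 4"
pred_binned = binize(X) <= b                 # prediction path with binning
pred_raw = X <= bin_thresholds[b]            # prediction path without binning
assert np.array_equal(pred_binned, pred_raw)
```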