We can predict without binning: use threshold instead of bin_threshold
Opened this issue · 4 comments
For now, we need to apply the Binner to a matrix of features for which we want to produce predictions, but we can avoid this. In node_dtype we already save both threshold and bin_threshold, yet we do not use threshold for now. We can recover the value of threshold from bin_threshold, without having to apply the Binner. This is easy for continuous features; we need to check how to handle categorical ones.
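For a continuous feature, the idea above amounts to mapping the split "binned value <= bin_threshold" back to "raw value <= bin_thresholds[feature][bin_threshold]". A minimal sketch of this equivalence (the function name and the list-of-arrays layout are assumptions for illustration, not the project's actual API):

```python
import numpy as np

# Hypothetical sketch: recover the real-valued threshold of a split from the
# binner's per-feature bin boundaries.
def real_threshold(bin_thresholds, feature, bin_threshold):
    """Map a split on bins (x_bin <= bin_threshold) back to a split on raw
    values (x <= threshold) for a continuous feature."""
    return bin_thresholds[feature][bin_threshold]

# Toy example: feature 0 was binned with boundaries [0.5, 1.5, 2.5].
bin_thresholds = [np.array([0.5, 1.5, 2.5])]
x = 1.2
bin_of_x = np.searchsorted(bin_thresholds[0], x)   # bin index of x
threshold = real_threshold(bin_thresholds, 0, 1)   # split at bin 1 -> 1.5

# The two decision rules agree: bin_of_x <= 1  iff  x <= 1.5
assert (bin_of_x <= 1) == (x <= threshold)
```

With sorted boundaries and searchsorted's default side="left", "bin index <= t" is exactly equivalent to "x <= boundaries[t]", so predictions on raw values match predictions on binned values.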
In grow, we will need to store the real threshold. For numerical features, that means keeping this information for all bin_thresholds while we make splits.
- Should we add a bin_thresholds attribute in tree_context?
- Actually, bin_thresholds from the binner is a list of variable-length arrays (each of length <= 254). To store this, shall we make tree_context.bin_thresholds a matrix of shape (num_features, 254)?
Yes this information should be stored in the tree_context, but you need to check how to proceed for categorical features
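Packing the binner's ragged list into one fixed-size attribute could be sketched as follows (pack_bin_thresholds and the +inf padding are assumptions for illustration, not the project's actual code):

```python
import numpy as np

# Hypothetical sketch: pack a list of variable-length threshold arrays
# (each of length <= 254) into a single fixed-size matrix, so it can be
# stored as one attribute of tree_context.
def pack_bin_thresholds(bin_thresholds, max_bins=254, dtype=np.float32):
    num_features = len(bin_thresholds)
    out = np.full((num_features, max_bins), np.inf, dtype=dtype)
    for j, thresholds in enumerate(bin_thresholds):
        out[j, :len(thresholds)] = thresholds
    return out

# Two features with different numbers of bin boundaries.
packed = pack_bin_thresholds([np.array([0.5, 1.5]), np.array([10.0])])
# Unused slots are padded with +inf; valid splits never index into them.
```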
OK, so:
- we add a bin_thresholds attribute in tree_context, an ndarray of float64 (or float32?) of shape (num_features, 254)
- for categorical features, if we agree that the input data are already processed so that values are 0., 1., ..., and the Binner changes nothing apart from their type from float to uint8, then I think we can just take the floats 0., 1., ..., convert them manually to uint8, and follow what we did for bin numbers (see tree.bin_partitions)
I just made a very first version of this, for numerical features only for now.
My feeling is that this might make prediction a bit faster, but what really takes time in prediction is aggregation.
I think float32 for bin_thresholds, since everything else is float32
Please add tests on several small datasets checking that prediction with and without binning leads to exactly the same results
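Such a test could be sketched in pure NumPy on a toy single-split tree, traversed once on binned data and once on raw data (the function names and the quantile-based boundaries are illustrative assumptions, not the project's actual test harness):

```python
import numpy as np

# Hypothetical check: a single split evaluated on bins vs. on raw values
# must produce exactly the same predictions.
def predict_binned(x_bin, bin_threshold, left, right):
    return np.where(x_bin <= bin_threshold, left, right)

def predict_raw(x, threshold, left, right):
    return np.where(x <= threshold, left, right)

rng = np.random.default_rng(0)
x = rng.normal(size=1000).astype(np.float32)
# Build 253 quantile-based bin boundaries, as a histogram binner might.
boundaries = np.quantile(x, np.linspace(0, 1, 255)[1:-1]).astype(np.float32)
x_bin = np.searchsorted(boundaries, x).astype(np.uint8)

bin_threshold = 100
threshold = boundaries[bin_threshold]   # real-valued threshold recovered
p_binned = predict_binned(x_bin, bin_threshold, -1.0, 1.0)
p_raw = predict_raw(x, threshold, -1.0, 1.0)
assert np.array_equal(p_binned, p_raw)  # exact same results, no tolerance
```

Note the comparison uses exact equality, not a tolerance, matching the requirement that the two prediction paths agree exactly.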