tensorflow/decision-forests

max_vocab_count won't work for CATEGORICAL integerized in tfdf.keras.GradientBoostedTreesModel

advahadr opened this issue · 5 comments

Hi All,

versions:
python 3.9
tensorflow_decision_forests==0.2.6
tensorflow==2.9.1

Running on AWS instance type: ml.m5.24xlarge

Problem description:
When max_vocab_count is set to 20 in tfdf.keras.FeatureUsage (passed to tfdf.keras.GradientBoostedTreesModel), features of type CATEGORICAL integerized are not affected and their original vocabulary size is used, while for features of type CATEGORICAL has-dict max_vocab_count is applied correctly.
See the statistics from the log below; both features use the same feature usage:

"request_id" CATEGORICAL integerized vocab-size:8806 no-ood-item
"request_tile" CATEGORICAL has-dict vocab-size:21 num-oods:2823 (0.0014115%) most-frequent:"851fb467fffffff" 2395895 (1.19795%)

request_id is ignored by the guide and max_vocab_count is not applied;
request_tile is handled correctly.
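For reference, the capping the reporter expected from max_vocab_count can be sketched in plain Python. This is an illustration of the vocabulary-capping semantics, not TF-DF's implementation; the threshold of 20 and the feature names come from the report above.

```python
from collections import Counter

def capped_vocabulary(values, max_vocab_count):
    """Return the capped vocabulary: the max_vocab_count most frequent
    values are kept; everything else is expected to become
    out-of-dictionary (OOD)."""
    counts = Counter(values)
    return [v for v, _ in counts.most_common(max_vocab_count)]

# With max_vocab_count=20, a has-dict column like "request_tile" ends up
# with at most 20 in-dictionary items (plus an OOD bucket), while the
# integerized "request_id" column keeps its full vocabulary of 8806 values.
values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
print(capped_vocabulary(values, max_vocab_count=2))  # ['a', 'b']
```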

I would appreciate your help, thank you!

rstz commented

Hi,
thank you for the report.
Version 0.2.6 of TF-DF is fairly outdated; the project is at version 1.5.0 now. Can you please confirm that the issue also occurs on the latest version of TF-DF?
Best,
Richard

rstz commented

Hi, I looked into the code a bit more deeply and the behaviour you're seeing is expected and probably hasn't changed since 0.2.6.
Pre-integerized categorical features (i.e., categorical features represented as integers) are not post-processed using the guide information, to allow users fine-grained control over how values are processed by TF-DF. This is why guide settings like max_vocab_count are not applied to them.

The simplest fix is to feed the feature as a string instead of an int, so TF-DF will not assume it's pre-integerized.
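A minimal sketch of that workaround, assuming the feature arrives as a column of ints (the list-based helper is illustrative; inside a TensorFlow preprocessing pipeline you would use tf.strings.as_string on the feature tensor instead):

```python
def integers_to_strings(values):
    """Cast integer category IDs to strings so TF-DF treats the column as
    a regular dictionary-backed categorical feature (subject to
    max_vocab_count) instead of a pre-integerized one."""
    return [str(v) for v in values]

# In a tf.data / Keras preprocessing step the equivalent would be
# tf.strings.as_string(feature_tensor); the plain-list version is only
# for illustration.
request_ids = [8806, 42, 42, 17]
print(integers_to_strings(request_ids))  # ['8806', '42', '42', '17']
```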

Alternatively, you can continue feeding the values as integers and set every value but the max_vocab_count most frequent to 0 (which represents out-of-dictionary items). When doing this, you have to set advanced_arguments=tfdf.keras.AdvancedArguments(disable_categorical_integer_offset_correction=False) as an argument of the model constructor to allow TF-DF to properly recognize this. Also note that for integerized columns, the human-readable summary of the data spec (printed above) does not correctly count the number of ood items - this is a bug to be fixed in one of the next versions.
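The remap-to-0 step above can be sketched in plain Python (the helper name is illustrative). The AdvancedArguments flag from the reply is shown in a comment because it belongs in the model constructor, not in the preprocessing:

```python
from collections import Counter

def remap_to_top_k(values, max_vocab_count):
    """Keep the max_vocab_count most frequent integer categories and map
    every other value to 0, which represents out-of-dictionary items for
    integerized categorical columns."""
    keep = {v for v, _ in Counter(values).most_common(max_vocab_count)}
    return [v if v in keep else 0 for v in values]

# The model must then be told that 0 already means OOD, e.g.:
# model = tfdf.keras.GradientBoostedTreesModel(
#     advanced_arguments=tfdf.keras.AdvancedArguments(
#         disable_categorical_integer_offset_correction=False))

ids = [7, 7, 7, 3, 3, 9, 5]
print(remap_to_top_k(ids, max_vocab_count=2))  # [7, 7, 7, 3, 3, 0, 0]
```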

advahadr commented

Hi,

Thank you for your quick reply.
The suggested workaround seems reasonable; however, I'm limited in the data manipulation I can do (we want to serve the model in production and are constrained by latency and data types). In addition, the TF-DF model is used as a layer in a larger deep architecture, so we face limitations both in versions and in data manipulation.

If I want to avoid data manipulation, can you please point me to the code where I could apply the guide to pre-integerized categorical features as well?

I found this source file, which is relevant to the release we are using; applying a custom fix there would be the best solution given our limitations.

Thank you for your help!

rstz commented

Hi,

The relevant piece is in this function. If you read through it, you'll see that some preprocessing is not applied when the column is integerized. You'll have to make sure that the parameter max_number_of_unique_values is correctly applied even for integerized columns. I don't think this will be completely trivial, since no statistical information (e.g., how often each integer appears) is collected for integerized columns; you'll have to make sure this information is computed properly.

Note that this code resides in the Yggdrasil Decision Forests repository, a C++ library developed by the same team as TF-DF. During compilation (with Bazel), you'll have to make sure your local, modified copy of Yggdrasil Decision Forests is used. Finally, follow these instructions for building the old TF Serving with the old TF-DF. Unfortunately, our team does not have the bandwidth to support you through this process.

If this is too cumbersome, I suggest you also try simply excluding the request_id column from the model and measuring model performance. Very often, a tree model cannot extract useful information from such columns, especially if most entries are out-of-vocabulary (though this depends, of course, on your individual use case).

rstz commented

I thought about this a bit more and I believe this can be considered a bug and it's probably something we should address. I'll keep this issue open (and labeled) as a reminder.