Bug when running tabular.fit() and tabular.sample() with CPU
ChristinaChr opened this issue · 2 comments
Hello @avsolatorio,
There might be a bug when running tabular.fit() and tabular.sample() with device='cpu' (this might also be the case for relational models, but I haven't tested that).
I trained a tabular model on CPU using a dataframe containing the columns in the following example. Their original data types were {integer_as_str: object[str], integer: int64, float: float64, boolean: bool, datetime: datetime64[ns], string: object[str]}.
integer_as_str | integer | float | boolean | datetime | string |
---|---|---|---|---|---|
03 | 6214 | 54.09 | false | 2002-10-15 03:07:53 | qyjib |
31 | 2997 | 39.15 | false | 1999-05-18 01:09:18 | mjuvv |
38 | 3362 | 52.91 | true | 1999-08-27 10:44:03 | ffskd |
47 | 2286 | 50.68 | false | 1999-02-02 05:48:06 | evqml |
24 | 14482 | 77.8 | true | 2001-09-08 13:56:20 | wieai |
In my case, I want to generate only values that are present in the training data, independently of their type. In other words, I don't want to generate new values that do not exist in the training data.
To achieve that, I experimented with adding a letter prefix to the beginning of each value (see transformation example below). I expected to see no new values in any of the columns. Instead, I got values that belong to another column's data type (if we ignore the a_, b_, etc. prefixes). For example, in the datetime column I got the value b_2997 (a valid value, but for another column!!), and in the float column I got the value e_1999-02-02 05:48:06 (again a valid value, but for another column!!).
integer_as_str | integer | float | boolean | datetime | string |
---|---|---|---|---|---|
a_03 | b_6214 | c_54.09 | d_false | e_2002-10-15 03:07:53 | f_qyjib |
a_31 | b_2997 | c_39.15 | d_false | e_1999-05-18 01:09:18 | f_mjuvv |
a_38 | b_3362 | c_52.91 | d_true | e_1999-08-27 10:44:03 | f_ffskd |
a_47 | b_2286 | c_50.68 | d_false | e_1999-02-02 05:48:06 | f_evqml |
a_24 | b_14482 | c_77.8 | d_true | e_2001-09-08 13:56:20 | f_wieai |
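For reference, the prefixing step can be sketched roughly as below with plain pandas. This is only an illustration of the idea, not the exact code I used: the helper name add_column_prefixes is hypothetical, and the string formatting that astype(str) produces (for example `False` rather than `false` for booleans) is an assumption that may differ from the table above.

```python
import string

import pandas as pd


def add_column_prefixes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every column to str and prepend a per-column letter (a_, b_, ...).

    Hypothetical helper for illustration; it is not part of any library.
    """
    if len(df.columns) > len(string.ascii_lowercase):
        raise ValueError("more columns than available single-letter prefixes")
    out = pd.DataFrame(index=df.index)
    for letter, col in zip(string.ascii_lowercase, df.columns):
        # Every value becomes a string such as "b_6214", tagged with its column.
        out[col] = letter + "_" + df[col].astype(str)
    return out


# Small hand-made sample that reuses the first two rows of the table above.
df = pd.DataFrame(
    {
        "integer_as_str": ["03", "31"],
        "integer": [6214, 2997],
        "float": [54.09, 39.15],
    }
)
prefixed = add_column_prefixes(df)
print(prefixed)
```

Because each value carries its column's letter, any value that shows up in the wrong column after sampling can be traced back to the column it came from.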
Note that everything works as expected when both tabular.fit() and tabular.sample() run with device='cuda'. What do you think? Could this be a bug that only occurs on CPU?
Hello @ChristinaChr, this is interesting! Would you mind sharing a simple colab notebook that can reproduce this? Thank you!
Hello @avsolatorio,
Thanks for the quick response! I am attaching a zip with the colab notebook, which contains a working example that reproduces the issue. There is a section at the end where you can check whether new values have been generated.
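A check for new values could look something like the sketch below. It is a simplified, pandas-only illustration and not necessarily the code in the notebook: the function name report_unseen is hypothetical, and the two small dataframes are hand-made to mirror the symptom described above, not real model output.

```python
import pandas as pd


def report_unseen(train: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """For each column, list synthetic values that never occur in that training
    column, together with the other training columns where they do occur.

    Hypothetical helper for illustration; it is not part of any library.
    """
    train_sets = {c: set(train[c].astype(str)) for c in train.columns}
    report = {}
    for col in train.columns:
        unseen = set(synthetic[col].astype(str)) - train_sets[col]
        for value in sorted(unseen):
            # An empty list means the value is entirely new, not borrowed.
            origins = [
                c for c in train.columns if c != col and value in train_sets[c]
            ]
            report.setdefault(col, {})[value] = origins
    return report


# Hand-made data: "b_2997" appears in the datetime column of the samples.
train = pd.DataFrame(
    {
        "integer": ["b_6214", "b_2997"],
        "datetime": ["e_2002-10-15 03:07:53", "e_1999-05-18 01:09:18"],
    }
)
synthetic = pd.DataFrame(
    {
        "integer": ["b_2997", "b_6214"],
        "datetime": ["b_2997", "e_1999-05-18 01:09:18"],
    }
)
report = report_unseen(train, synthetic)
print(report)
```

An empty result means every sampled value was already present in the same training column. A non-empty entry whose list names another column indicates the cross-column mix-up described in this issue.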