TabularUtilizedAutoML for Multiclass
darenr opened this issue · 4 comments
Apologies for what might be a stupid question -
```python
automl = TabularUtilizedAutoML(task=Task('multiclass')...)
oof_pred = automl.fit_predict(
    df_train, roles={"target": TARGET_NAME}
)
test_pred = automl.predict(df_test)
```
The `test_pred` is in `predict_proba` format, but what I'm looking for is a way for `test_pred` to give me the label of the prediction; for Iris this would be one of Setosa/Virginica/Versicolor. The `automl` automatically encodes the target in the data frame during training. Is there a way to get the prediction label, please? I can't find an attribute on `automl` like `_classes` either.
Is there a way to call `automl.predict` and get back the prediction label?
Hi @darenr,
LightAutoML always works with the data internally and returns class probabilities as the output.
This was a deliberate design decision: the right threshold for assigning a class depends on many different things, such as business needs (must it be 0.5 in every case, or can it differ?) and task restrictions (confidence rules like "if the probability is bigger than 0.7, take that class; otherwise show an 'others' class"). The AutoML model doesn't know the meaning of the classes, so it cannot handle the points above properly in a human sense; that's why we don't include a method that converts probabilities into class label predictions (like `predict` from sklearn does).
As for the labels themselves, there are a few things you should know about LightAutoML. If the initial class labels are not already 0, 1, 2, ..., N-1, we automatically encode them based on their frequency (the more frequent the label, the smaller its integer code). For example, if the frequency order is, as you said, Setosa/Virginica/Versicolor, the encoding will be 0 for Setosa, 1 for Virginica and 2 for Versicolor. So you have two options: either encode the labels into 0, 1, ..., N-1 manually beforehand in whatever way you want, or let LightAutoML do it. With the first option you know the encoding; but if you go the second way, how do you learn the labels?
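For the first option, a minimal sketch of pre-encoding the target yourself (it assumes the `df_train` and `TARGET_NAME` names from your snippet above; the dict names are illustrative):

```python
# Illustrative sketch: encode string labels to 0..N-1 yourself before
# training, so you control (and keep) the mapping. Ordering by descending
# frequency mimics the convention described above.
label2code = {label: code
              for code, label in enumerate(df_train[TARGET_NAME].value_counts().index)}
df_train[TARGET_NAME] = df_train[TARGET_NAME].map(label2code)

# Keep the inverse for decoding predictions later
code2label = {code: label for label, code in label2code.items()}
```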
The class mapping is always stored in the Reader object inside the trained LightAutoML model. It holds a dict with the mapping if encoding was done, and None if not. How to get it from the model depends on which preset you use:
- for `TabularAutoML` it is simple: if the trained model is called `automl`, the mapping is `automl.reader.class_mapping`, and you will see either the dict or None;
- for `TabularUtilizedAutoML` (as it is a composition of `TabularAutoML`s) it is a little harder: `automl.outer_pipes[0].ml_algos[0].models[0].reader.class_mapping`, as we go into the inner automl model and take its Reader property.
Putting all of the above together: you can work with LightAutoML as usual, then fetch the class mapping and, for example, take a row-wise argmax over `test_pred.data` (the raw numpy predictions; `test_pred` itself is a `NumpyDataset` type from our library), if that fits the business needs of your task.
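For example, a rough sketch (untested; it assumes the `automl` and `test_pred` objects from your snippet above):

```python
import numpy as np

# Mapping for the TabularUtilizedAutoML preset (see the list above);
# for plain TabularAutoML it would just be automl.reader.class_mapping.
mapping = automl.outer_pipes[0].ml_algos[0].models[0].reader.class_mapping

proba = test_pred.data                # numpy array, shape (n_rows, n_classes)
encoded = np.argmax(proba, axis=1)    # highest-probability class per row

if mapping is not None:
    # class_mapping maps original label -> encoded column index, so invert it
    code2label = {code: label for label, code in mapping.items()}
    labels = np.array([code2label[code] for code in encoded])
else:
    labels = encoded                  # the target was already 0..N-1
```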
Hope this helps and sorry for the long read :)
Alex
Thanks for the detail, I really appreciate you taking the time; that makes total sense.
Using the MWE from #57 I actually had to add an extra index:

```python
>>> automl.outer_pipes[0].ml_algos[0].models[0][0].reader.class_mapping
{'a': 0, 'b': 1, 'c': 2}
```

because `automl.outer_pipes[0].ml_algos[0].models[0]` was a list:

```python
>>> type(automl.outer_pipes[0].ml_algos[0].models[0])
<class 'list'>
```
Was this a mistake, an update (I am using 0.2.16), or non-deterministic (e.g. depending on the final model)?
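In the meantime, a defensive sketch that handles both shapes (purely illustrative; the list-unwrapping is based on what I observed on 0.2.16):

```python
# Illustrative workaround: on some versions models[0] is a list of fold
# models rather than a single model, so unwrap it before reading the mapping.
inner = automl.outer_pipes[0].ml_algos[0].models[0]
if isinstance(inner, list):
    inner = inner[0]
class_mapping = inner.reader.class_mapping  # dict or None
```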
I still hope a more easily accessible property for this is under consideration, one that's also consistent, because currently there are different behaviors (from experimenting with 0.2.16):

- `str` input is encoded, resulting in a class map based on frequency
- numeric input is encoded if the numbers are not sequential, in which case the class map is based on frequency
- numeric input is not encoded if the numbers are sequential, which means the most frequent class is not necessarily in the first column and there is no class map (which would have been the identity mapping)
- boolean input is encoded, but no map is present; it is always encoded as `0: False, 1: True`, regardless of order or frequency
It looks like ties are broken based on first appearance in the data.
Making sure there's an easily accessible attribute from which to read the mapping (even if in some cases it's an identity mapping) would make it much easier to write data-agnostic code.
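Something along these lines (a hypothetical helper, not part of the LightAutoML API; the name and signature are my own):

```python
# Hypothetical helper sketching the desired data-agnostic behavior: always
# return a code -> label dict, falling back to an identity mapping when no
# class_mapping was stored (e.g. sequential numeric targets).
# Boolean targets would still need special-casing, per the list above.
def get_code_to_label(reader, n_classes):
    mapping = getattr(reader, "class_mapping", None)
    if mapping is None:
        return {i: i for i in range(n_classes)}  # identity: codes are the labels
    return {code: label for label, code in mapping.items()}
```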