sberbank-ai-lab/LightAutoML

TabularUtilizedAutoML for Multiclass

darenr opened this issue · 4 comments

Apologies for what might be a stupid question -

    from lightautoml.automl.presets.tabular_presets import TabularUtilizedAutoML
    from lightautoml.tasks import Task

    automl = TabularUtilizedAutoML(task=Task('multiclass'), ...)

    oof_pred = automl.fit_predict(
        df_train, roles={"target": TARGET_NAME}
    )

    test_pred = automl.predict(df_test)

The test_pred is in the predict_proba format, but what I'm looking for is for test_pred to have a method that gives me the label of the prediction; for Iris this would be one of Setosa/Virginica/Versicolor. The AutoML automatically encodes the target in the data frame during training, and I can't find an attribute like _classes on automl either.

Is there a way, please, to call automl.predict and get back the prediction label?

Hi @darenr,

LightAutoML always outputs class probabilities.

It was our decision to make it this way, because the thresholds for the classes depend on many different things: business needs (must it be 0.5 in every case, or can it differ?), task restrictions (confidence rules like "if the probability is bigger than 0.7, take that class; otherwise show an 'others' class"), etc. The AutoML model doesn't know the meaning of the classes, so it can't handle the points above properly in the human sense; that's why we don't include a method that converts probabilities into class label predictions (like sklearn's predict does).

As for the labels themselves, there are several things you should know about LightAutoML. If the initial class labels are not already 0, 1, 2, ..., N-1, we automatically encode them based on their frequency (the more frequent the label, the smaller its integer code). For example, if the frequency order is, as you said, Setosa/Virginica/Versicolor, the encoding will be 0 for Setosa, 1 for Virginica and 2 for Versicolor. So here you have 2 options: encode the labels into 0..N-1 manually beforehand in whatever way you want, or hand the raw labels to LightAutoML. With the first variant you know the encoding, but if you go the second way, how do you know the labels?
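To make the frequency rule concrete, here is a rough way to preview the expected mapping with pandas (my illustration, not a library call; tie-breaking between equally frequent labels is not guaranteed to match what LightAutoML does internally):

    # Preview the frequency-based encoding: more frequent label -> smaller code.
    # Illustration only; ties between equally frequent labels may be ordered
    # differently inside LightAutoML.
    counts = df_train[TARGET_NAME].value_counts()
    expected_mapping = {label: code for code, label in enumerate(counts.index)}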

The class mapping is always stored in the Reader object inside the trained LightAutoML model. It holds a dict with the mapping if encoding was done, and None if not. How you take it from the model depends on which preset you use:

  • for TabularAutoML it is simple: if the trained model is called automl, the mapping is automl.reader.class_mapping, and you will see the dict or None
  • for TabularUtilizedAutoML (as it is a composition of TabularAutoML models) it is a little harder: automl.outer_pipes[0].ml_algos[0].models[0].reader.class_mapping, as we go into the inner AutoML model and take its Reader property.

So, taking all of the above together: you can work with LightAutoML as usual, then get the class mapping and apply, for example, a row-wise argmax to test_pred.data (this is the numpy prediction; test_pred itself has the NumpyDataset type from our library), if that fits the business needs of your task.
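In code, a minimal decoding sketch could look like this (shown for a plain TabularAutoML, and assuming class_mapping maps the original label to its encoded integer):

    import numpy as np

    # Row-wise argmax gives the encoded class index for each test row.
    encoded = test_pred.data.argmax(axis=1)

    mapping = automl.reader.class_mapping  # e.g. {'Setosa': 0, ...} or None
    if mapping is not None:
        # Invert label -> code to recover the original labels.
        inverse = {code: label for label, code in mapping.items()}
        labels = np.array([inverse[i] for i in encoded])
    else:
        labels = encoded  # labels were already 0..N-1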

Hope this helps and sorry for the long read :)

Alex

Thanks for the detail, I really appreciate you taking the time; that makes total sense.

Using the MWE from #57 I actually had to add an extra index:

    automl.outer_pipes[0].ml_algos[0].models[0][0].reader.class_mapping
    {'a': 0, 'b': 1, 'c': 2}

because automl.outer_pipes[0].ml_algos[0].models[0] was a list:

    type(automl.outer_pipes[0].ml_algos[0].models[0])
    <class 'list'>

Was this a mistake, an update (I am using 0.2.16), or non-deterministic (e.g. depending on the final model)?

I still hope a more easily accessible property for this is under consideration, one that's also consistent, because currently there are different behaviors (from experimenting with 0.2.16):

  • str input is encoded, resulting in a class map based on frequency
  • numeric input is encoded if the numbers are not sequential, in which case the class map is based on frequency
  • numeric input is not encoded if the numbers are sequential, which means the most frequent class is not necessarily in the first column and there is no class map (which would have been the identity function)
  • boolean input is encoded, but no map is present; False is always encoded as 0 and True as 1, regardless of order or frequency

It looks like ties are broken based on first appearance in the data.

Making sure there's an easily accessible attribute from which to read the mapping (even if in some cases it's an identity mapping) would make it much easier to write data-agnostic code.
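For reference, this is the kind of data-agnostic helper I have in mind; just a sketch, with names of my own invention, and attribute paths matching what I observed in 0.2.16:

    import numpy as np

    def get_class_mapping(automl):
        # Hypothetical helper, not part of the library: hide the
        # preset-specific attribute paths behind a single function.
        if hasattr(automl, 'reader'):  # TabularAutoML
            return automl.reader.class_mapping
        # TabularUtilizedAutoML, with the extra list index seen in 0.2.16
        return automl.outer_pipes[0].ml_algos[0].models[0][0].reader.class_mapping

    def decode_labels(automl, pred):
        # Map probability rows back to original labels, treating a
        # missing mapping as the identity function.
        encoded = pred.data.argmax(axis=1)
        mapping = get_class_mapping(automl)
        if mapping is None:
            return encoded
        inverse = {code: label for label, code in mapping.items()}
        return np.array([inverse[i] for i in encoded])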