catboost/catboost

CatBoostError: catboost/libs/metrics/metric.cpp:6405: All train targets are equal

alitirmizi23 opened this issue · 7 comments

Problem: Training a multi-label binary classifier throws the error: All train targets are equal. In some samples all 5 labels are class 1, but not in all samples. Shouldn't it work anyway?

catboost version: 1.0.3
Operating System: Ubuntu 20.04.2 LTS
CPU: Intel® Core™ i7-10875H CPU @ 2.30GHz × 16
GPU: GeForce RTX 2080 Super with Max-Q Design

My y_train array contains labels in the following form:


array([[1, 1, 0, 0, 0],
       [1, 1, 1, 1, 0],
       [1, 1, 1, 1, 0],
       ...,
       [1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0]], dtype=uint8)

train_pool = Pool(X_train, label=y_train,text_features=[0])

clf = CatBoostClassifier(
    loss_function='MultiLogloss',
    text_features=[0]
)
clf.fit(train_pool, plot=True)

It doesn't fit and eventually throws an error saying all train targets are equal.

I have now removed the first column from the array and am now getting a different error: CatBoostError: catboost/libs/data/target.h:315: Attempt to use multidimintional target as one-dimensional
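For context, this second error shows up whenever a two-dimensional label array meets a loss that expects a single target column (CatBoost's multi-label losses are MultiLogloss and MultiCrossEntropy). A rough sketch of the shape check involved; `check_target` is a made-up helper, not part of catboost:

```python
import numpy as np

# Hypothetical guard, not catboost's actual code: a 2-D target is only
# valid when the loss is one of the multi-label losses.
MULTI_LABEL_LOSSES = {"MultiLogloss", "MultiCrossEntropy"}

def check_target(y, loss_function):
    y = np.asarray(y)
    if y.ndim == 2 and y.shape[1] > 1 and loss_function not in MULTI_LABEL_LOSSES:
        raise ValueError(
            f"{loss_function!r} expects a one-dimensional target, got shape {y.shape}"
        )

check_target([[1, 0], [0, 1]], "MultiLogloss")   # passes
# check_target([[1, 0], [0, 1]], "Logloss")      # would raise ValueError
```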

NOTE: My features are "text" features and I'm trying to let catboost handle them

Hi @alitirmizi23!
Multi-label classification with text features is not implemented at the moment.
We need to do more research on how to transform text features to floats in this case.
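Until that research lands, a possible workaround (just a sketch, not an official recipe) is to vectorize the text column yourself into numeric count features before building the Pool, and then train MultiLogloss without text_features at all. A minimal stdlib bag-of-words, where `bag_of_words` is an illustrative helper name:

```python
from collections import Counter

def bag_of_words(texts, vocab=None):
    """Tiny bag-of-words featurizer: one count column per vocabulary word."""
    if vocab is None:
        vocab = sorted({w for t in texts for w in t.lower().split()})
    rows = []
    for t in texts:
        counts = Counter(t.lower().split())
        rows.append([counts.get(w, 0) for w in vocab])
    return rows, vocab

texts = ["red apple", "green apple", "red car"]
X, vocab = bag_of_words(texts)
# vocab -> ['apple', 'car', 'green', 'red']; each row of X holds per-word counts
```

The resulting purely numeric matrix can then be passed to catboost.Pool as ordinary float features alongside a multi-dimensional label.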

> Multi-label classification with text features is not implemented at the moment.

Hello, @Evgueni-Petrov-aka-espetrov!
Could you please tell whether embedding_features are supported in multi-label classification? It seems that I'm getting the same error with them. Are there plans to research this case as well?

Here is a minimal example to reproduce it:

import pandas as pd
import catboost

features_df = pd.DataFrame({
    'f0': [0, 1, 2],
    'embedding1': [[0.0, 1.1, 0.0], [1.0, 2.1, 0.0], [0.11, 0.2, 0.3]]
})
label = [[0, 0], [0, 1], [1, 0]]

pool = catboost.Pool(features_df, label=label, embedding_features=['embedding1'])

model = catboost.CatBoostClassifier(iterations=10, loss_function='MultiLogloss')
model.fit(pool)
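A workaround sketch for this repro (assuming all embeddings have the same fixed length): expand each embedding into plain float columns and drop embedding_features entirely, so the Pool sees only numeric features. `expand_embeddings` is a made-up helper name:

```python
import numpy as np

def expand_embeddings(vectors, prefix="embedding1"):
    """Flatten fixed-length embedding vectors into one scalar column per dimension."""
    arr = np.asarray(vectors, dtype=float)
    names = [f"{prefix}_{k}" for k in range(arr.shape[1])]
    return arr, names

emb = [[0.0, 1.1, 0.0], [1.0, 2.1, 0.0], [0.11, 0.2, 0.3]]
cols, names = expand_embeddings(emb)
# names -> ['embedding1_0', 'embedding1_1', 'embedding1_2']
```

The expanded columns can be concatenated with 'f0' and passed to catboost.Pool with no embedding_features argument, at the cost of losing CatBoost's embedding-aware preprocessing.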

I am also getting the same error: CatBoostError: C:/Program Files (x86)/Go Agent/pipelines/BuildMaster/catboost.git/catboost/libs/metrics/metric.cpp:6376: All train targets are equal

My training dataset contains 9897 rows with 4098 binary input features and a multi-label target Y with 3500 labels.

Thoughts / hint: I suspect the issue happens in the data transformation inside the Pool function; get_weight returns the same weight for all objects.

Same issue here. Followed the tutorial but I'm not able to make it work.
Did you find a workaround @bala2engine ?

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y_bin, test_size=0.2, shuffle=True, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((22567, 13208), (5642, 13208), (22567, 533), (5642, 533))

# Initialize the CatBoost model
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=10, loss_function='MultiLogloss', class_names=mlb.classes_, used_ram_limit='5gb')

# Wrap the split data in Pools (this step was missing from the snippet)
train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

# Fit the model on the transformed training data
model.fit(train_pool, eval_set=[train_pool, test_pool], metric_period=10, plot=True)

CatBoostError: C:/Program Files (x86)/Go Agent/pipelines/BuildMaster/catboost.git/catboost/libs/metrics/metric.cpp:6376: All train targets are equal

@Karlheinzniebuhr @mikrut @bala2engine
Because of the current implementation, multi-label classification does not work with text or embedding features -- @andrey-khropov may provide more details.

@Karlheinzniebuhr
The exception All train targets are equal means that some label column is useless for training because it contains only a single class -- just remove such columns from the training dataset.

I agree that this is strange, but the following code does print the text below.

from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=22567, n_features=20, n_classes=533, random_state=0)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
for i in range(533):
  if min(Y_train[:, i]) == max(Y_train[:, i]):
    print('label ' + str(i) + ' is single class ' + str(Y_train[0, i]))

Output:

label 99 is single class 0
label 524 is single class 0
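Once such columns are identified, they can be filtered out before training, as suggested above. A small numpy sketch; `drop_constant_label_columns` is an illustrative helper name:

```python
import numpy as np

def drop_constant_label_columns(Y):
    """Keep only label columns that contain more than one class."""
    Y = np.asarray(Y)
    keep = np.array([Y[:, j].min() != Y[:, j].max() for j in range(Y.shape[1])])
    return Y[:, keep], np.flatnonzero(keep)

Y = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 0]])
Y_filtered, kept = drop_constant_label_columns(Y)
# column 0 is all ones and column 2 all zeros, so only column 1 survives
```

The returned `kept` indices let you map predictions on the filtered target back to the original label positions.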