Fitting with text features uses too much memory (~35GB for 5M examples)
jonasmaertens commented
Problem: When fitting a MultiClass CatBoostClassifier on a relatively large dataset with text features, memory usage is extremely high (about 35 GB for 5M examples with a dictionary size of 100).
#1107 describes the same problem, but it is supposed to have been fixed.
catboost version: 1.2
Operating System: macOS Sonoma 14.3
CPU: M1 Max
Code to reproduce:
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool

# Generate 5M synthetic "sentences" of 20-30 random tokens drawn from a
# vocabulary of 200 words, plus random labels from 100 classes.
n = 5_000_000
X = pd.DataFrame(
    (' '.join(str(np.random.randint(200)) for _ in range(np.random.randint(20, 30)))
     for _ in range(n)),
    columns=['tf'])
Y = pd.Series(np.random.randint(100, size=n))
print(X.head())
print(Y.head())

# 70/30 train/validation split
for_train = int(n * 0.7)
X_train, Y_train, X_val, Y_val = X[:for_train], Y[:for_train], X[for_train:], Y[for_train:]

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.01,
    depth=6,
    max_ctr_complexity=2,
    loss_function='MultiClass',
    task_type='CPU',
    devices=[0],
    # Single space tokenizer, a tiny unigram dictionary (100 tokens),
    # and plain bag-of-words feature calcers.
    text_processing={
        "tokenizers": [
            {'tokenizer_id': 'Space', 'separator_type': 'ByDelimiter', 'delimiter': ' ',
             'token_types': ['Word']},
        ],
        "dictionaries": [
            {'dictionary_id': 'Unigram', 'max_dictionary_size': '100',
             'token_level_type': 'Word',
             'gram_order': '1', 'occurrence_lower_bound': '50', 'skip_unknown': 'true',
             'gram_count': '1'}
        ],
        "feature_processing": {
            "default": [
                {'dictionaries_names': ['Unigram'], 'feature_calcers': ['BoW'],
                 'tokenizers_names': ['Space']}
            ]
        }
    })

model.fit(X_train, Y_train, text_features=[0],
          eval_set=Pool(X_val, Y_val, text_features=[0]),
          early_stopping_rounds=20)
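For reference, one minimal way to confirm the peak memory figure is to read the process's maximum resident set size via the standard-library resource module after fit() returns. This is a sketch, not part of the original report; the helper peak_rss_gb is hypothetical, and it assumes a Unix-like system (ru_maxrss is reported in bytes on macOS and in kilobytes on Linux):

import resource
import sys

def peak_rss_gb():
    # Peak resident set size of the current process, converted to GB.
    # macOS reports ru_maxrss in bytes, Linux in kilobytes.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 ** 3 if sys.platform == 'darwin' else 1024 ** 2
    return rss / divisor

# ... after model.fit(...) has returned:
print(f"peak RSS: {peak_rss_gb():.1f} GB")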