catboost/catboost

Fitting with text features uses too much memory (~35GB for 5M examples)

jonasmaertens opened this issue · 0 comments

Problem: When fitting a MultiClass CatBoostClassifier on relatively large datasets with text features, memory usage is extremely high (about 35 GB for 5M examples with a dictionary size of 100).
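For scale, here is a rough back-of-envelope sketch (assuming the BoW calcer materializes one float32 value per dictionary token per example; the actual internal layout may differ): the feature matrix alone should be well under 35 GB.

```python
# Hypothetical estimate: one float32 per (example, dictionary token) pair.
n_examples = 5_000_000
dict_size = 100
bytes_per_value = 4  # float32

expected_gb = n_examples * dict_size * bytes_per_value / 1e9
print(f"expected BoW matrix: ~{expected_gb:.1f} GB")  # ~2 GB vs. ~35 GB observed
```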

#1107 describes the same problem, but it should already have been fixed.

catboost version: 1.2
Operating System: macOS Sonoma 14.3
CPU: M1 Max

Code to reproduce:

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool

n = 5_000_000
# One text column: each row is 20-29 space-separated integer tokens drawn from 0-199.
X = pd.DataFrame((' '.join(str(np.random.randint(200)) for _ in range(np.random.randint(20, 30))) for _ in range(n)),
                 columns=['tf'])

# Random labels for 100 classes.
Y = pd.Series(np.random.randint(100, size=n))

print(X.head())
print(Y.head())

# 70/30 train/validation split.
for_train = int(n * 0.7)
X_train, Y_train, X_val, Y_val = X[:for_train], Y[:for_train], X[for_train:], Y[for_train:]

# MultiClass model with a single BoW text feature over a 100-token dictionary.
model = CatBoostClassifier(iterations=2000, learning_rate=0.01,
                           depth=6,
                           max_ctr_complexity=2,
                           loss_function='MultiClass',
                           task_type='CPU',
                           devices=[0],
                           text_processing={
                               "tokenizers": [
                                   {'tokenizer_id': 'Space', 'separator_type': 'ByDelimiter', 'delimiter': ' ',
                                    'token_types': ['Word']},
                               ],
                               "dictionaries": [
                                   {'dictionary_id': 'Unigram', 'max_dictionary_size': '100',
                                    'token_level_type': 'Word',
                                    'gram_order': '1', 'occurrence_lower_bound': '50', 'skip_unknown': 'true',
                                    'gram_count': '1'}
                               ],
                               "feature_processing": {
                                   "default": [
                                       {'dictionaries_names': ['Unigram'], 'feature_calcers': ['BoW'],
                                        'tokenizers_names': ['Space']}
                                   ]}
                           })

model.fit(X_train, Y_train, text_features=[0],
          eval_set=Pool(X_val, Y_val, text_features=[0]),
          early_stopping_rounds=20)
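The ~35 GB figure was read from the OS process monitor. For anyone reproducing this, a stdlib sketch for logging peak resident memory around the `fit` call (assuming a Unix-like system; note `ru_maxrss` is reported in bytes on macOS but kilobytes on Linux):

```python
import resource
import sys

def peak_rss_bytes() -> int:
    """Return the process's peak resident set size, normalized to bytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss if sys.platform == "darwin" else rss * 1024

# Call before and after model.fit(...) to see the high-water mark.
before = peak_rss_bytes()
# model.fit(...)  # the fit call from the reproduction above
after = peak_rss_bytes()
print(f"peak RSS: {after / 1e9:.2f} GB")
```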