catboost/catboost

Text feature processing allocates too much memory (25 GB for 5M rows of values like '0 1 1', '1 0 0 1', etc.)


Problem: I have a dictionary of 2 words ('0' and '1') and only one (text) feature produced with this dictionary. CatBoost allocates 25 GB of RAM on .fit(). Why?

code:

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool

n = 5000000
# One text feature: each row is a space-separated string of 0 to 9 random '0'/'1' tokens.
X = pd.DataFrame((' '.join(str(np.random.randint(2)) for _ in range(np.random.randint(10))) for _ in range(n)),
                 columns=['tf'])

Y = pd.Series(np.random.randint(2, size=n))

# 70/30 train/validation split
for_train = int(n * 0.7)
X_train, Y_train, X_val, Y_val = X[:for_train], Y[:for_train], X[for_train:], Y[for_train:]

# Word-level dictionary 'W' and a single bag-of-words (BoW) calcer over it
model = CatBoostClassifier(iterations=2000, learning_rate=0.01,
                           depth=6,
                           max_ctr_complexity=2,
                           loss_function='Logloss',
                           task_type='GPU',
                           devices=[0],
                           dictionaries=['W:min_token_occurrence=5,max_dict_size=50000,token_level_type=Word'],
                           text_processing=['BoW+W'])

model.fit(X_train, Y_train, text_features=[0],
          eval_set=Pool(X_val, Y_val, text_features=[0]),
          early_stopping_rounds=20)
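
For reference (not part of the original report), a small psutil-based check after fit() makes the reported ~25 GB growth easy to confirm; psutil is an extra dependency and the snippet below is only a sketch.

# Hypothetical memory check, not in the original report: print the process's
# resident set size after training to confirm the allocation the issue describes.
import os
import psutil

rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
print(f"Resident memory after fit(): {rss_gib:.1f} GiB")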

catboost version: 0.20
Operating System: Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-70-generic x86_64)
CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz

GPU: GeForce GTX 1070

Thanks a lot for the issue; we are working on a fix now.

The fix is now in the code and will be out in the next release. Thank you very much for pointing this out!

Using the code above on Colab now results in an error:

CatBoostError Traceback (most recent call last)
in ()
7 ],
8 text_processing = [
----> 9 'NaiveBayes+Word|BoW+Word,BiGram|BM25+Word'
10 ],
11 )

3 frames
in fit_model(X_train, y_train, X_test, y_test, **kwargs)
23 train_pool,
24 eval_set=validation_pool,
---> 25 verbose=100,
26 )

/usr/local/lib/python3.6/dist-packages/catboost/core.py in fit(self, X, y, cat_features, text_features, embedding_features, sample_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
4296 self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
4297 eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period,
-> 4298 silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
4299 return self
4300

/usr/local/lib/python3.6/dist-packages/catboost/core.py in _fit(self, X, y, cat_features, text_features, embedding_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
1795 use_best_model, eval_set, verbose, logging_level, plot,
1796 column_description, verbose_eval, metric_period, silent, early_stopping_rounds,
-> 1797 save_snapshot, snapshot_file, snapshot_interval, init_model
1798 )
1799 params = train_params["params"]

/usr/local/lib/python3.6/dist-packages/catboost/core.py in _prepare_train_params(self, X, y, cat_features, text_features, embedding_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
1722 _check_param_types(params)
1723 params = _params_type_cast(params)
-> 1724 _check_train_params(params)
1725
1726 eval_set_list = eval_set if isinstance(eval_set, list) else [eval_set]

_catboost.pyx in _catboost._check_train_params()

_catboost.pyx in _catboost._check_train_params()

CatBoostError: catboost/private/libs/options/text_processing_options.cpp:356: You should provide either text_processing option or tokenizers, dictionaries, feature_calcers options.

Any idea why?

The CatBoost team changed the syntax a bit while working on this issue. As the error says, you should now provide either the combined text_processing option or the separate tokenizers, dictionaries and feature_calcers options, not both at once.
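
As a rough sketch only (the exact option strings depend on the CatBoost version; the calcer names below come from the traceback snippet and the dictionary ids and options are illustrative), one way to satisfy that check is to keep the separate options and drop the combined text_processing string:

# Sketch, not a verified configuration: pass only dictionaries/feature_calcers
# and omit text_processing, so the check in text_processing_options.cpp sees a single style.
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=2000,
    loss_function='Logloss',
    task_type='GPU',
    dictionaries=['Word:min_token_occurrence=5,max_dict_size=50000,token_level_type=Word',
                  'BiGram:gram_order=2'],           # dictionary ids and options are illustrative
    feature_calcers=['BoW', 'NaiveBayes', 'BM25'],  # calcer names taken from the traceback snippet
)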