catboost/catboost

Text feature processing allocates too much memory (25 GB for 5M rows of values like '0 1 1', '1 0 0 1', etc.)


Problem: I have a dictionary of 2 words ('0' and '1') and only one (text) feature produced with this dictionary. CatBoost allocates 25 GB of RAM on .fit(). Why?

code:

import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool

n = 5000000
# One text feature: each row is a space-separated string of 0 to 9 random '0'/'1' tokens.
X = pd.DataFrame((' '.join(str(np.random.randint(2)) for _ in range(np.random.randint(10))) for _ in range(n)),
                 columns=['tf'])

Y = pd.Series(np.random.randint(2, size=n))

# 70/30 train/validation split
for_train = int(n * 0.7)
X_train, Y_train, X_val, Y_val = X[:for_train], Y[:for_train], X[for_train:], Y[for_train:]

# Word-level dictionary 'W' and a single bag-of-words (BoW) calcer over it
model = CatBoostClassifier(iterations=2000, learning_rate=0.01,
                           depth=6,
                           max_ctr_complexity=2,
                           loss_function='Logloss',
                           task_type='GPU',
                           devices=[0],
                           dictionaries=['W:min_token_occurrence=5,max_dict_size=50000,token_level_type=Word'],
                           text_processing=['BoW+W'])

model.fit(X_train, Y_train, text_features=[0],
          eval_set=Pool(X_val, Y_val, text_features=[0]),
          early_stopping_rounds=20)
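
For reference (not part of the original report), a small psutil-based check after fit() makes the reported ~25 GB growth easy to confirm; psutil is an extra dependency and the snippet below is only a sketch.

# Hypothetical memory check, not in the original report: print the process's
# resident set size after training to confirm the allocation the issue describes.
import os
import psutil

rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
print(f"Resident memory after fit(): {rss_gib:.1f} GiB")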

catboost version: 0.20
Operating System: Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-70-generic x86_64)
CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz

GPU: GeForce GTX 1070

Thanks a lot for the issue; we are working on a fix now.

The fix is now in the code and will be out in the next release. Thank you very much for pointing this out!

Using the code above on Colab now results in an error:

CatBoostError Traceback (most recent call last)
in ()
7 ],
8 text_processing = [
----> 9 'NaiveBayes+Word|BoW+Word,BiGram|BM25+Word'
10 ],
11 )

3 frames
in fit_model(X_train, y_train, X_test, y_test, **kwargs)
23 train_pool,
24 eval_set=validation_pool,
---> 25 verbose=100,
26 )

/usr/local/lib/python3.6/dist-packages/catboost/core.py in fit(self, X, y, cat_features, text_features, embedding_features, sample_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
4296 self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
4297 eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period,
-> 4298 silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
4299 return self
4300

/usr/local/lib/python3.6/dist-packages/catboost/core.py in _fit(self, X, y, cat_features, text_features, embedding_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
1795 use_best_model, eval_set, verbose, logging_level, plot,
1796 column_description, verbose_eval, metric_period, silent, early_stopping_rounds,
-> 1797 save_snapshot, snapshot_file, snapshot_interval, init_model
1798 )
1799 params = train_params["params"]

/usr/local/lib/python3.6/dist-packages/catboost/core.py in _prepare_train_params(self, X, y, cat_features, text_features, embedding_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
1722 _check_param_types(params)
1723 params = _params_type_cast(params)
-> 1724 _check_train_params(params)
1725
1726 eval_set_list = eval_set if isinstance(eval_set, list) else [eval_set]

_catboost.pyx in _catboost._check_train_params()

_catboost.pyx in _catboost._check_train_params()

CatBoostError: catboost/private/libs/options/text_processing_options.cpp:356: You should provide either text_processing option or tokenizers, dictionaries, feature_calcers options.

Any idea why?

The CatBoost team changed the syntax a bit while working on this issue. As the error says, you should now provide either the combined text_processing option or the separate tokenizers, dictionaries and feature_calcers options, not both at once.
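
As a rough sketch only (the exact option strings depend on the CatBoost version; the calcer names below come from the traceback snippet and the dictionary ids and options are illustrative), one way to satisfy that check is to keep the separate options and drop the combined text_processing string:

# Sketch, not a verified configuration: pass only dictionaries/feature_calcers
# and omit text_processing, so the check in text_processing_options.cpp sees a single style.
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=2000,
    loss_function='Logloss',
    task_type='GPU',
    dictionaries=['Word:min_token_occurrence=5,max_dict_size=50000,token_level_type=Word',
                  'BiGram:gram_order=2'],           # dictionary ids and options are illustrative
    feature_calcers=['BoW', 'NaiveBayes', 'BM25'],  # calcer names taken from the traceback snippet
)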