[BUG] Fitting XGBoost killed by OOM error
bilzard opened this issue · 4 comments
Describe the bug
Training an XGBoost model on a dataset larger than system memory is killed by an OOM error.
Error message
MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/project/include/rmm/mr/device/pool_memory_resource.hpp:193: Maximum pool size exceeded
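For reference, the same "Maximum pool size exceeded" failure can be reproduced in isolation with a capped RMM pool, similar to the pool that dask-cuda sets up via rmm_pool_size. This is a minimal sketch using the rmm Python package; it is not part of my training script, and the pool sizes are arbitrary:

import rmm

# Create a pool allocator capped at 2 GiB and make it the current device resource,
# roughly mirroring what LocalCUDACluster(rmm_pool_size=...) does on each worker.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=2 << 30,
    maximum_pool_size=2 << 30,
)
rmm.mr.set_current_device_resource(pool)

# Any allocation that would push the pool past its cap fails with
# "MemoryError: std::bad_alloc: out_of_memory: RMM failure ... Maximum pool size exceeded"
buf = rmm.DeviceBuffer(size=3 << 30)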
Steps/Code to reproduce bug
- Dataset: 4,289,429 + 4,304,610 = 8,594,039 rows
- Columns: 112 entries
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4289429 entries, 0 to 4289428
Columns: 112 entries, type to n_i_h23_s
dtypes: float32(107), int32(3), int8(1), uint8(1)
memory usage: 1.8 GB
Code sample:
# Imports (assumed; omitted in the original snippet)
import warnings

import numba
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from merlin.core.utils import device_mem_size, pynvml_mem_size
from merlin.io import Dataset
from merlin.models.xgb import XGBoost
from nvtabular import Workflow

if numba.cuda.is_available():
    NUM_GPUS = list(range(len(numba.cuda.gpus)))
else:
    NUM_GPUS = []
visible_devices = ",".join(
    [str(n) for n in NUM_GPUS]
)  # Select devices to place workers

device_limit_frac = 0.5  # Spill GPU-worker memory to host at this limit.
device_pool_frac = 0.6
part_mem_frac = 0.01

# Use total device size to calculate the device limit, RMM pool size, and partition size
device_size = device_mem_size(kind="total")
device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
part_size = int(part_mem_frac * device_size)

# Check if any device memory is already occupied
for dev in visible_devices.split(","):
    fmem = pynvml_mem_size(kind="free", index=int(dev))
    used = (device_size - fmem) / 1e9
    if used > 1.0:
        warnings.warn(f"BEWARE - {used} GB is already occupied on device {int(dev)}!")

train_ds = Dataset(
    [f"train_data/{i}.parquet" for i in range(2)], engine="parquet", part_size="128MB"
)
valid_ds = Dataset("valid_data/0.parquet", engine="parquet", part_size="128MB")

# feature_cols, target, and qid_column are column selectors defined elsewhere in the notebook
wf = Workflow(feature_cols + target + qid_column)
train_processed = wf.fit_transform(train_ds)
valid_processed = wf.transform(valid_ds)

ranker = XGBoost(
    train_processed.schema,
    objective="rank:pairwise",
    eval_metric="ndcg@20",
    booster="dart",
)
xgb_train_params = {
    "num_boost_round": 1,
}

with LocalCUDACluster(
    n_workers=1,  # Number of GPU workers
    device_memory_limit=device_limit,  # GPU->CPU spill threshold (device_limit_frac of total GPU memory)
    rmm_pool_size=(device_pool_size // 256) * 256,  # RMM memory pool size on each worker
    local_directory="/nvme/scratch/",  # Fast directory for disk spilling
) as cluster:
    client = Client(cluster)
    ranker.fit(
        train_processed,
        evals=[
            (valid_processed, "validation_set"),
        ],
        **xgb_train_params,
    )
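For a 24 GB GPU the fractions above work out to roughly device_limit ≈ 0.5 × 24 GB = 12 GB (spill threshold) and device_pool_size ≈ 0.6 × 24 GB ≈ 14.4 GB (RMM pool), while part_size ≈ 0.01 × 24 GB ≈ 0.24 GB; note that the Dataset calls pass part_size="128MB" explicitly, so the computed part_size is not actually used.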
Expected behavior
Training completes without the Dask workers being killed by an OOM error (data exceeding GPU memory should be spilled to host memory and disk instead).
Environment details (please complete the following information):
- Environment location: Docker
- Method of NVTabular install: Docker image nvcr.io/nvidia/merlin/merlin-pytorch:22.12
- GPU memory: 24GB
- CPU memory: 64GB
Additional context
I followed an official tutorial and tried both of the following, neither of which solved the issue:
- set device_memory_limit to 50% of the actual GPU memory
- set row_group_size when creating the input parquet files (a sketch for verifying the row groups follows this list):
train.to_parquet(
    train_root / f"{i}.parquet",
    index=False,
    engine="pyarrow",
    row_group_size=100_000,
)
- If I load just one parquet file, training is not killed by the OOM error:
train_ds = Dataset(
    [f"train_data/{i}.parquet" for i in range(1)], engine="parquet", part_size="128MB"
)
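As a sanity check for the row_group_size workaround, I would verify that the rewritten parquet files really contain the smaller row groups. A minimal sketch using pyarrow (the file path is assumed to match the layout above):

import pyarrow.parquet as pq

# Inspect the row-group layout of the first training file;
# with row_group_size=100_000 each group should hold at most ~100k rows.
md = pq.ParquetFile("train_data/0.parquet").metadata
print(md.num_row_groups)
print(md.row_group(0).num_rows)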
@bilzard Which docker image are you using? You posted cudf == 22.12.0,
but our latest docker images (e.g. merlin-tensorflow:22.12) should have cudf version 22.08.
Please check out the support matrix for the release versions: https://nvidia-merlin.github.io/Merlin/main/support_matrix/support_matrix_merlin_tensorflow.html
@rnyak Thank you for sharing the link. I re-ran my script on both of these Docker images, and the problem reproduced on both:
- nvcr.io/nvidia/merlin/merlin-pytorch:22.11
- nvcr.io/nvidia/merlin/merlin-pytorch:22.12
Error log (merlin-pytorch:22.12):
2023-01-20 09:41:37,916 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-20 09:41:37,916 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.USER_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.USER: 'user'>, <Tags.ID: 'id'>].
warnings.warn(
/usr/local/lib/python3.8/dist-packages/xgboost/dask.py:884: RuntimeWarning: coroutine 'Client._wait_for_workers' was never awaited
client.wait_for_workers(n_workers)
/usr/local/lib/python3.8/dist-packages/distributed/worker_state_machine.py:3468: FutureWarning: The `Worker.nthreads` attribute has been moved to `Worker.state.nthreads`
warnings.warn(
[09:42:08] task [xgboost.dask-0]:tcp://127.0.0.1:45449 got new rank 0
2023-01-20 09:42:14,881 - distributed.worker - WARNING - Compute Failed
Key: dispatched_train-3fcc8297-d94c-4735-bde7-d827a023737c
Function: dispatched_train
args: ({'eval_metric': 'ndcg@20', 'objective': 'rank:pairwise'}, [b'DMLC_NUM_WORKER=1', b'DMLC_TRACKER_URI=172.18.0.2', b'DMLC_TRACKER_PORT=48471', b'DMLC_TASK_ID=[xgboost.dask-0]:tcp://127.0.0.1:45449'], 140281164174192, ['validation_set'], [140281164173904], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': None, 'enable_categorical': False, 'parts': [{'data': type priority wgt ... n_i_h22_s n_i_h23_s session
8 1 11 0.037885 ... 1.000000e-09 1.000000e-09 78
17 1 6 0.054992 ... 1.000000e-09 1.000000e-09 78
5 1 16 0.032280 ... 1.000000e-09 1.000000e-09 78
19 1 14 0.035469 ... 1.000000e-09 1.000000e-09 78
4 1 13 0.036726 ... 1.000000e-09 1.000000e-09 78
... ... ... ... ... ... ... ...
74342 2 116 0.016393 ... 7.692308e-11 7.692308e-11 85679
74306 2
kwargs: {}
Exception: "XGBoostError('[09:42:09] ../src/c_api/../data/../common/device_helpers.cuh:428: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory\\n- Free memory: 1238040576\\n- Requested memory: 4589462176\\n\\nStack trace:\\n [bt] (0) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a5799) [0x7f2b94c4f799]\\n [bt] (1) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a9bab) [0x7f2b94c53bab]\\n [bt] (2) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x8dc49) [0x7f2b94937c49]\\n [bt] (3) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3edbeb) [0x7f2b94c97beb]\\n [bt] (4) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee759) [0x7f2b94c98759]\\n [bt] (5) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee993) [0x7f2b94c98993]\\n [bt] (6) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45b7f9) [0x7f2b94d057f9]\\n [bt] (7) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45bec0) [0x7f2b94d05ec0]\\n [bt] (8) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x421b40) [0x7f2b94ccbb40]\\n\\n')"
---------------------------------------------------------------------------
XGBoostError Traceback (most recent call last)
Cell In[7], line 8
1 with LocalCUDACluster(
2 n_workers=1, # Number of GPU workers
3 device_memory_limit=device_limit, # GPU->CPU spill threshold (~75% of GPU memory)
4 rmm_pool_size=(device_pool_size // 256) * 256, # Memory pool size on each worker
5 local_directory="/nvme/scratch/", # Fast directory for disk spilling
6 ) as cluster:
7 client = Client(cluster)
----> 8 ranker.fit(
9 train_processed,
10 evals=[
11 (valid_processed, "validation_set"),
12 ],
13 **xgb_train_params,
14 )
File /usr/local/lib/python3.8/dist-packages/merlin/models/xgb/__init__.py:178, in XGBoost.fit(self, train, evals, use_quantile, **train_kwargs)
175 d_eval = dmatrix_cls(self.dask_client, X, label=y, qid=qid)
176 watchlist.append((d_eval, name))
--> 178 train_res = xgb.dask.train(
179 self.dask_client, self.params, dtrain, evals=watchlist, **train_kwargs
180 )
181 self.booster: xgb.Booster = train_res["booster"]
182 self.evals_result = train_res["history"]
File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:575, in _deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
573 for k, arg in zip(sig.parameters, args):
574 kwargs[k] = arg
--> 575 return f(**kwargs)
File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:1072, in train(client, params, dtrain, num_boost_round, evals, obj, feval, early_stopping_rounds, xgb_model, verbose_eval, callbacks, custom_metric)
1070 client = _xgb_get_client(client)
1071 args = locals()
-> 1072 return client.sync(
1073 _train_async,
1074 global_config=config.get_config(),
1075 dconfig=_get_dask_config(),
1076 **args,
1077 )
File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:338, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
336 return future
337 else:
--> 338 return sync(
339 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
340 )
File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:405, in sync(loop, func, callback_timeout, *args, **kwargs)
403 if error:
404 typ, exc, tb = error
--> 405 raise exc.with_traceback(tb)
406 else:
407 return result
File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:378, in sync.<locals>.f()
376 future = asyncio.wait_for(future, callback_timeout)
377 future = asyncio.ensure_future(future)
--> 378 result = yield future
379 except Exception:
380 error = sys.exc_info()
File /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762, in Runner.run(self)
759 exc_info = None
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:1010, in _train_async(client, global_config, dconfig, params, dtrain, num_boost_round, evals, obj, feval, early_stopping_rounds, verbose_eval, xgb_model, callbacks, custom_metric)
1007 evals_name = []
1008 evals_id = []
-> 1010 results = await map_worker_partitions(
1011 client,
1012 dispatched_train,
1013 params,
1014 _rabit_args,
1015 id(dtrain),
1016 evals_name,
1017 evals_id,
1018 *([dtrain] + evals_data),
1019 workers=workers,
1020 )
1021 return list(filter(lambda ret: ret is not None, results))[0]
File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:570, in map_worker_partitions(client, func, workers, *refs)
566 fut = client.submit(
567 func, *args, pure=False, workers=[addr], allow_other_workers=False
568 )
569 futures.append(fut)
--> 570 results = await client.gather(futures)
571 return results
File /usr/local/lib/python3.8/dist-packages/distributed/client.py:2038, in Client._gather(self, futures, errors, direct, local_worker)
2036 exc = CancelledError(key)
2037 else:
-> 2038 raise exception.with_traceback(traceback)
2039 raise exc
2040 if errors == "skip":
File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:974, in dispatched_train()
972 evals.append((Xy, evals_name[i]))
973 continue
--> 974 eval_Xy = _dmatrix_from_list_of_parts(**ref, nthread=n_threads)
975 evals.append((eval_Xy, evals_name[i]))
977 booster = worker_train(
978 params=local_param,
979 dtrain=Xy,
(...)
989 callbacks=callbacks,
990 )
File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:845, in _dmatrix_from_list_of_parts()
843 if is_quantile:
844 return _create_device_quantile_dmatrix(**kwargs)
--> 845 return _create_dmatrix(**kwargs)
File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:828, in _create_dmatrix()
825 v = concat_or_none(value)
826 concated_dict[key] = v
--> 828 dmatrix = DMatrix(
829 **concated_dict,
830 missing=missing,
831 feature_names=feature_names,
832 feature_types=feature_types,
833 nthread=nthread,
834 enable_categorical=enable_categorical,
835 feature_weights=feature_weights,
836 )
837 return dmatrix
File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:575, in inner_f()
573 for k, arg in zip(sig.parameters, args):
574 kwargs[k] = arg
--> 575 return f(**kwargs)
File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:686, in __init__()
683 assert self.handle is not None
684 return
--> 686 handle, feature_names, feature_types = dispatch_data_backend(
687 data,
688 missing=self.missing,
689 threads=self.nthread,
690 feature_names=feature_names,
691 feature_types=feature_types,
692 enable_categorical=enable_categorical,
693 )
694 assert handle is not None
695 self.handle = handle
File /usr/local/lib/python3.8/dist-packages/xgboost/data.py:896, in dispatch_data_backend()
892 return _from_pandas_series(
893 data, missing, threads, enable_categorical, feature_names, feature_types
894 )
895 if _is_cudf_df(data) or _is_cudf_ser(data):
--> 896 return _from_cudf_df(
897 data, missing, threads, feature_names, feature_types, enable_categorical
898 )
899 if _is_cupy_array(data):
900 return _from_cupy_array(data, missing, threads, feature_names,
901 feature_types)
File /usr/local/lib/python3.8/dist-packages/xgboost/data.py:691, in _from_cudf_df()
689 handle = ctypes.c_void_p()
690 config = bytes(json.dumps({"missing": missing, "nthread": nthread}), "utf-8")
--> 691 _check_call(
692 _LIB.XGDMatrixCreateFromCudaColumnar(
693 interfaces_str,
694 config,
695 ctypes.byref(handle),
696 )
697 )
698 return handle, feature_names, feature_types
File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:246, in _check_call()
235 """Check the return value of C API call
236
237 This function will raise exception when error occurs.
(...)
243 return value from API calls
244 """
245 if ret != 0:
--> 246 raise XGBoostError(py_str(_LIB.XGBGetLastError()))
XGBoostError: [09:42:09] ../src/c_api/../data/../common/device_helpers.cuh:428: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
- Free memory: 1238040576
- Requested memory: 4589462176
Stack trace:
[bt] (0) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a5799) [0x7f2b94c4f799]
[bt] (1) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a9bab) [0x7f2b94c53bab]
[bt] (2) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x8dc49) [0x7f2b94937c49]
[bt] (3) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3edbeb) [0x7f2b94c97beb]
[bt] (4) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee759) [0x7f2b94c98759]
[bt] (5) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee993) [0x7f2b94c98993]
[bt] (6) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45b7f9) [0x7f2b94d057f9]
[bt] (7) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45bec0) [0x7f2b94d05ec0]
[bt] (8) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x421b40) [0x7f2b94ccbb40]
2023-01-20 09:42:45,541 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
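For what it's worth, the numbers in the log line up with the failure mode: the worker requests 4,589,462,176 bytes (≈ 4.3 GiB) to build the validation DMatrix while only 1,238,040,576 bytes (≈ 1.2 GiB) of device memory are still free, presumably because most of the GPU is already held by the RMM pool and the training data.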
I believe this issue should be filed against the repository below instead, so I am closing this issue here:
https://github.com/NVIDIA-Merlin/models