NVIDIA-Merlin/NVTabular

[BUG] fitting XGBoost killed by OOM error

bilzard opened this issue · 4 comments

Describe the bug

Training an XGBoost model on a dataset larger than system memory is killed by an OOM error.

Error message

MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/project/include/rmm/mr/device/pool_memory_resource.hpp:193: Maximum pool size exceeded
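This error comes from RMM's pool allocator: the worker's RMM memory pool enforces a maximum size, and an allocation that cannot be satisfied within it fails with this std::bad_alloc. A minimal standalone sketch of the same failure mode, assuming only the rmm Python package (the sizes are illustrative, not taken from the run above):

import rmm

# Configure a pool allocator with an explicit upper bound.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=2 * 1024**3,  # 2 GiB
    maximum_pool_size=4 * 1024**3,  # 4 GiB hard cap
)
rmm.mr.set_current_device_resource(pool)

# Requesting more than the cap raises MemoryError: "Maximum pool size exceeded".
buf = rmm.DeviceBuffer(size=8 * 1024**3)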

Steps/Code to reproduce bug

  • Dataset: 4,289,429 + 4,304,610 = 8,594,039 rows
  • Columns: 112 entries
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4289429 entries, 0 to 4289428
Columns: 112 entries, type to n_i_h23_s
dtypes: float32(107), int32(3), int8(1), uint8(1)
memory usage: 1.8 GB
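As a rough back-of-envelope check (my arithmetic, assuming the dtypes above and ignoring index and allocator overhead), the two training files alone come to a few GB on device, before XGBoost builds its own DMatrix copy on top of the concatenated frames:

# Rough device-memory estimate from the numbers above (estimate only).
rows = 4_289_429 + 4_304_610             # rows across the two training parquet files
bytes_per_row = 107 * 4 + 3 * 4 + 1 + 1  # float32, int32, int8, uint8 columns
print(rows * bytes_per_row / 1e9)        # ~3.8 GB for the concatenated training frame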

Code sample:

import warnings

import numba
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from merlin.core.utils import device_mem_size, pynvml_mem_size
from merlin.io import Dataset
from merlin.models.xgb import XGBoost
from nvtabular import Workflow

if numba.cuda.is_available():
    NUM_GPUS = list(range(len(numba.cuda.gpus)))
else:
    NUM_GPUS = []
visible_devices = ",".join(
    [str(n) for n in NUM_GPUS]
)  # Select devices to place workers
device_limit_frac = 0.5  # Spill GPU-Worker memory to host at this limit.
device_pool_frac = 0.6
part_mem_frac = 0.01

# Use total device size to calculate args.device_limit_frac
device_size = device_mem_size(kind="total")
device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
part_size = int(part_mem_frac * device_size)

# Check if any device memory is already occupied
for dev in visible_devices.split(","):
    fmem = pynvml_mem_size(kind="free", index=int(dev))
    used = (device_size - fmem) / 1e9
    if used > 1.0:
        warnings.warn(f"BEWARE - {used} GB is already occupied on device {int(dev)}!")

train_ds = Dataset(
    [f"train_data/{i}.parquet" for i in range(2)], engine="parquet", part_size="128MB"
)
valid_ds = Dataset("valid_data/0.parquet", engine="parquet", part_size="128MB")

# feature_cols, target and qid_column are column lists defined elsewhere in the notebook
wf = Workflow(feature_cols + target + qid_column)
train_processed = wf.fit_transform(train_ds)
valid_processed = wf.transform(valid_ds)

ranker = XGBoost(
    train_processed.schema,
    objective="rank:pairwise",
    eval_metric="ndcg@20",
    booster="dart",
)
xgb_train_params = {
    "num_boost_round": 1,
}

with LocalCUDACluster(
    n_workers=1,  # Number of GPU workers
    device_memory_limit=device_limit,  # GPU->CPU spill threshold (device_limit_frac of GPU memory)
    rmm_pool_size=(device_pool_size // 256) * 256,  # Memory pool size on each worker
    local_directory="/nvme/scratch/",  # Fast directory for disk spilling
) as cluster:
    client = Client(cluster)
    ranker.fit(
        train_processed,
        evals=[
            (valid_processed, "validation_set"),
        ],
        **xgb_train_params,
    )
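For reference, on the 24 GB GPU in the environment below, these fractions work out roughly as follows (simple arithmetic, assuming device_mem_size(kind="total") reports about 24 GiB):

# Approximate values of the derived settings on a ~24 GiB device (estimate only).
device_size = 24 * 1024**3
print(int(0.5 * device_size) / 1e9)   # device_limit:     ~12.9 GB spill threshold per worker
print(int(0.6 * device_size) / 1e9)   # device_pool_size: ~15.5 GB RMM pool per worker
print(int(0.01 * device_size) / 1e9)  # part_size:        ~0.26 GB (the Dataset calls above pass "128MB" explicitly)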

Expected behavior

Training completes without the Dask client being killed by an OOM error.

Environment details (please complete the following information):

  • Environment location: Docker
  • Method of NVTabular install: nvcr.io/nvidia/merlin/merlin-pytorch:22.12
  • GPU memory: 24GB
  • CPU memory: 64GB

Additional context
I followed an official tutorial [1] and tried the two changes below, neither of which solved the issue.

  • set device_memory_limit to 50% of the actual GPU memory
  • set row_group_size when creating the input parquet files:
    train.to_parquet(
        train_root / f"{i}.parquet",
        index=False,
        engine="pyarrow",
        row_group_size=100_000,
    )
  • If I load just one parquet file, the process is not killed by an OOM error (a quick partition check is sketched after this list):
train_ds = Dataset(
    [f"train_data/{i}.parquet" for i in range(1)], engine="parquet", part_size="128MB"
)
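A quick way to inspect how the two-file dataset is actually partitioned (diagnostic sketch only, reusing the train_ds object created above):

# Diagnostic: inspect the Dask partitioning of the training Dataset.
ddf = train_ds.to_ddf()
print(ddf.npartitions)                    # number of ~128 MB partitions
print(ddf.map_partitions(len).compute())  # rows per partition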

Footnotes

  1. https://nvidia-merlin.github.io/NVTabular/main/resources/troubleshooting.html?highlight=rmm_pool_size

rnyak commented

@bilzard which docker image are you using? You posted cudf == 22.12.0, but our latest docker images (e.g. merlin-tensorflow:22.12) should have cudf version 22.08.

Please check out the support matrix for the release versions: https://nvidia-merlin.github.io/Merlin/main/support_matrix/support_matrix_merlin_tensorflow.html
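For reference, the cudf version actually installed inside a running container can be checked directly:

import cudf

print(cudf.__version__)  # compare against the support matrix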

bilzard commented

@rnyak Thank you for sharing the link. I re-ran my script on both of these docker images, and the problem was reproduced on both:

  • nvcr.io/nvidia/merlin/merlin-pytorch:22.11
  • nvcr.io/nvidia/merlin/merlin-pytorch:22.12

Error log (merlin-pytorch:22.12)

2023-01-20 09:41:37,916 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-20 09:41:37,916 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.USER_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.USER: 'user'>, <Tags.ID: 'id'>].
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/xgboost/dask.py:884: RuntimeWarning: coroutine 'Client._wait_for_workers' was never awaited
  client.wait_for_workers(n_workers)
/usr/local/lib/python3.8/dist-packages/distributed/worker_state_machine.py:3468: FutureWarning: The `Worker.nthreads` attribute has been moved to `Worker.state.nthreads`
  warnings.warn(
[09:42:08] task [xgboost.dask-0]:tcp://127.0.0.1:45449 got new rank 0
2023-01-20 09:42:14,881 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-3fcc8297-d94c-4735-bde7-d827a023737c
Function:  dispatched_train
args:      ({'eval_metric': 'ndcg@20', 'objective': 'rank:pairwise'}, [b'DMLC_NUM_WORKER=1', b'DMLC_TRACKER_URI=172.18.0.2', b'DMLC_TRACKER_PORT=48471', b'DMLC_TASK_ID=[xgboost.dask-0]:tcp://127.0.0.1:45449'], 140281164174192, ['validation_set'], [140281164173904], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': None, 'enable_categorical': False, 'parts': [{'data':        type  priority       wgt  ...     n_i_h22_s     n_i_h23_s  session
8         1        11  0.037885  ...  1.000000e-09  1.000000e-09       78
17        1         6  0.054992  ...  1.000000e-09  1.000000e-09       78
5         1        16  0.032280  ...  1.000000e-09  1.000000e-09       78
19        1        14  0.035469  ...  1.000000e-09  1.000000e-09       78
4         1        13  0.036726  ...  1.000000e-09  1.000000e-09       78
...     ...       ...       ...  ...           ...           ...      ...
74342     2       116  0.016393  ...  7.692308e-11  7.692308e-11    85679
74306     2     
kwargs:    {}
Exception: "XGBoostError('[09:42:09] ../src/c_api/../data/../common/device_helpers.cuh:428: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory\\n- Free memory: 1238040576\\n- Requested memory: 4589462176\\n\\nStack trace:\\n  [bt] (0) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a5799) [0x7f2b94c4f799]\\n  [bt] (1) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a9bab) [0x7f2b94c53bab]\\n  [bt] (2) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x8dc49) [0x7f2b94937c49]\\n  [bt] (3) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3edbeb) [0x7f2b94c97beb]\\n  [bt] (4) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee759) [0x7f2b94c98759]\\n  [bt] (5) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee993) [0x7f2b94c98993]\\n  [bt] (6) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45b7f9) [0x7f2b94d057f9]\\n  [bt] (7) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45bec0) [0x7f2b94d05ec0]\\n  [bt] (8) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x421b40) [0x7f2b94ccbb40]\\n\\n')"

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
Cell In[7], line 8
      1 with LocalCUDACluster(
      2     n_workers=1,  # Number of GPU workers
      3     device_memory_limit=device_limit,  # GPU->CPU spill threshold (~75% of GPU memory)
      4     rmm_pool_size=(device_pool_size // 256) * 256,  # Memory pool size on each worker
      5     local_directory="/nvme/scratch/",  # Fast directory for disk spilling
      6 ) as cluster:
      7     client = Client(cluster)
----> 8     ranker.fit(
      9         train_processed,
     10         evals=[
     11             (valid_processed, "validation_set"),
     12         ],
     13         **xgb_train_params,
     14     )

File /usr/local/lib/python3.8/dist-packages/merlin/models/xgb/__init__.py:178, in XGBoost.fit(self, train, evals, use_quantile, **train_kwargs)
    175     d_eval = dmatrix_cls(self.dask_client, X, label=y, qid=qid)
    176     watchlist.append((d_eval, name))
--> 178 train_res = xgb.dask.train(
    179     self.dask_client, self.params, dtrain, evals=watchlist, **train_kwargs
    180 )
    181 self.booster: xgb.Booster = train_res["booster"]
    182 self.evals_result = train_res["history"]

File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:575, in _deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
    573 for k, arg in zip(sig.parameters, args):
    574     kwargs[k] = arg
--> 575 return f(**kwargs)

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:1072, in train(client, params, dtrain, num_boost_round, evals, obj, feval, early_stopping_rounds, xgb_model, verbose_eval, callbacks, custom_metric)
   1070 client = _xgb_get_client(client)
   1071 args = locals()
-> 1072 return client.sync(
   1073     _train_async,
   1074     global_config=config.get_config(),
   1075     dconfig=_get_dask_config(),
   1076     **args,
   1077 )

File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:338, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    336     return future
    337 else:
--> 338     return sync(
    339         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    340     )

File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:405, in sync(loop, func, callback_timeout, *args, **kwargs)
    403 if error:
    404     typ, exc, tb = error
--> 405     raise exc.with_traceback(tb)
    406 else:
    407     return result

File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:378, in sync.<locals>.f()
    376         future = asyncio.wait_for(future, callback_timeout)
    377     future = asyncio.ensure_future(future)
--> 378     result = yield future
    379 except Exception:
    380     error = sys.exc_info()

File /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:1010, in _train_async(client, global_config, dconfig, params, dtrain, num_boost_round, evals, obj, feval, early_stopping_rounds, verbose_eval, xgb_model, callbacks, custom_metric)
   1007     evals_name = []
   1008     evals_id = []
-> 1010 results = await map_worker_partitions(
   1011     client,
   1012     dispatched_train,
   1013     params,
   1014     _rabit_args,
   1015     id(dtrain),
   1016     evals_name,
   1017     evals_id,
   1018     *([dtrain] + evals_data),
   1019     workers=workers,
   1020 )
   1021 return list(filter(lambda ret: ret is not None, results))[0]

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:570, in map_worker_partitions(client, func, workers, *refs)
    566     fut = client.submit(
    567         func, *args, pure=False, workers=[addr], allow_other_workers=False
    568     )
    569     futures.append(fut)
--> 570 results = await client.gather(futures)
    571 return results

File /usr/local/lib/python3.8/dist-packages/distributed/client.py:2038, in Client._gather(self, futures, errors, direct, local_worker)
   2036         exc = CancelledError(key)
   2037     else:
-> 2038         raise exception.with_traceback(traceback)
   2039     raise exc
   2040 if errors == "skip":

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:974, in dispatched_train()
    972         evals.append((Xy, evals_name[i]))
    973         continue
--> 974     eval_Xy = _dmatrix_from_list_of_parts(**ref, nthread=n_threads)
    975     evals.append((eval_Xy, evals_name[i]))
    977 booster = worker_train(
    978     params=local_param,
    979     dtrain=Xy,
   (...)
    989     callbacks=callbacks,
    990 )

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:845, in _dmatrix_from_list_of_parts()
    843 if is_quantile:
    844     return _create_device_quantile_dmatrix(**kwargs)
--> 845 return _create_dmatrix(**kwargs)

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:828, in _create_dmatrix()
    825     v = concat_or_none(value)
    826     concated_dict[key] = v
--> 828 dmatrix = DMatrix(
    829     **concated_dict,
    830     missing=missing,
    831     feature_names=feature_names,
    832     feature_types=feature_types,
    833     nthread=nthread,
    834     enable_categorical=enable_categorical,
    835     feature_weights=feature_weights,
    836 )
    837 return dmatrix

File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:575, in inner_f()
    573 for k, arg in zip(sig.parameters, args):
    574     kwargs[k] = arg
--> 575 return f(**kwargs)

File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:686, in __init__()
    683     assert self.handle is not None
    684     return
--> 686 handle, feature_names, feature_types = dispatch_data_backend(
    687     data,
    688     missing=self.missing,
    689     threads=self.nthread,
    690     feature_names=feature_names,
    691     feature_types=feature_types,
    692     enable_categorical=enable_categorical,
    693 )
    694 assert handle is not None
    695 self.handle = handle

File /usr/local/lib/python3.8/dist-packages/xgboost/data.py:896, in dispatch_data_backend()
    892     return _from_pandas_series(
    893         data, missing, threads, enable_categorical, feature_names, feature_types
    894     )
    895 if _is_cudf_df(data) or _is_cudf_ser(data):
--> 896     return _from_cudf_df(
    897         data, missing, threads, feature_names, feature_types, enable_categorical
    898     )
    899 if _is_cupy_array(data):
    900     return _from_cupy_array(data, missing, threads, feature_names,
    901                             feature_types)

File /usr/local/lib/python3.8/dist-packages/xgboost/data.py:691, in _from_cudf_df()
    689 handle = ctypes.c_void_p()
    690 config = bytes(json.dumps({"missing": missing, "nthread": nthread}), "utf-8")
--> 691 _check_call(
    692     _LIB.XGDMatrixCreateFromCudaColumnar(
    693         interfaces_str,
    694         config,
    695         ctypes.byref(handle),
    696     )
    697 )
    698 return handle, feature_names, feature_types

File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:246, in _check_call()
    235 """Check the return value of C API call
    236 
    237 This function will raise exception when error occurs.
   (...)
    243     return value from API calls
    244 """
    245 if ret != 0:
--> 246     raise XGBoostError(py_str(_LIB.XGBGetLastError()))

XGBoostError: [09:42:09] ../src/c_api/../data/../common/device_helpers.cuh:428: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
- Free memory: 1238040576
- Requested memory: 4589462176

Stack trace:
  [bt] (0) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a5799) [0x7f2b94c4f799]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a9bab) [0x7f2b94c53bab]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x8dc49) [0x7f2b94937c49]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3edbeb) [0x7f2b94c97beb]
  [bt] (4) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee759) [0x7f2b94c98759]
  [bt] (5) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee993) [0x7f2b94c98993]
  [bt] (6) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45b7f9) [0x7f2b94d057f9]
  [bt] (7) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45bec0) [0x7f2b94d05ec0]
  [bt] (8) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x421b40) [0x7f2b94ccbb40]

2023-01-20 09:42:45,541 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client

Here is a visualization from nvitop.
In the beginning, GPU memory consumption is quite healthy (~65%); then the process suddenly requested close to 100% of GPU memory and died.

[Screenshot: nvitop GPU memory usage, 2023-01-20 18:42:21]

I believe this issue should be filed against this repo instead, so I am closing this issue:
https://github.com/NVIDIA-Merlin/models