NVIDIA-Merlin/NVTabular

[BUG] Multi-GPU training failing during data loading: tabulate: failed to synchronize: cudaErrorIllegalAddress

srastatter opened this issue · 0 comments

Describe the bug
Memory access issues are encountered when attempting to train a Merlin two-tower recommender with multiple GPUs. When running with 1 GPU, the training process completes successfully, and I can verify using nvidia-smi that training is indeed using the GPU. On this same machine, I can also successfully run Horovod distributed training jobs across multiple GPUs using TensorFlow datasets - here is an example script that works.
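For reference, that working script follows the standard Horovod Keras recipe over a tf.data pipeline, roughly as in the condensed sketch below (the dataset and model builders here are placeholders, not the actual script):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

def make_dataset():
    # Tiny synthetic dataset standing in for the real input pipeline.
    x = tf.random.uniform((1024, 8))
    y = tf.cast(tf.reduce_sum(x, axis=1) > 4.0, tf.float32)
    return tf.data.Dataset.from_tensor_slices((x, y))

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# Each rank reads a distinct shard of the tf.data dataset.
dataset = make_dataset().shard(hvd.size(), hvd.rank()).batch(128)

model = build_model()
opt = hvd.DistributedOptimizer(tf.keras.optimizers.legacy.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="binary_crossentropy")

model.fit(
    dataset,
    epochs=2,
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)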

However, with nvtabular datasets, memory access errors occur when more than 1 GPU is used. The first GPU is able to access the dataset and progress up to the start of training, but when the 2nd GPU attempts to access data via the dataloader, a cudaErrorIllegalAddress error is thrown. The error and stack trace appear to be similar in nature to this closed bug; however, the resolution in the closed bug does not solve the error here.

Below is a stack trace of the error. Note that the process on the first GPU, denoted by [1,0], continues up to the beginning of training, but the second GPU, [1,1], attempts to access what appears to be de-allocated memory of some kind:

[1,0]<stdout>:[SOK INFO] Import /usr/local/lib/python3.10/dist-packages/merlin_sok-2.0.0-py3.10-linux-x86_64.egg/sparse_operation_kit/lib/libsparse_operation_kit.so
[1,1]<stdout>:[SOK INFO] Import /usr/local/lib/python3.10/dist-packages/merlin_sok-2.0.0-py3.10-linux-x86_64.egg/sparse_operation_kit/lib/libsparse_operation_kit.so
[1,0]<stdout>:[SOK INFO] Initialize finished, communication tool: horovod
[1,1]<stdout>:[SOK INFO] Initialize finished, communication tool: horovod
[1,0]<stderr>:2024-06-14 18:34:46.325291: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
[1,1]<stderr>:2024-06-14 18:34:46.445355: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
[1,1]<stderr>:User function raise error: tabulate: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[1,1]<stderr>:	return _run_code(code, main_globals, None,
[1,1]<stderr>:  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[1,1]<stderr>:	exec(code, run_globals)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/horovod/runner/run_task.py", line 37, in <module>
[1,1]<stderr>:	main(driver_addr, run_func_server_port)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/horovod/runner/run_task.py", line 28, in main
[1,1]<stderr>:	raise e
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/horovod/runner/run_task.py", line 25, in main
[1,1]<stderr>:	ret_val = func()
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/horovod/runner/__init__.py", line 215, in wrapped_func
[1,1]<stderr>:	return func(*args, **kwargs)
[1,1]<stderr>:  File "/home/jupyter/pznpfm-recommender-system/recommender_system/models/retriever/examples/train.py", line 131, in main
[1,1]<stderr>:	model.fit(train,
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/merlin/models/tf/models/base.py", line 1416, in fit
[1,1]<stderr>:	out = super().fit(**fit_kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,1]<stderr>:	raise e.with_traceback(filtered_tb) from None
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 93, in __getitem__
[1,1]<stderr>:	return self.__next__()
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 97, in __next__
[1,1]<stderr>:	converted_batch = self.convert_batch(super().__next__())
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 261, in __next__
[1,1]<stderr>:	return self._get_next_batch()
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 328, in _get_next_batch
[1,1]<stderr>:	self._fetch_chunk()
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 277, in _fetch_chunk
[1,1]<stderr>:	raise chunks
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 791, in load_chunks
[1,1]<stderr>:	self.chunk_logic(itr)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
[1,1]<stderr>:	result = func(*args, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 767, in chunk_logic
[1,1]<stderr>:	chunks.reset_index(drop=True, inplace=True)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/cudf/core/dataframe.py", line 2984, in reset_index
[1,1]<stderr>:	*self._reset_index(
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/cudf/core/indexed_frame.py", line 2929, in _reset_index
[1,1]<stderr>:	) = self._index._split_columns_by_levels(level)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/cudf/core/_base_index.py", line 1749, in _split_columns_by_levels
[1,1]<stderr>:	[self._data[self.name]],
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
[1,1]<stderr>:	result = func(*args, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/cudf/core/index.py", line 291, in _data
[1,1]<stderr>:	{self.name: self._values}
[1,1]<stderr>:  File "/usr/lib/python3.10/functools.py", line 981, in __get__
[1,1]<stderr>:	val = self.func(instance)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
[1,1]<stderr>:	result = func(*args, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/cudf/core/index.py", line 252, in _values
[1,1]<stderr>:	return column.arange(
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 2500, in arange
[1,1]<stderr>:	return libcudf.filling.sequence(
[1,1]<stderr>:  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
[1,1]<stderr>:	return func(*args, **kwds)
[1,1]<stderr>:  File "filling.pyx", line 97, in cudf._lib.filling.sequence
[1,1]<stderr>:RuntimeError: tabulate: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
[1,0]<stdout>:Epoch 1/2
[1,1]<stderr>:Error in sys.excepthook:
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/exceptiongroup/_formatting.py", line 71, in exceptiongroup_excepthook
[1,1]<stderr>:TypeError: 'NoneType' object is not callable
[1,1]<stderr>:
[1,1]<stderr>:Original exception was:
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
[1,1]<stderr>:  File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
[1,1]<stderr>:cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[1,1]<stderr>:Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "cupy_backends/cuda/api/driver.pyx", line 217, in cupy_backends.cuda.api.driver.moduleUnload
[1,1]<stderr>:  File "cupy_backends/cuda/api/driver.pyx", line 60, in cupy_backends.cuda.api.driver.check_status
[1,1]<stderr>:cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
... (the "Error in sys.excepthook" / CUDA_ERROR_ILLEGAL_ADDRESS traceback above repeats several more times; truncated here) ...

Each process has been explicitly pinned to 1 GPU per Horovod's documentation, using the block of code below:

import tensorflow as tf
import horovod.tensorflow.keras as hvd  # the Keras binding is assumed here; horovod.tensorflow is pinned the same way

hvd.init()

# Enable memory growth on every GPU, then expose only this rank's GPU to TensorFlow.
gpus = tf.config.experimental.list_physical_devices('GPU')
print("GPUs:", gpus)
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    print('device:', gpus[hvd.local_rank()])
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
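For completeness, each rank then builds its own dataloader over the shared NVTabular dataset. A minimal sketch of that step is below; the global_size/global_rank sharding arguments follow the merlin-dataloader API, and the dataset path is hypothetical - the exact construction in train.py may differ:

import horovod.tensorflow.keras as hvd
from merlin.io import Dataset
from merlin.dataloader.tensorflow import Loader

train = Dataset("/path/to/train/*.parquet")  # hypothetical path

# Each Horovod worker reads only its shard of the dataset.
loader = Loader(
    train,
    batch_size=1024,
    shuffle=True,
    global_size=hvd.size(),   # total number of workers
    global_rank=hvd.rank(),   # this worker's shard index
)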

Steps/Code to reproduce bug
I've attached a sample Python script, train.py.zip, to help recreate this error. The script is based on this example and uses synthetic NVTabular datasets; I've modified it to use Horovod for distributed training across multiple GPUs. The --np flag specifies the number of GPU processes to run the script with (see the launcher sketch after the sample usage below). Sample usage:

python train.py --np 1 --> runs to completion
python train.py --np 2 --> errors during data loading
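The --np flag is passed through to Horovod's in-process launcher, roughly as in the sketch below (the argument handling is illustrative, not copied from train.py):

import argparse
import horovod

def main():
    # Per-rank work goes here: hvd.init(), GPU pinning, dataloader setup, model.fit(), ...
    import horovod.tensorflow.keras as hvd
    hvd.init()
    print(f"rank {hvd.rank()} of {hvd.size()} started")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--np", type=int, default=1, help="number of GPU processes to launch")
    args = parser.parse_args()
    # Spawn one Horovod worker process per GPU; each executes main().
    horovod.run(main, np=args.np)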

Expected behavior
Training should complete successfully and utilize multiple GPUs; data loading should distribute across GPUs.

Environment details (please complete the following information):

Using Merlin TensorFlow container v23.12

  • Environment location: Cloud - Vertex AI Workbench User Managed Instance with 4 attached GPUs
  • Platform: Linux
  • Method of NVTabular install: Default w/ container
  • Python version: 3.10
  • CUDA version: 12.1
  • Merlin version: 23.8.0
  • TensorFlow version: 2.12.0+nv23.6
  • cudf version: 23.4.0