ut-parla/Parla.py

Threaded Cupy Warmup/Initialization Error


When running with cupy 9.3.0 and cudatoolkit 11.2.2, I occasionally see the following error when running Parla on the TSQR demo app.

It could be something wrong with my env, but I'm logging it here. While it's a rare error for any single Parla instance, it happens fairly often on larger MPI runs.

Unexpected exception in Task handling
Traceback (most recent call last):
  File ".../miniconda3/lib/python3.8/site-packages/parla/task_runtime.py", line 515, in run
    component.initialize_thread()
  File ".../miniconda3/lib/python3.8/site-packages/parla/cuda.py", line 250, in initialize_thread
    cupy.asnumpy(cupy.sqrt(a))
  File ".../miniconda3/lib/python3.8/site-packages/cupy/__init__.py", line 773, in asnumpy
    return a.get(stream=stream, order=order)
  File "cupy/_core/core.pyx", line 1567, in cupy._core.core.ndarray.get
  File "cupy/_core/core.pyx", line 1636, in cupy._core.core.ndarray.get
  File "cupy/_core/core.pyx", line 1644, in cupy._core.core.ndarray.get
  File "cupy/cuda/memory.pyx", line 551, in cupy.cuda.memory.MemoryPointer.copy_to_host_async
  File "cupy_backends/cuda/api/runtime.pyx", line 693, in cupy_backends.cuda.api.runtime.memcpyAsync
  File "cupy_backends/cuda/api/runtime.pyx", line 273, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidValue: invalid argument
Unexpected exception in Task handling

I don't see how 'a' would fail to exist after a sync, but the GPU-to-CPU copy (the memcpyAsync call in the traceback) is failing.
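
For anyone trying to reproduce this outside Parla, here is a rough standalone sketch of a threaded warmup. Only the `cupy.asnumpy(cupy.sqrt(a))` call comes from the traceback. The array contents, the explicit sync, the use of device 0, the thread count, and the names `warmup` and `run_warmup` are all assumptions for illustration and are not taken from Parla's `initialize_thread`. It falls back to NumPy when cupy cannot be imported, so the script still runs on a machine without a GPU.

```python
import threading

import numpy as np

try:
    import cupy
    HAVE_CUPY = True
except Exception:  # cupy missing or unusable: fall back to NumPy
    cupy = None
    HAVE_CUPY = False


def warmup(device_id, errors):
    """Run one warmup in the calling thread and record any failure."""
    try:
        if HAVE_CUPY:
            with cupy.cuda.Device(device_id):
                # Assumed setup: the real array used by Parla is not shown.
                a = cupy.asarray([2.0, 3.0])
                cupy.cuda.Stream.null.synchronize()
                # The call that appears in the traceback.
                cupy.asnumpy(cupy.sqrt(a))
        else:
            np.sqrt(np.asarray([2.0, 3.0]))
    except Exception as exc:
        errors.append((device_id, repr(exc)))


def run_warmup(num_threads, device_id=0):
    """Start num_threads warmups concurrently; return recorded errors."""
    errors = []
    threads = [
        threading.Thread(target=warmup, args=(device_id, errors))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


if __name__ == "__main__":
    print(run_warmup(8))
```

Since the failure is intermittent, a single run may well come back with an empty list; running it in a loop, or under MPI as in the original setup, would probably be needed to see anything.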