CuPy loses track of resource handles
sestephens73 opened this issue · 3 comments
sestephens73 commented
Every so often I see an intermittent error when doing automatic data movement, like this:
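import numpy as np
from parla.cuda import gpu
from parla.ldevice import LDeviceSequenceBlocked

A = np.random.rand(NROWS, NCOLS)
# NROWS, NCOLS, nblocks and NGPUS are integers; blocks are placed on GPUs round-robin
mapper = LDeviceSequenceBlocked(nblocks, placement=[gpu(block % NGPUS) for block in range(nblocks)])
A_dev = mapper.partition_tensor(A)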
and I get the following error:
Exception in task
Traceback (most recent call last):
  File "/home1/07999/stephens/Parla.py/parla/task_runtime.py", line 283, in run
    task_state = self._state.func(self, *self._state.args)
  File "/home1/07999/stephens/Parla.py/parla/tasks.py", line 300, in _task_callback
    new_task_info = body.send(in_value)
  File "temp.py", line 481, in test_tsqr_blocked
    A_dev = mapper.partition_tensor(A)
  File "/home1/07999/stephens/Parla.py/parla/ldevice.py", line 144, in partition_tensor
    return self.partition(lambda i: data[self.slice(i, n, overlap=overlap), ...],
  File "/home1/07999/stephens/Parla.py/parla/ldevice.py", line 131, in partition
    return PartitionedTensor([data(i, memory=self.memory(i, kind=memory_kind), device=self.device(i))
  File "/home1/07999/stephens/Parla.py/parla/ldevice.py", line 131, in <listcomp>
    return PartitionedTensor([data(i, memory=self.memory(i, kind=memory_kind), device=self.device(i))
  File "/home1/07999/stephens/Parla.py/parla/ldevice.py", line 344, in wrapper
    return memory(data(*args))
  File "/home1/07999/stephens/Parla.py/parla/cuda.py", line 75, in __call__
    return cupy.asarray(target)
  File "/home1/07999/stephens/miniconda3/envs/parla/lib/python3.8/site-packages/cupy/_creation/from_data.py", line 66, in asarray
    return core.array(a, dtype, False, order)
  File "cupy/core/core.pyx", line 2004, in cupy.core.core.array
  File "cupy/core/core.pyx", line 2083, in cupy.core.core.array
  File "cupy/core/core.pyx", line 2170, in cupy.core.core._send_object_to_gpu
  File "cupy/cuda/stream.pyx", line 245, in cupy.cuda.stream.BaseStream.record
  File "cupy_backends/cuda/api/runtime.pyx", line 854, in cupy_backends.cuda.api.runtime.eventRecord
  File "cupy_backends/cuda/api/runtime.pyx", line 247, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidResourceHandle: invalid resource handle
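For reference, the `placement` expression in the snippet above sends block `i` to GPU `i % NGPUS`, so consecutive host-to-device copies target different devices. The sketch below spells that mapping out in plain Python. The helper names `round_robin_placement` and `block_row_ranges` are hypothetical, and the even row split is only an assumption for illustration; Parla's own `slice` logic may divide rows differently.

```python
def round_robin_placement(nblocks, ngpus):
    """Return the GPU index each block lands on, mirroring `block % NGPUS`."""
    return [block % ngpus for block in range(nblocks)]


def block_row_ranges(nrows, nblocks):
    """Hypothetical even split of `nrows` into `nblocks` contiguous (start, stop) ranges.

    This is an assumption for illustration, not Parla's actual slicing.
    The first `nrows % nblocks` blocks receive one extra row.
    """
    base, extra = divmod(nrows, nblocks)
    ranges = []
    start = 0
    for block in range(nblocks):
        stop = start + base + (1 if block < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def describe_transfers(nrows, nblocks, ngpus):
    """Pair each block's row range with the GPU index it would be copied to."""
    return list(zip(block_row_ranges(nrows, nblocks),
                    round_robin_placement(nblocks, ngpus)))
```

For example, `describe_transfers(10, 4, 2)` lists four row ranges alternating between GPU 0 and GPU 1, which is the pattern of device switches that precedes each `cupy.asarray` call in the traceback.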
bozhiyou commented
This error was already recurring before the data movement changes. It happens occasionally, with no known pattern.
insertinterestingnamehere commented
This may be another result of cupy/cupy#5006. I fixed it upstream in cupy/cupy#5083, so the patch should hopefully be available in their 9.1 release in about a month. We should confirm once that's out.
insertinterestingnamehere commented
CuPy 9.1.0 is out, so we just need to confirm that this is fixed now.
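One way to approach that confirmation is sketched below: first check that the installed CuPy is at least 9.1.0, then run many host-to-device copies across all visible GPUs and see whether the error reappears. The helpers `version_tuple`, `has_fix`, and `stress_transfers` are hypothetical names, not part of Parla or CuPy. The loop is single-threaded and does not go through Parla's task runtime, so a clean run is only weak evidence; since the failure is intermittent, rerunning the original `temp.py` test many times is still needed.

```python
def version_tuple(version):
    """Parse the leading numeric 'X.Y.Z' part of a version string into a tuple."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(version, minimum=(9, 1, 0)):
    """True if `version` is at least `minimum` (pre-release suffixes are ignored)."""
    return version_tuple(version) >= minimum


def stress_transfers(iterations=1000, shape=(256, 256)):
    """Hypothetical stress loop: copy a host array to each visible GPU in turn.

    Requires NumPy, CuPy, and at least one GPU. Raises CUDARuntimeError
    if a transfer fails.
    """
    import numpy as np
    import cupy

    ngpus = cupy.cuda.runtime.getDeviceCount()
    host = np.random.rand(*shape)
    for i in range(iterations):
        # Switch the current device round-robin, as the placement in the report does.
        with cupy.cuda.Device(i % ngpus):
            cupy.asarray(host)
    return iterations
```

On a GPU machine, this could be used as `import cupy; assert has_fix(cupy.__version__); stress_transfers()`.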