"too many resources requested for launch" when running Transpose on small, padded array
robertmaxton42 opened this issue · 4 comments
Consider the code:
```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gp
import reikna.cluda as cluda
from reikna.algorithms import Transpose
from reikna.core import Type

api = cluda.any_api()
thr = api.Thread(pycuda.autoinit.context)

arrbase = gp.zeros((10, 10, 7), np.ubyte)
arrgpu = thr.array((8, 8, 7), np.ubyte, arrbase.strides,
                   arrbase.strides[0] + arrbase.strides[1], base=arrbase)
arr = np.asarray([[[(i + 1) * (j + 1) * (k + 1) for k in range(7)]
                   for j in range(8)] for i in range(8)], np.ubyte)
thr.to_device(arr, arrgpu)

outtype = Type(np.ubyte, (8, 7, 8), (70, 10, 1), 70 + 1, arrbase.nbytes)
tr = Transpose(arr, outtype, axes=(0, 2, 1)).compile(thr)
out = thr.array(outtype.shape, outtype.dtype, outtype.strides, outtype.offset, outtype.nbytes)
tr(out, arr)
```
gets me a nice big red box that ends with

```
LaunchError: cuLaunchKernel failed: too many resources requested for launch
```

Checking `outtype` in a separate cell gives

```
Type(uint8, shape=(8, 7, 8), strides=(70, 10, 1), offset=71, nbytes=700)
```

so clearly the system isn't actually running out of memory. Googling the error leads me to guess that `Transpose` is asking for too many threads per block internally, but I can't be sure without better familiarity with the internals...
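For what it's worth, here's a quick way to query the per-block limits in question (plain PyCUDA device attributes, nothing reikna-specific):

```python
import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
# The per-block resources that "too many resources requested for launch"
# usually refers to:
for attr in ("MAX_THREADS_PER_BLOCK",
             "MAX_REGISTERS_PER_BLOCK",
             "MAX_SHARED_MEMORY_PER_BLOCK"):
    print(attr, dev.get_attribute(getattr(drv.device_attribute, attr)))
```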
Thanks for all the help!
(Possibly relevant: I'm running this on a rather old 755M, compute capability 3.0.)
Note that you are passing a numpy array to `tr` as the second argument: `tr(out, arr)`.
If you replace it with `arrgpu`, the error disappears. I am not sure what is causing the underlying error though.
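In terms of the snippet above, only the last line needs to change:

```python
thr.to_device(arr, arrgpu)  # host -> device copy, as before
tr(out, arrgpu)             # pass the device-side array to the compiled kernel
```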
... Okay, on the one hand, that's a silly whoops on my part, apologies. On the other hand, that's a really weird way for it to go wrong.
So, I looked into it, and the reason is that a numpy array passed as an argument to a kernel results in its whole contents (that is, not a pointer) being attached to the argument list (see `_build_arg_buf()` in PyCUDA's `driver.py`). Since the array is pretty large, it results in the error from CUDA.
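To illustrate, a minimal PyCUDA sketch (independent of reikna; the kernel and names are made up for this example): an ndarray argument is packed by value into the launch's argument buffer, while a `DeviceAllocation` or `GPUArray` contributes only a pointer.

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void first_byte(unsigned char *out, unsigned char *in)
{
    out[0] = in[0];
}
""")
first_byte = mod.get_function("first_byte")

host = np.zeros(448, np.ubyte)  # same size as the 8*8*7 array above
out = drv.mem_alloc(1)

# Correct: drv.to_device returns a DeviceAllocation, so only an
# 8-byte pointer goes into the kernel's argument buffer.
dev = drv.to_device(host)
first_byte(out, dev, block=(1, 1, 1), grid=(1, 1))

# Wrong: passing `host` itself makes _build_arg_buf() append all
# 448 bytes of array data to the argument buffer; the kernel would
# misread the first bytes as a pointer, and a large enough buffer
# makes cuLaunchKernel fail with "too many resources requested
# for launch", as in this issue.
# first_byte(out, host, block=(1, 1, 1), grid=(1, 1))
```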