"too many resources requested for launch" when running Transpose on small, padded array
robertmaxton42 opened this issue · 4 comments
Consider the code:
```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gp
import reikna.cluda as cluda
from reikna.algorithms import Transpose
from reikna.core import Type

api = cluda.any_api()
thr = api.Thread(pycuda.autoinit.context)

arrbase = gp.zeros((10, 10, 7), np.ubyte)
arrgpu = thr.array((8, 8, 7), np.ubyte, arrbase.strides,
                   arrbase.strides[0] + arrbase.strides[1], base=arrbase)
arr = np.asarray([[[(i + 1) * (j + 1) * (k + 1) for k in range(7)]
                   for j in range(8)] for i in range(8)], np.ubyte)
thr.to_device(arr, arrgpu)

outtype = Type(np.ubyte, (8, 7, 8), (70, 10, 1), 70 + 1, arrbase.nbytes)
tr = Transpose(arr, outtype, axes=(0, 2, 1)).compile(thr)
out = thr.array(outtype.shape, outtype.dtype, outtype.strides, outtype.offset, outtype.nbytes)
tr(out, arr)
```
gets me a nice big red box that ends with

```
LaunchError: cuLaunchKernel failed: too many resources requested for launch
```

Checking `outtype` in a separate cell gives

```
Type(uint8, shape=(8, 7, 8), strides=(70, 10, 1), offset=71, nbytes=700)
```

so clearly the system isn't actually running out of memory. Googling the error leads me to guess that `Transpose` is asking for too many threads per block internally, but I can't be sure without better familiarity with the internals...
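For what it's worth, here's a quick way to query the per-block limits in question (plain PyCUDA device attributes, nothing reikna-specific):

```python
import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
# The per-block resources that "too many resources requested for launch"
# usually refers to:
for attr in ("MAX_THREADS_PER_BLOCK",
             "MAX_REGISTERS_PER_BLOCK",
             "MAX_SHARED_MEMORY_PER_BLOCK"):
    print(attr, dev.get_attribute(getattr(drv.device_attribute, attr)))
```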
Thanks for all the help!
(Possibly relevant: I'm running this on a rather old 755M, compute capability 3.0.)
Note that you are passing a numpy array to `tr` as the second argument: `tr(out, arr)`.
If you replace it with `arrgpu`, the error disappears. I am not sure what is causing the underlying error though.
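In terms of the snippet above, only the last line needs to change:

```python
thr.to_device(arr, arrgpu)  # host -> device copy, as before
tr(out, arrgpu)             # pass the device-side array to the compiled kernel
```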
... Okay, on the one hand, that's a silly whoops on my part, apologies. On the other hand, that's a really weird way for it to go wrong.
So, I looked into it, and the reason is that a numpy array passed as an argument to a kernel results in its whole contents (that is, not a pointer) being attached to the argument list (see `_build_arg_buf()` in PyCUDA's `driver.py`). Since the array is pretty large, it results in the error from CUDA.
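To illustrate, a minimal PyCUDA sketch (independent of reikna; the kernel and names are made up for this example): an ndarray argument is packed by value into the launch's argument buffer, while a `DeviceAllocation` or `GPUArray` contributes only a pointer.

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void first_byte(unsigned char *out, unsigned char *in)
{
    out[0] = in[0];
}
""")
first_byte = mod.get_function("first_byte")

host = np.zeros(448, np.ubyte)  # same size as the 8*8*7 array above
out = drv.mem_alloc(1)

# Correct: drv.to_device returns a DeviceAllocation, so only an
# 8-byte pointer goes into the kernel's argument buffer.
dev = drv.to_device(host)
first_byte(out, dev, block=(1, 1, 1), grid=(1, 1))

# Wrong: passing `host` itself makes _build_arg_buf() append all
# 448 bytes of array data to the argument buffer; the kernel would
# misread the first bytes as a pointer, and a large enough buffer
# makes cuLaunchKernel fail with "too many resources requested
# for launch", as in this issue.
# first_byte(out, host, block=(1, 1, 1), grid=(1, 1))
```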