fjarri/reikna

"too many resources requested for launch" when running Transpose on small, padded array

robertmaxton42 opened this issue

Consider the code:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gp
import reikna.cluda as cluda
from reikna.algorithms import Transpose
from reikna.core import Type

api = cluda.any_api()
thr = api.Thread(pycuda.autoinit.context)

arrbase = gp.zeros((10, 10, 7), np.ubyte)
arrgpu = thr.array((8, 8, 7), np.ubyte, arrbase.strides, arrbase.strides[0] + arrbase.strides[1], base=arrbase)
arr = np.asarray([[[(i + 1) * (j + 1) * (k + 1) for k in range(7)] for j in range(8)] for i in range(8)], np.ubyte)
thr.to_device(arr, arrgpu)

outtype = Type(np.ubyte, (8, 7, 8), (70, 10, 1), 70 + 1, arrbase.nbytes)
tr = Transpose(arr, outtype, axes=(0, 2, 1)).compile(thr)
out = thr.array(outtype.shape, outtype.dtype, outtype.strides, outtype.offset, outtype.nbytes)
tr(out, arr)

This gets me a nice big red box that ends with LaunchError: cuLaunchKernel failed: too many resources requested for launch. Checking outtype in a separate cell gives:

Type(uint8, shape=(8, 7, 8), strides=(70, 10, 1), offset=71, nbytes=700)

so the system clearly isn't actually running out of memory. Googling the error leads me to guess that Transpose is internally asking for too many threads per block, but I can't be sure without better familiarity with the internals...

Thanks for all the help!

(Possibly relevant: I'm running this on a rather old GeForce GT 755M, compute capability 3.0.)
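In case the threads-per-block guess matters, here is a quick sketch of how the per-block limits can be queried through PyCUDA's device attribute API (the attribute names are standard PyCUDA; the commented values are the documented CC 3.0 limits):

import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
attrs = dev.get_attributes()
# Documented limits for CC 3.0: 1024 threads, 65536 registers,
# and 49152 bytes of shared memory per block.
print(attrs[drv.device_attribute.MAX_THREADS_PER_BLOCK])
print(attrs[drv.device_attribute.MAX_REGISTERS_PER_BLOCK])
print(attrs[drv.device_attribute.MAX_SHARED_MEMORY_PER_BLOCK])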

Note that you are passing a numpy array to tr as the second argument:

tr(out, arr)

If you replace it with arrgpu, the error disappears. I am not sure what is causing the underlying error, though.
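That is, the last line of the snippet becomes:

tr(out, arrgpu)  # pass the device array instead of the numpy array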

... Okay, on the one hand, that's a silly whoops on my part, apologies. On the other hand, that's a really strange way for it to fail.

So, I looked into it, and the reason is that passing a numpy array as a kernel argument causes its entire contents (rather than a pointer) to be packed into the kernel's argument buffer (see _build_arg_buf() in PyCUDA's driver.py). Since the space CUDA reserves for kernel parameters is quite small, a reasonably large array is enough to trigger the launch error.
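A minimal sketch of the mechanism (simplified from what _build_arg_buf() does; the pointer value is just a stand-in):

import struct

import numpy as np

arr = np.ones((8, 8, 7), np.ubyte)  # same size as the array in the report
ptr = 0xDEADBEEF                    # stand-in for a device pointer

# A device allocation is packed into the argument buffer as a pointer ("P")...
pointer_arg = struct.pack("P", ptr)
# ...but a plain numpy array is packed as its raw bytes ("%ds" % nbytes).
array_arg = struct.pack("%ds" % arr.nbytes, arr.tobytes())

print(len(pointer_arg))  # 8 bytes on a 64-bit system
print(len(array_arg))    # 448 bytes: the whole array goes into the kernel parameters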