fjarri/reikna

Transpose on padded arrays - unclear

robertmaxton42 opened this issue · 6 comments

Related to the last issue - it's not entirely clear how to use Transpose on a padded array. After fixing my silly mistake last time, my output reads:

out[:,:,0]
array([[  0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0],
       [  6,  12,  18,  24,  30,  36,  42],
       [  8,  16,  24,  32,  40,  48,  56],
       [  6,  12,  18,  24,  30,  36,  42],
       [  0,   0,   0,   0,   0,   0,   0],
       [ 32,  64,  96, 128, 160, 192, 224],
       [ 30,  60,  90, 120, 150, 180, 210]], dtype=uint8)

For comparison, the correct result ought to be

arrgpu[:,0,:]
array([[ 1,  2,  3,  4,  5,  6,  7],
       [ 2,  4,  6,  8, 10, 12, 14],
       [ 3,  6,  9, 12, 15, 18, 21],
       [ 4,  8, 12, 16, 20, 24, 28],
       [ 5, 10, 15, 20, 25, 30, 35],
       [ 6, 12, 18, 24, 30, 36, 42],
       [ 7, 14, 21, 28, 35, 42, 49],
       [ 8, 16, 24, 32, 40, 48, 56]], dtype=uint8)

plus or minus some padding zeroes.

Now, I might just be using Transpose's new padding feature wrong - but, uh, in that case I'm not entirely sure how to use it right, so a documentation update might be in order.

Thanks!

I think it is working correctly; the problem arises when you copy it to the CPU. So far I've been relying on PyCUDA/PyOpenCL to do that, but they currently have problems with non-standard offsets and strides.

Also, I'm not even sure it is possible to create an array with an offset in numpy.
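(Strictly speaking, numpy's low-level ndarray constructor can describe a view with an explicit byte offset and custom strides into a bigger buffer; whether that is enough to mirror the GPU-side layout is another matter. A throwaway sketch, with made-up sizes:)

```python
import numpy as np

# Minimal sketch: a 7x7 view that starts 8 bytes into a 64-byte buffer and
# keeps the buffer's row pitch of 8 bytes. The sizes here are made up.
base = np.arange(64, dtype=np.uint8)
view = np.ndarray(shape=(7, 7), dtype=np.uint8,
                  buffer=base, offset=8, strides=(8, 1))
print(view.strides, view.flags['C_CONTIGUOUS'])   # (8, 1) False
```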

numpy can make padded arrays, but as far as I know it doesn't keep track of them as we'd like.

Is there any way to write __host__ code in either PyCUDA or Reikna, do you happen to know? If there is, I could try experimenting with managed memory or otherwise doing my own copying. (I could write it in pure CUDA C++, compile it, and call it with ctypes, but that would involve, yes, exploring the wonderful world of Python/C++ calling, which I have no familiarity with whatsoever as of yet... >.>)
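(If I did end up going the ctypes route, I imagine the Python side would look roughly like this; libpadcopy.so and copy_padded are hypothetical names, and gpu_array / padded_pitch_bytes stand in for the existing device array and its pitch:)

```python
import ctypes
import numpy as np

# Hypothetical shared library built from CUDA C++, e.g.
#   nvcc -shared -Xcompiler -fPIC padcopy.cu -o libpadcopy.so
lib = ctypes.CDLL("./libpadcopy.so")

# Hypothetical entry point that flattens a pitched device array into a
# contiguous host buffer:
#   void copy_padded(const void* src_dev, void* dst_host,
#                    size_t rows, size_t row_bytes, size_t src_pitch);
lib.copy_padded.argtypes = [ctypes.c_void_p, ctypes.c_void_p,
                            ctypes.c_size_t, ctypes.c_size_t, ctypes.c_size_t]
lib.copy_padded.restype = None

dst = np.empty((8, 7), dtype=np.uint8)            # contiguous host destination
src = ctypes.c_void_p(int(gpu_array.gpudata))     # PyCUDA device pointer (stand-in)
lib.copy_padded(src, dst.ctypes.data_as(ctypes.c_void_p),
                dst.shape[0], dst.shape[1] * dst.itemsize, padded_pitch_bytes)
```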

Barring that, I suppose it'd work if I pass arrbase as a parameter, and then I can transpose the whole base array and make my own views internally. Less than elegant/intuitive from a library-user perspective, but I don't actually seriously expect anyone else to use this code, I suppose.

... Actually, I can't do that, because there's no way to plan the creation of an array that takes a base or base_data parameter. Hm.

> numpy can make padded arrays, but as far as I know it doesn't keep track of them as we'd like.

Could you point me to the relevant place in the docs?

> Is there any way to write host code in either PyCUDA or Reikna, do you happen to know?

I don't think PyCUDA supports it, and by extension, neither does Reikna.

> If there is, I could try experimenting with managed memory or otherwise doing my own copying. (I could write it in pure CUDA C++, compile it, and call it with ctypes

Or, perhaps, cffi would be a better variant.

> Barring that, I suppose it'd work if I pass arrbase as a parameter, and then I can transpose the whole base array and make my own views internally. Less than elegant/intuitive from a library-user perspective, but I don't actually seriously expect anyone else to use this code, I suppose.

As long as the padded array stays on the GPU, it is processed correctly; it's only when you copy it that the problems arise. The question is which variant to prefer: remove the padding on copy and return a contiguous array on the CPU, or preserve the structure and return a padded array? (Technically, both can be available, but one has to be the default.) Do you need the latter in your code?

> Can you point me to the relevant place in the docs?

Well, for example, there's np.pad, which pads an array in a variety of helpful ways but just returns an ordinary, contiguous numpy array at the end of it.
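For instance (a throwaway example, not from my actual code):

```python
import numpy as np

a = np.arange(1, 8, dtype=np.uint8)
padded = np.pad(a, (1, 2), mode='constant')   # one zero before, two after
print(padded)                                 # [0 1 2 3 4 5 6 7 0 0]
# `padded` is just a fresh, ordinary array; nothing in it records
# which elements are payload and which are padding.
```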

> I don't think PyCUDA supports it, and by extension, neither does Reikna.

Unfortunate.

> Or, perhaps, cffi would be a better variant.

Ooh. Yes, that does look promising. Thanks.
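(Something along these lines, presumably; the library and the copy_padded signature are the same hypothetical ones as above, just declared through cffi instead of hand-written argtypes:)

```python
from cffi import FFI
import numpy as np

ffi = FFI()
ffi.cdef("""
    void copy_padded(const void* src_dev, void* dst_host,
                     size_t rows, size_t row_bytes, size_t src_pitch);
""")
lib = ffi.dlopen("./libpadcopy.so")            # hypothetical library, as above

dst = np.empty((8, 7), dtype=np.uint8)
lib.copy_padded(ffi.cast("const void *", int(gpu_array.gpudata)),
                ffi.cast("void *", dst.ctypes.data),
                dst.shape[0], dst.shape[1] * dst.itemsize, padded_pitch_bytes)
```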

> As long as the padded array stays on the GPU, it is processed correctly; it's only when you copy it that the problems arise. The question is which variant to prefer: remove the padding on copy and return a contiguous array on the CPU, or preserve the structure and return a padded array? (Technically, both can be available, but one has to be the default.) Do you need the latter in your code?

Not on return, no. As long as I'm processing it I need the padding, but once I transfer it back, at least for this code in particular, I'm basically done with processing and only care about pretty-printing of one form or another.
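(Concretely, even if the transfer preserved the structure, getting the plain array back on my side would be trivial anyway; padded_host and the 8x7 region below are just stand-ins:)

```python
import numpy as np

# Stand-in for a "preserve the structure" transfer: the raw padded block,
# with the useful 8x7 region starting at row 1.
padded_host = np.zeros((10, 8), dtype=np.uint8)
useful = padded_host[1:9, :7]                    # a view that keeps the padded pitch
plain = np.ascontiguousarray(useful)             # the contiguous array I actually want
print(plain.shape, plain.flags['C_CONTIGUOUS'])  # (8, 7) True
```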