fjarri/reikna

Reduce resets offset to 0 when called

Closed this issue · 21 comments

As the title says, even if Reduce is called along an axis and not over the entire array, the offset is reset to zero after the call, but the shape is not updated:

reduce = Reduce(Type(np.ubyte, shape=(65,32), offset=32), predicate_sum(np.ubyte), axes=1)
reduce.parameter.output

Type(uint8, shape=(65,), strides=(1,), offset=0)

I would've expected it to (ideally) choose a new offset intelligently, based on the old offset and strides (so that one 'row' of 32 bytes of offset becomes one 'row' of 1 byte of offset after the reduction); or, if that assumes too much, either to leave the offset in place and update the shape to fit, or to reset the offset to zero and remove rows from the shape to compensate. The current behavior could be correct if the shape is assumed to be the shape of the "true data", independent of the offset - but I don't know how to make an object like that. Is that possible?

(Would it be easier to write an n-d padding system than to make offsets consistent, or am I completely off-base?)

Now that I think about it, it seems that I had a wrong mental model of how offsets work in PyOpenCL. What load_idx/store_idx/etc assume is that the shape of an array is the shape of the "true data". That is, if you have shape=(8,), all 8 elements are accessible, regardless of the offset (which is just added to the memory pointer to get the address of the 0-th element). Apparently, PyOpenCL works differently: it's your job to ensure that the offset and the shape are consistent with the total allocated memory size. Which, essentially, means that if you're passing a nonzero offset to the constructor, you must also pass a data pointer allocated elsewhere.
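That mental model can be sketched on the CPU with numpy - load_idx here is just an illustrative stand-in for the kernel-side accessor, not actual Reikna code:

```python
import numpy as np

# A flat 64-byte "allocation" standing in for GPU memory.
buf = np.arange(64, dtype=np.uint8)

def load_idx(buf, offset, idx):
    # The model described above: the offset is simply added to the base
    # pointer, and the shape describes the "true data" past that point.
    return buf[offset + idx]

# shape=(8,), offset=32: all 8 elements are accessible as long as
# offset + 8 <= buf.nbytes (the consistency the user must ensure).
print([int(load_idx(buf, 32, i)) for i in range(8)])  # 32..39
```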

On a separate note, I don't think any system more complicated than offset+strides is necessary, since (if I'm not mistaken) it can describe any set of paddings in n dimensions.

Now, there are two problems here. The first is Reduce resetting the offset (and the strides) to default values for the output array. The simplest fix would be to allow the user to specify the output array metadata explicitly. I am not sure what the best algorithm for deriving it would be. Perhaps keeping the offset and the strides of the non-reduced dimensions? That way, if you reduce a view of a larger array, the result will still be a view on that array, with the reduced coordinate set to 0.

The second problem is the inconsistency between the Computation's expectations about the shape and the offset, and what pyopencl.Array allows one to do. I think I can ensure that reikna.Array objects have consistent metadata, but ensuring that for an external pyopencl.Array will be the responsibility of the user.

Ah, I had missed that you can manually specify strides. Then yes, offset + strides + manual allocation covers all possibilities. (By the way, how do you instantiate an agnostic array from a pointer? I didn't see it covered in the documentation, but maybe I just missed it.)

Keeping the strides was roughly what I was suggesting above - it seems like it'd do the intuitively "right" thing, which is to reduce over and remove the padding around the axes you're squishing/removing and leave all the others alone. Though I was suggesting that, rather than keeping the strides themselves, you'd want to keep the relative strides - that seems like a more common use case than the rather awkward one of trying to Reduce an array in-place, which essentially fragments your memory usage.

(To be honest, I'm rather new to GPGPU in general, so I don't have much to say about pyopencl; I'm only familiar with PyCUDA at the moment, and that barely. >.>)

By the way, what's your use case for offsets? The only scenario I've considered so far is a view, which in PyCUDA does not require offsets, because the view object is created from a modified pointer to the original data.

Padding, basically. I want to efficiently calculate the moving average/moving sum of a large matrix, and as far as I know the best way to do that is to Scan the array and then just subtract each prefix sum from the one n elements before. That has obvious edge-case problems at the beginning and end, though, so rather than deal with them I'm using offsets so I can run off the edge of the array and not worry about walking on memory.
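A CPU-side numpy sketch of the scan-then-subtract approach (the leading zero in the prefix array plays the same role as the padding/offset trick):

```python
import numpy as np

def moving_sum(x, n):
    # Moving sum over a window of n via a prefix scan: each window sum
    # is the difference of two prefix sums, so the whole computation is
    # O(len(x)) regardless of the window size.
    prefix = np.concatenate(([0], np.cumsum(x)))  # prefix[i] = sum(x[:i])
    return prefix[n:] - prefix[:-n]

x = np.array([1, 2, 3, 4, 5])
print(moving_sum(x, 2))  # [3 5 7 9]
```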

Part 1: correct support of non-zero offsets in computations. The next thing is to support them in temporary arrays. The problem here is that temporary arrays are sometimes repacked, which requires updating all gpudata properties. A temporary array with an offset has its gpudata depending on some other array, which has to be updated too, and the gpudata of the first array re-calculated.

So just for my own edification - the underlying implementation of an offset/custom-strided array is basically a pre-defined, managed view into another, underlying array?

What do you mean by "repacking" here?

the underlying implementation of an offset/custom-strided array is basically a pre-defined, managed view into another, underlying array?

Yes, except there is a difference between how PyCUDA and PyOpenCL handle it, and I tried to preserve it. PyCUDA carries a reference to the base array and a (base pointer + offset) value, without actually having the offset attribute. PyOpenCL carries a reference to the base buffer (not the array), and an explicit offset. From the load_idx/store_idx level it is the same thing, of course, but I wanted to preserve compatibility.

What do you mean by "repacking" here?

Different temp_array allocations may reside in the same physical allocation to reduce the memory footprint. If you call temp_array() from a ComputationPlan object, it makes sure that temporary arrays used at the same time reside in different physical locations, so there is no clash. Normally Reikna adds physical buffers as needed when you request new temporary arrays, but over time this can lead to a non-optimal distribution of virtual buffers, so a repack is needed: it finds the minimum amount of physical memory that can host all the virtual buffers, given the dependencies. There is also an option to repack on every temp_array request, but it is a bit slow.

The temporary memory manager holds the references to all temporary memory arrays, and during repacking it substitutes their .gpudata attributes with the new ones. Now if we have an nonzero offset array which has its gpudata dependent on some other array, it has to be handled somehow.
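The packing problem described above can be sketched in a few lines - this is only an illustration of the idea (greedy first-fit over conflict sets), not Reikna's actual algorithm, and all the names are made up:

```python
def pack(sizes, deps):
    # Assign each virtual buffer to the first physical buffer that none
    # of its dependencies already occupy.  `sizes`: virtual id -> byte
    # size; `deps`: virtual id -> set of ids that may be live at the
    # same time (and so must not share physical memory).
    physical = []      # list of (size, set of occupant ids)
    assignment = {}
    for vid in sorted(sizes, key=sizes.get, reverse=True):
        for i, (size, occupants) in enumerate(physical):
            if occupants.isdisjoint(deps.get(vid, set())):
                physical[i] = (max(size, sizes[vid]), occupants | {vid})
                assignment[vid] = i
                break
        else:
            physical.append((sizes[vid], {vid}))
            assignment[vid] = len(physical) - 1
    return assignment, sum(size for size, _ in physical)

# 'a' and 'b' clash (used at the same time); 'c' can share with either.
assignment, total = pack({"a": 100, "b": 100, "c": 50},
                         {"a": {"b"}, "b": {"a"}, "c": set()})
print(assignment, total)  # two physical buffers, 200 bytes total
```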

Ah, I ... sort of see. So in the case of an offset array, the temporary memory manager needs to know to recalculate the gpudata parameters for the original array, in order to calculate the parameters for the array it was actually passed?

Ah, I ... sort of see. So in the case of an offset array, the temporary memory manager needs to know to recalculate the gpudata parameters for the original array, in order to calculate the parameters for the array it was actually passed?

Yep, pretty much. I think I have made that work now, and added a custom output_arr_t parameter for Reduce, so you can try your code with it and see if it works.

I am still unsure about the derivation of the output array parameters; you can't really just scale the strides, because they are not necessarily proportional to the array dimensions... If you have a proposed algorithm (in the form of a function (shape, strides, offset, reduce_axes) -> (shape, strides, offset)), please post it here.

Well, let's see...

Fundamentally, thought of as an operator on a purely linear 1d memory bank, Reduce over axis n is saying:

def reduce(arr, f, n):
    for i in range(strides[n]):
        acc = starting_value
        for j in range(i, end_of_dimension, strides[n]):
            acc = f(acc, arr[j])
        out[i] = acc

or so, right? That middle step - stepping "to the end of the dimension" - already assumes that the strides are put together in some sane way, so that the array makes sense as "a physical brick in a physical box" -- that is, the array is believably pretending to be an n-dimensional object floating in an n-dimensional room, despite being fundamentally 1d.
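As a CPU-side sanity check of that loop shape for the 2-D case (numpy stands in for the flat GPU buffer; strides here are in elements, not bytes, and the names are just for illustration):

```python
import numpy as np

def reduce_axis1_flat(flat, shape, strides, f, init):
    # Reduce a 2-D array stored in a flat buffer over axis 1,
    # walking memory with explicit strides (in elements).
    rows, cols = shape
    out = []
    for i in range(rows):
        acc = init
        for j in range(cols):
            acc = f(acc, flat[i * strides[0] + j * strides[1]])
        out.append(acc)
    return out

a = np.arange(12, dtype=np.int64).reshape(3, 4)
result = reduce_axis1_flat(a.ravel(), a.shape, (4, 1), lambda x, y: x + y, 0)
print(result)                  # [6, 22, 38]
print(a.sum(axis=1).tolist())  # matches
```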

Or, well, I guess to minimize the number of assumptions, you can say that Reduce fundamentally assumes that the array at least makes sense as a three-dimensional object - one axis with box-length (padded length) strides[0]/strides[n], consisting of all preceding strides; one axis with length strides[n] / strides[n-1], the reduction axis; and one axis with length strides[n-1], combining all succeeding strides. And if there's any weirdness, at least strides[0] should be an integer multiple of strides[n] and strides[n] should be an integer multiple of strides[n-1], i.e. all the weirdness is either "before" or "after" the reduced dimension.

So since you have to make that assumption anyway for "reduction along an axis" to make sense, may as well run with it -- for each reduction axis n, take

strides -> [stride / strides[n] for stride in strides[:n]] + strides[n+1:]

I suppose if you want to be really careful about pathological edge cases like https://gist.github.com/swayson/20357dc7ad07b7a3d3a0292396be00a6, you could do

strides -> [stride / strides[n] for stride in strides if stride > strides[n]]

which seems like it'd do the right thing in any reasonable scenario that there's a right thing to do. In either case, the offset should probably be treated as if it were part of the array proper - as if the algorithm was bothering to reduce over some number of empty rows appended to the beginning of the matrix - so the offset should just be divided by strides[n]. Well, there's again the weird edge case of someone trying "n empty rows plus k empty columns" or the like, so I guess it would be more like

offset -> offset / strides[n] + offset % strides[n]

or so? That doesn't handle the case of "n empty rows minus k empty columns" but I'm not sure there is a good way of automatically detecting that...
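For what it's worth, the proposal above written out as the kind of function requested earlier - purely a sketch, for a single reduced axis, with all quantities in elements rather than bytes, and assuming the "brick in a box" layout:

```python
def derive_output_metadata(shape, strides, offset, axis):
    # Proposal only: strides larger than strides[axis] are divided by
    # it, smaller ones are kept as-is, and the offset is split the same
    # way (rows of padding scale down, columns of padding are kept).
    s = strides[axis]
    new_shape = shape[:axis] + shape[axis + 1:]
    new_strides = tuple(st // s for st in strides[:axis]) + strides[axis + 1:]
    new_offset = offset // s + offset % s
    return new_shape, new_strides, new_offset

# The example from the original report: shape=(65, 32), C-contiguous
# strides=(32, 1), offset=32, reduced over axis 1.
print(derive_output_metadata((65, 32), (32, 1), 32, 1))
# ((65,), (32,), 32)
```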

Anyway, that's my thought process. Or did you manage to implement Reduce without that assumption somehow? If so then the above is pretty much irrelevant :p.

Anyway, that's my thought process. Or did you manage to implement Reduce without that assumption somehow? If so then the above is pretty much irrelevant :p.

Right now if you want some non-standard output strides/offset, you have to specify it explicitly. (Actually, now that I think about it, the same effect could be achieved by attaching a copy() transformation with the required output metadata, but I guess it's a bit less code to write). I suggest you try your code with it first to see if all the other parts (temporary memory, nonzero offsets) are working correctly.

There are no restrictions on the offset and the strides, so I do not think that your algorithm would work every time. Technically, neither is required to be a multiple of anything (currently Reikna requires strides to be multiples of the itemsize, but even that can be relaxed).

The thing with guessing the new strides/offset is that it is just too complicated and, more often than not, irrelevant for the user. And the required algorithm may differ depending on the user's goals. For example, take your offset calculation. Say I have a 10x10 2D array of chars, padded by 3 elements on all sides to make life easier for my convolution calculation or something like that. For it, offset=51 and strides=(16, 1). If I reduce it over axis=0, I would probably prefer to keep the padding over the remaining axis, so the desired result for me is offset=3, strides=(1,). Your formula gives offset=6. I'm leaning towards the thought that this guessing function perhaps belongs in your code, because you are the one who knows what kind of result you need.

Well, I don't disagree with that conclusion, though I personally tend to lean towards providing a reasonable default and allowing customization on top of that.

With regards to testing my code - unfortunately, I don't see a way to instantiate a temp_array from an existing gpuarray, which I think I need to do in order to have proper padding. I'm guessing that passing shape and offset alone will create an array of the right shape preceded by offset bytes (though it'd be nice if that were stated explicitly in the docs), but with no padding after the array. Do you have any suggestions?

though I personally tend to lean towards providing a reasonable default and allowing customization on top of that.

Well, now you have a reasonable default of the offset and the strides being reset to default values, and you can customise it either with output_arr_t or with a transformation ;)

With regards to testing my code - unfortunately, I don't see a way to instantiate a temp_array from an existing gpuarray

Hm, isn't that what temp_array_like() does?

I'm guessing that passing shape and offset alone will create an array of the right shape after offset zeroes (though it'd be nice if that were stated explicitly in the docs)

Stated that Reduce resets the strides and the offset by default, or how strides and offset work in general?

but no padding after the array.

That's a good point, currently the total size is taken as strides[0] * shape[0], which overestimates the actually required size (edit: or underestimates it in case of weird overlapping strides). I've just checked, and both PyCUDA and PyOpenCL seem to ignore strides completely when determining the buffer size, which is quite worrying... I think I will have to do that at Reikna level.

Stated that Reduce resets the strides and the offset by default, or how strides and offset work in general?

Strides and offsets in general, particularly creating new arrays if/when Thread.array() gets implemented.

Hm, isn't that what temp_array_like() does?

temp_array_like(), as I understand it, creates an array with the same parameters as the passed array. To reserve enough memory for a padded array, since gpuarray has no padding options, I'd have to include the padding in the shape of the gpuarray. But if I then pass that to temp_array_like(), I end up with a temp_array that has that full shape and zero offset instead of the "true" shape and padded offset that I want. Unless there's a way to change the shape and offset after creation that I missed? I tried just doing .offset = when I was debugging earlier, but it didn't seem to take.

That's a good point, currently the total size is taken as strides[0] * shape[0], which overestimates the actually required size (edit: or underestimates it in case of weird overlapping strides). I've just checked, and both PyCUDA and PyOpenCL seem to ignore strides completely when determining the buffer size, which is quite worrying... I think I will have to do that at Reikna level.

... yes, yes it is...

Also, it underestimates it in the case where we want entire empty rows before and after the array - so, like your earlier example of a 10x10 array with 3 elements of padding in all directions, strides[0] * shape[0] = 16 * 10 = 160, but the correct size should be 16 * 16 = 256.
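A sketch of the size computation that would fix the first part of this - the offset, plus the farthest addressable element, plus one item; the name and signature are made up, strides and offset are in bytes, and non-negative strides are assumed:

```python
def required_size(shape, strides, offset, itemsize=1):
    # Smallest allocation (in bytes) in which every element of the
    # strided view is addressable: offset + farthest element + 1 item.
    farthest = sum((n - 1) * st for n, st in zip(shape, strides))
    return offset + farthest + itemsize

# The padded 10x10 example: strides[0] * shape[0] = 160 underestimates;
# every element is addressable within 205 bytes.  (Keeping the full
# symmetric padding, with 3 trailing rows and columns, would still need
# 16 * 16 = 256 - this only guarantees addressability.)
print(required_size((10, 10), (16, 1), 51))  # 205
```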

Unless there's a way to change the shape and offset after creation that I missed?

You can use temp_array() and pass the strides and the offset explicitly. Of course, it is not as convenient as just using an existing array as a model, but that would require breaking the compatibility of reikna.Array with GPUArray.

Documenting strides/offset and ensuring the correct buffer size seems reasonable, but that's probably a separate issue.

Sorry, I'm still confused.

Let's say I call

temp_array((10,10), np.ubyte, strides=(16,1), offset=51)

Doesn't that allocate an array only 16 * 10 + 51 = 211 bytes long, leaving no padding after the array?

Alternatively, I could try something like

arr = GPUArray((16, 16), np.ubyte)
temp = temp_array_like(arr[3:13,3:13])

and try to define the temp_array from a view, but I don't think that's how temp_array_like works?

Doesn't that allocate an array only 16 * 10 + 51 = 211 bytes long, leaving no padding after the array?

Ideally it should, but since it relies on GPUArray doing the right thing, I'm afraid it currently doesn't. I will fix it soon. If you want padding at the end, perhaps there should be some kind of optional switch for that...

arr = GPUArray((16, 16), np.ubyte)
temp = temp_array_like(arr[3:13,3:13])

That's not a good idea, since a view of a GPUArray has offset=0 (or, rather, it does not have an offset at all). temp_array_like just copies whatever metadata is present in the given object.

BTW, I think that this discussion is diverging more and more from the original issue. If your code is working with the recent changes, perhaps I should close it and continue the discussion in a new issue, or by email.

Fair enough. Reduce with output offsets works on its own, at least - I can't check my original code yet because of the padding issue, but that does belong in a new issue. I'll also open a documentation issue while I'm at it, if that's fine?

(By the way, thank you for your very rapid responses and code updates. I'm not used to getting issue replies this quickly!)

I'll also open a documentation issue while I'm at it, if that's fine?

Yes. It will be especially useful if you tell me all the parts that are not clear enough - it is hard for me to tell myself, after all :)