xarray-contrib/xbatcher

How do you put batches back together after processing?


In #37, @robintw wrote:

  1. How do you put batches back together after processing?
    My machine learning model is producing a single value as an output, so for a batch of 100 64x64 patches, I get an output of a 100-element array. What's the best way of putting this back into a DataArray that has the same format/co-ordinates as the original input array? I'd be happy with either an array with dimensions of original_size / 64 in both the x and y dimension, or an array of the same size as the input with the single output value repeated for each of the input pixels in that batch.

I've tried to put some of this together myself, but it seems that the x co-ordinate value in the batch DataArray is the same for each batch. I'd have thought this would represent the x co-ordinates that had been extracted from the original DataArray, but it doesn't seem to. For example, if I run:

batches = []
for i, batch in enumerate(bgen):
    batches.append(batch)
    if i == 1:
        break
to get the first two batches, I can then compare their x co-ordinate values:

np.all(batches[0].to_array().squeeze().x == batches[1].to_array().squeeze().x)
and it shows that they're all equal.

Do you have any ideas as to what I could do to be able to put the batches back together?
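For the first option in the question (one output value per 64x64 patch, arranged on a grid of original_size / 64), here is a minimal sketch in plain xarray. It does not use xbatcher itself. The array `da`, the values in `preds`, and above all the row-major ordering of the patches are assumptions made for illustration; check the order in which your generator actually yields patches before relying on the reshape.

```python
import numpy as np
import xarray as xr

PATCH = 64  # patch edge length, as in the question

# Hypothetical stand-in for the original input array, with x/y coordinates.
da = xr.DataArray(
    np.random.rand(256, 192),
    dims=("y", "x"),
    coords={"y": np.arange(256) * 10.0, "x": np.arange(192) * 10.0},
)

ny = da.sizes["y"] // PATCH  # number of patches along y
nx = da.sizes["x"] // PATCH  # number of patches along x

# Stand-in for the model output: one value per patch.
# ASSUMPTION: patches are ordered row-major (all x positions for the first
# row of patches, then the next row, and so on).
preds = np.arange(ny * nx, dtype=float)

# coarsen(...).mean() averages the coordinates within each 64x64 block, so the
# result carries patch-centre coordinates. It is used only as a template here.
template = da.coarsen(y=PATCH, x=PATCH).mean()
coarse = template.copy(data=preds.reshape(ny, nx))

# Second option from the question: repeat each value over every pixel of its patch.
full = da.copy(
    data=np.repeat(np.repeat(coarse.values, PATCH, axis=0), PATCH, axis=1)
)
```

`coarse` has one cell per patch with coordinates at the patch centres, and `full` has the same shape and coordinates as the input. If the patches overlap or the image size is not a multiple of 64, this sketch does not apply as written.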

@tcchiao and I discussed today and she is planning to add an example to the demo notebook.

Good...hard part!

Just wondering if this example is available somewhere?

I came across this discussion: https://discourse.pangeo.io/t/vectorized-sklearn/1444 , which seemed to be solving a similar problem.

Just wondering if this example is available somewhere?

AFAIK the example has not yet been made. It's helpful to hear more interest in this component of the documentation.

I worked out that it is often quite simple to put batches back together.
At least in the simple situations that I am working with, just using .unstack('samples') puts the batches back together into the original geospatial data format. Happy to add a few lines about this in the demo notebook, if you think that is the appropriate place for it.
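To illustrate the idea, here is a small round trip in plain xarray (not xbatcher). It assumes the 'samples' dimension was created with .stack(), so it carries a MultiIndex of the original y and x coordinates, and that the model output keeps that coordinate. The toy array and the "model" (doubling each value) are made up for the example.

```python
import numpy as np
import xarray as xr

# Hypothetical small geospatial array.
da = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("y", "x"),
    coords={"y": [10.0, 20.0, 30.0], "x": [1.0, 2.0, 3.0, 4.0]},
)

# Flatten y and x into a single "samples" dimension backed by a MultiIndex.
stacked = da.stack(samples=("y", "x"))

# Stand-in for a per-sample model: one output value per input sample.
predicted = stacked * 2

# unstack rebuilds the y and x dimensions from the MultiIndex.
restored = predicted.unstack("samples")
```

If the model returns a bare NumPy array, the 'samples' coordinate is lost; in that case one option is to wrap the output with `stacked.copy(data=model_output)` before calling .unstack('samples').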