zarr-developers/zarr-python

resize(): Improve docs & control of what is modified

hailiangzhang opened this issue · 10 comments

Problem description

For a zarr array with shape of (4,6) and chunksize of (2,3), I did the following:

  1. reduce its shape to be (1,1)
  2. increase its shape back to be (4,6)

I found that the final array after step-2 brought back the original values in the first whole chunk.
However, as an end user, I am expecting only the first element to be preserved (since I had already shrunk the shape to be (1,1), and the chunksize should be transparent to the end user).

Minimal, reproducible code sample

import zarr
import numpy as np

z = zarr.open('data', mode='w', shape=(4, 6), chunks=(2, 3), dtype='i4')
z[:] = 1

print("Original zarr array with shape (4,6):")
print(z[:])

print("\nAfter resizing shape to (1,1):")
z.resize((1,1))
print(z[:])

print("\nAfter resizing shape back to (4,6):")
z.resize((4,6))
print(z[:])

print("\nBut I was expecting it to be:")
arr_expected = np.zeros((4,6), dtype='i4')
arr_expected[0,0] = 1
print(arr_expected[:])

Output

Original zarr array with shape (4,6):
[[1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]]

After resizing shape to (1,1):
[[1]]

After resizing shape back to (4,6):
[[1 1 1 0 0 0]
 [1 1 1 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]

But I was expecting it to be:
[[1 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]

Version information

  • zarr.version: 2.11.2

@hailiangzhang, interesting. I can definitely understand your surprise. Reading https://zarr.readthedocs.io/en/stable/api/core.html?highlight=resize#zarr.core.Array.resize, however, the only thing that is clear (at least to me) is that out-of-bounds chunks are removed, not that the in-bounds chunk is rewritten. Assuming someone already relies on the current behavior, rewriting would probably need an extra argument to resize().

Ah, probably this is not very surprising based on the notes you provided:)

So, in this case, I can add the test to my PR as we originally planned.

In the long term, as you mentioned above, we could probably add an extra argument to resize() for rewriting (which may be useful and less confusing for some end users like me:)

That being said, this issue is more of a feature request than a bug report, so please feel free to close it if there is no immediate plan to add this feature (or leave it open so someone can add it when they have a chance:).

Thanks again for your comments @joshmoore !

Deletion can be fairly expensive so I think this was implemented intentionally to avoid deleting data and instead being a metadata only change (fairly quick). Maybe it is worth documenting that resize does not necessarily delete the data?

@jakirkham , agreed (and that's actually what I would imagine:)

Since resize actually deletes data at chunk-level resolution, this could be a little unexpected for end users (who shouldn't need to be aware of the internal data organization), and therefore yes, we could probably explain this more clearly in the documentation (the second sentence is what I added):

If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store. It is noteworthy that the chunks partially falling inside the new array (i.e. boundary chunks) will remain intact, and therefore, any data falling outside of the new array shape but inside the boundary chunks would be recovered by a subsequent resize operation that increases the array shape.

If this looks correct and helpful, I will be happy to send another tiny PR:)
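To make the proposed wording concrete, here is a minimal sketch that models which chunks a shrink keeps and which it deletes. It is pure Python and only mirrors the behavior described in the documentation text above; it is not zarr's implementation, and the helper names `chunk_grid` and `split_chunks_on_shrink` are made up for illustration.

```python
from itertools import product
from math import ceil


def chunk_grid(shape, chunks):
    """Number of chunks along each axis for a given shape and chunk size."""
    return tuple(ceil(s / c) for s, c in zip(shape, chunks))


def split_chunks_on_shrink(old_shape, new_shape, chunks):
    """Partition the old array's chunk indices into (kept, deleted).

    A chunk is kept if it overlaps the new shape at all, so boundary
    chunks survive whole, including cells outside the new shape.
    """
    old_grid = chunk_grid(old_shape, chunks)
    new_grid = chunk_grid(new_shape, chunks)
    kept, deleted = [], []
    for idx in product(*(range(n) for n in old_grid)):
        inside = all(i < n for i, n in zip(idx, new_grid))
        (kept if inside else deleted).append(idx)
    return kept, deleted


kept, deleted = split_chunks_on_shrink((4, 6), (1, 1), (2, 3))
print(kept)     # [(0, 0)]
print(deleted)  # [(0, 1), (1, 0), (1, 1)]
```

For the (4,6) array with (2,3) chunks, only chunk (0, 0) survives the shrink to (1,1), and it survives with all six of its original values. That is why growing back to (4,6) shows a 2x3 block of ones.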

๐Ÿ‘ for the doc improvement. I've updated the title of this issue to:

  • resize(): Improve docs & control of what is modified

with the control being potential new arguments, etc.

๐Ÿ‘ for the doc improvement. I've updated the title of this issue to:

  • resize(): Improve docs & control of what is modified

with the control being potential new arguments, etc.

Cool, I just sent a small PR which adds the comments as described above.
Thanks @joshmoore !

Hi, since my PR has been merged, feel free to close this ticket (and please let me know if I am supposed to do that:)

Happy to leave that up to you. If you think new method arguments are worth it, feel free to leave open. Otherwise, feel free to close.

I found out that if you put every item in its own individual chunk, you will get the desired output after resizing.

SAMPLE

import zarr
import numpy as np

z = zarr.open('data', mode='w', shape=(4, 6), chunks=(1, 1), dtype='i4')
z[:] = 1

print("Original zarr array with shape (4,6):")
print(z[:])

print("\nAfter resizing shape to (1,1):")
z.resize((1,1))
print(z[:])

print("\nAfter resizing shape back to (4,6):")
z.resize((4,6))
print(z[:])

OUTPUT

Original zarr array with shape (4,6):
[[1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]]

After resizing shape to (1,1):
[[1]]

After resizing shape back to (4,6):
[[1 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]

Caveat: This is not optimal when dealing with a large data array and it will also create a huge number of files.
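The size of that caveat is easy to estimate: the number of chunks is the product of the per-axis chunk counts. The sketch below is plain arithmetic, with a hypothetical helper `n_chunks` and a hypothetical 10,000 x 10,000 array as the "large" case. How many files actually end up on disk depends on the store and on which chunks have been written, so treat these figures as upper bounds.

```python
from math import ceil, prod


def n_chunks(shape, chunks):
    """Upper bound on the number of stored chunks for an array."""
    return prod(ceil(s / c) for s, c in zip(shape, chunks))


print(n_chunks((4, 6), (2, 3)))            # 4
print(n_chunks((4, 6), (1, 1)))            # 24
print(n_chunks((10_000, 10_000), (1, 1)))  # 100000000
```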

Thanks for looking into this, @Jaykold. You're right that having a chunk size of 1 will work around the issue, but doing that for all dimensions with anything other than toy data isn't really an option.
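One alternative that keeps the original chunking is to overwrite the soon-to-be-hidden part of the boundary chunks before shrinking. The sketch below is not a zarr feature: `stale_regions` and `shrink_and_clear` are hypothetical helpers, the `fill` default of 0 is an assumption (the array's own fill value may differ), and the code relies only on the `shape`, `chunks`, slice assignment, and `resize()` members already used in this thread.

```python
def stale_regions(old_shape, new_shape, chunks):
    """Slices covering data that would survive a shrink outside new_shape.

    Only the surviving boundary chunks are covered; chunks that the
    shrink deletes anyway are left alone.
    """
    # Extent, per axis, of the chunks that stay in the store after the shrink.
    kept_end = tuple(
        min(old, -(-new // c) * c)  # ceil(new / c) * c, capped at the old extent
        for old, new, c in zip(old_shape, new_shape, chunks)
    )
    regions = []
    for axis, (new, end) in enumerate(zip(new_shape, kept_end)):
        if new < end:
            region = [slice(0, e) for e in kept_end]
            region[axis] = slice(new, end)
            regions.append(tuple(region))
    return regions


def shrink_and_clear(z, new_shape, fill=0):
    """Blank out the stale cells in boundary chunks, then resize.

    `z` is assumed to behave like a zarr array (hypothetical helper).
    """
    for region in stale_regions(z.shape, new_shape, z.chunks):
        z[region] = fill
    z.resize(new_shape)


print(stale_regions((4, 6), (1, 1), (2, 3)))
# [(slice(1, 2, None), slice(0, 3, None)), (slice(0, 2, None), slice(1, 3, None))]
```

With the array from the original report, calling `shrink_and_clear(z, (1, 1))` and then `z.resize((4, 6))` should leave only the first element set. The cost is a rewrite of the boundary chunks, so the shrink is no longer a metadata-only operation.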