zarr-developers/zarr-python

Proposal: Add Array.blocks using new BlockIndexer (Prototype Code Included)

tasansal opened this issue · 7 comments

Problem description

I have a use case downstream where we want to access "blocks" of chunks. I have implemented a prototype that functions like dask.array.Array.blocks.
It uses the existing zarr.indexing machinery and matches the API of existing indexers.

This allows us to pull a "block" of chunks from data using slicing logic.
For instance, if we have an array with shape (10, 20, 30) with chunk sizes (5, 4, 10)

  • array.blocks[0] maps to array[:5]
  • array.blocks[..., 0] maps to array[:, :, :10]
  • array.blocks[1, 1, 1] maps to array[5:10, 4:8, 10:20]
  • array.blocks[:, 1:4, :] maps to array[:, 4:16, :]

Why not just use dask array and method?

  • We don't want to have dask as a hard requirement in our library.
  • In many simple queries, vanilla zarr is much faster than converting to dask.array + scheduling overhead.
  • We don't want to install all dependencies of dask for just this functionality to keep our library lightweight.

If there are no objections I can start adding this in a new PR as soon as possible.
If we don't want this in zarr, I can keep this in our library as an extension to zarr.

Once the code below are implemented, all of the following evaluates to True as expected.

np.array_equal(array.blocks[0], array[:5])
np.array_equal(array.blocks[..., 0], array[:, :, :10])
np.array_equal(array.blocks[1, 1, 1], array[5:10, 4:8, 10:20])
np.array_equal(array.blocks[:, 1:4, :], array[:, 4:16, :])

Tests and docs must be written, of course, I don't have that yet.

Implementation Details

Methods and attributes that would go into zarr.core.Array

class Array:
    def __init__(...):
        ...
        self._blocks = Blocks(self)

    @property
    def blocks(self):
        return self._blocks

    def get_block_selection(self, selection, out=None, fields=None):
        if not self._cache_metadata:
            self._load_metadata()

        # check args
        check_fields(fields, self._dtype)

        # setup indexer
        indexer = BlockIndexer(selection, self)

        return self._get_selection(indexer=indexer, out=out, fields=fields)

    def set_block_selection(self, selection, value, fields=None):
        # guard conditions
        if self._read_only:
            raise ReadOnlyError()

        # refresh metadata
        if not self._cache_metadata:
            self._load_metadata_nosync()

        # setup indexer
        indexer = BlockIndexer(selection, self)

        self._set_selection(indexer, value, fields=fields)

BlockIndexer class (compare to zarr.indexing.OrthogonalIndexer)

class BlockIndexer:
    def __init__(self, selection, array):
        # handle ellipsis
        selection = replace_ellipsis(selection, array._shape)

        # setup per-dimension indexers
        dim_indexers = []
        for dim_sel, dim_len, dim_chunk_len in \
                zip(selection, array._shape, array._chunks):

            if is_integer(dim_sel):
                block_start = dim_sel * dim_chunk_len
                block_stop = block_start + dim_chunk_len
                block_slice = slice(block_start, block_stop)

            elif is_slice(dim_sel):
                start, stop, _ = dim_sel.indices(dim_len // dim_chunk_len)
                block_start = start * dim_chunk_len
                block_stop = stop * dim_chunk_len
                block_slice = slice(block_start, block_stop)

            else:
                raise IndexError('unsupported selection item for block indexing; '
                                 'expected integer or slice, got {!r}'
                                 .format(type(dim_sel)))

            dim_indexer = SliceDimIndexer(block_slice, dim_len, dim_chunk_len)
            dim_indexers.append(dim_indexer)

        self.dim_indexers = dim_indexers
        self.shape = tuple(s.nitems for s in self.dim_indexers)
        self.drop_axes = None

    def __iter__(self):
        for dim_projections in itertools.product(*self.dim_indexers):
            chunk_coords = tuple(p.dim_chunk_ix for p in dim_projections)
            chunk_selection = tuple(p.dim_chunk_sel for p in dim_projections)
            out_selection = tuple(p.dim_out_sel for p in dim_projections
                                  if p.dim_out_sel is not None)

            yield ChunkProjection(chunk_coords, chunk_selection, out_selection)

Blocks property class for slicing (compare to zarr.indexing.VIndex or zarr.indexing.OIndex

class Blocks:
    def __init__(self, array):
        self.array = array

    def __getitem__(self, selection):
        fields, selection = pop_fields(selection)
        selection = ensure_tuple(selection)
        return self.array.get_block_selection(selection, fields=fields)

    def __setitem__(self, selection, value):
        fields, selection = pop_fields(selection)
        selection = ensure_tuple(selection)
        return self.array.set_block_selection(selection, value, fields=fields)

This seems to also solve and close: #543 and #545

Also, potentially may be used as a common indexer for Array._chunk_getitem(s).

Thoughts?

Hi @tasansal. I do have the feeling that this issue comes up frequently. (I have a vague sense that both @GenevieveBuckley and @thewtex have dealt with related issues.)

Are the code snippets above the entirety of the prototype? If so, would it make sense to get them into a PR with tests? (And, would those tests require dask? If so, it could potentially be added as an optional dev requirement as with fsspec and s3fs)

@joshmoore
That is pretty much all of it in pieces. I already implemented it in my dev environment and seems to work as expected.
Everything is zarr so dask is not required for any part of it.

It passes my simple manual tests, but unit tests need to be written for sure. Especially for edge cases like negative indices etc.

If there are existing mechanisms to calculate chunk boundaries, that can also be reused, but I didn't see one. To be more specific, this part:

start, stop, _ = dim_sel.indices(dim_len // dim_chunk_len)
block_start = start * dim_chunk_len
block_stop = stop * dim_chunk_len
block_slice = slice(block_start, block_stop)

I already implemented it in my dev environment and seems to work as expected. Everything is zarr so dask is not required for any part of it.

👍 Sounds good, @tasansal.

@joshmoore, do we need more input from others or should I go ahead and start?

For my part, I'd say go for it. 👍

I believe this should have been closed by #1428.

@tasansal - feel free to reopen if I have that wrong.