data-apis/array-api-compat

BUG: `size` returns `nan` for Dask array with unknown `shape`

lucascolley opened this issue · 7 comments

>>> xp
<module 'scipy._lib.array_api_compat.dask.array' from ...
>>> x = xp.asarray([1, 2, 3])
>>> x
dask.array<array, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
>>> from scipy._lib.array_api_compat import size
>>> size(xp.unique_values(x))
nan
>>> size(xp.unique_values(x).compute())
3

I think `size` should call `compute()` if it has to?

Okay, this was already mentioned for `shape` in scikit-learn/scikit-learn#26724 (comment)

This is technically allowed, I think:
https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.shape.html#shape

(technically speaking, it should be `None` instead of `nan`, but I don't think the difference is too important here)

Yeah, the shape is allowed to be `None`, but I think the array-api-compat helper `size` should trigger the computation if needed to return the actual size.
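
As a sketch of what that could look like (the names `materialized_size` and `FakeLazyArray` are hypothetical, and the stand-in class only mimics how a Dask array behaves; this is not a proposed implementation):

```python
import math

def materialized_size(x):
    """Hypothetical helper: return x.size as an int, forcing computation
    only when the size is unknown (Dask reports nan; the standard allows None)."""
    n = x.size
    if n is None or (isinstance(n, float) and math.isnan(n)):
        # For Dask, compute_chunk_sizes() triggers the graph execution
        # needed to materialize the unknown dimensions.
        n = x.compute_chunk_sizes().size
    return int(n)

class FakeLazyArray:
    """Stand-in for a Dask array with an unknown shape (illustration only)."""
    def __init__(self):
        self.size = float("nan")

    def compute_chunk_sizes(self):
        self.size = 3  # pretend the computation revealed three elements
        return self

print(materialized_size(FakeLazyArray()))  # 3
```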

It might be worth adding a version of the wrapped `dask.array` namespace that calls `compute()`, but it's probably better to wait a bit to understand when and why downstream libraries need the shape.

It might also be possible to return a `dask.delayed` shape for Dask; the advantage is that we wouldn't break the Dask graph by computing eagerly. That is another reason I want to wait on this, though I'm not 100% sure it's feasible.

The example I've bumped into in SciPy is `n_clusters = int(xp.unique_values(T).shape[0])`.
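
For context on why that pattern breaks for Dask, a minimal reproduction without Dask: an unknown dimension surfaces as `nan`, and `int()` cannot convert it.

```python
# What shape[0] looks like for a Dask array before compute(): an unknown
# dimension is reported as nan, and int() rejects it outright.
unknown_dim = float("nan")
try:
    n_clusters = int(unknown_dim)
except ValueError as exc:
    print(exc)  # cannot convert float NaN to integer
```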

It's not clear to me whether the wrappers here should be automatically calling `compute()`. I'd like to get a better understanding of the use cases before doing that.

Note that the original purpose of the `size()` helper was to work around PyTorch not exposing `size` as an attribute. Outside of that, the behavior should just match `x.size`. The standard does say `size = None` is allowed: https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.size.html#array_api.array.size (as @lithomas1 points out, Dask uses `nan` instead of `None`, which we could fix here, but that doesn't really solve your problem). So as far as I can tell, the Dask wrapper is standard-compliant.
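
The helper is presumably along these lines (a sketch of the idea, not the actual array-api-compat source; the `Shaped` stand-in is only for illustration):

```python
import math

def size(x):
    """Sketch of a size() helper: derive the size from x.shape, which also
    covers PyTorch, where size is a method rather than an attribute.
    Returns None when any dimension is unknown, as the standard permits."""
    if None in x.shape:
        return None
    return math.prod(x.shape)

class Shaped:
    """Minimal stand-in exposing just a shape attribute (illustration only)."""
    def __init__(self, shape):
        self.shape = shape

print(size(Shaped((2, 3))))     # 6
print(size(Shaped((2, None))))  # None
```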

Maybe you should move the discussion to the array API repo to get a better understanding of how SciPy should handle unmaterialized sizes (for instance, at data-apis/array-api#748, or in a new issue).

Closed in favour of data-apis/array-api#834. There certainly isn't consensus yet that it is right for consuming libraries to force computation.