data-apis/array-api

RFC: `item()` to return scalar for arrays with exactly 1 element.

randolf-scholz opened this issue · 8 comments

def item(self) -> Scalar:
    """If the array contains exactly one element, return it as a scalar; otherwise raise ValueError."""

Examples:

Demo:

import pytest
import torch
import xarray as xr
import pandas as pd
import polars as pl
import numpy as np

@pytest.mark.parametrize("data", [[], [1, 2, 3]])
@pytest.mark.parametrize(
    "array_type", [torch.tensor, np.array, pd.Series, pd.Index, pl.Series, xr.DataArray]
)
def test_item_valueerror(data, array_type):
    array = array_type(data)
    with pytest.raises(ValueError):
        array.item()


@pytest.mark.parametrize(
    "array_type", [torch.tensor, np.array, pd.Series, pd.Index, pl.Series, xr.DataArray]
)
def test_item(array_type):
    array = array_type([1])
    array.item()

Currently, only torch fails, because it raises RuntimeError instead of ValueError.
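Until the exception types agree, one possible workaround is a small wrapper that re-raises RuntimeError as ValueError. This is only a sketch: `item_or_valueerror` and the `FakeTensor` stand-in are hypothetical names invented for illustration, and the wrapper also converts any other RuntimeError raised inside `item()`, which may be broader than desired.

```python
def item_or_valueerror(array):
    """Call array.item(), re-raising RuntimeError as ValueError."""
    try:
        return array.item()
    except RuntimeError as exc:
        # Assumption: the library uses RuntimeError to signal
        # "not exactly one element", as reported for torch above.
        raise ValueError(str(exc)) from exc


class FakeTensor:
    """Hypothetical stand-in that mimics the RuntimeError behaviour."""

    def __init__(self, data):
        self.data = list(data)

    def item(self):
        if len(self.data) != 1:
            raise RuntimeError("expected exactly one element")
        return self.data[0]


print(item_or_valueerror(FakeTensor([7])))
```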

This was discussed in #710, along with the more general to_list, which also works for ND arrays.

item() is a bit different from to_list, and honestly I find it confusing that a method named to_list can return something that is not a list.

.item() is more constrained than to_list indeed, and a bit cleaner. I checked other libraries - NumPy, PyTorch, JAX and CuPy implement .item(), Dask does not. (TF doesn't have it in the docs, so probably also not - but I can't check). CuPy/JAX do the transfer to CPU if the ndarray is on GPU.

This is a minor convenience method though, since float() & co work as well. They are clearer, since they are type-stable, and they also work for Dask. The only downside is that if you want a dtype-generic implementation that returns a single element, you have to write a little utility that calls int/float/complex/bool as appropriate. Something like:

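def as_pyscalar(x):
    xp = x.__array_namespace__()  # array API namespace that x belongs to
    # isdtype() takes a dtype, not an array, so pass x.dtype
    if xp.isdtype(x.dtype, 'real floating'):
        return float(x)
    elif xp.isdtype(x.dtype, 'complex floating'):
        return complex(x)
    elif xp.isdtype(x.dtype, 'integral'):
        return int(x)
    elif xp.isdtype(x.dtype, 'bool'):
        return bool(x)
    else:
        # raise error, or handle custom/non-standard dtypes if desired
        raise TypeError(f"unsupported dtype: {x.dtype}")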

Static typing of such a function, and of .item(), would also be a little annoying as it requires overloads.

item also works on arrays with multiple dimensions, whereas we decided to make it so float does not.

>>> np.array([1]).item()
1
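To make the multi-dimensional case concrete, here is a short NumPy-only check. It shows just what NumPy does today, not what the standard requires, and `item_raises_value_error` is a hypothetical helper name used only for this illustration.

```python
import numpy as np


def item_raises_value_error(data):
    """Return True if np.array(data).item() raises ValueError."""
    try:
        np.array(data).item()
    except ValueError:
        return True
    return False


# A (1, 1) array still has exactly one element, so item() succeeds.
single = np.array([[1]]).item()
print(single, type(single).__name__)

# Empty and multi-element arrays are rejected.
print(item_raises_value_error([]), item_raises_value_error([1, 2, 3]))
```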

We discussed this in a call today, and concluded that this falls into a bucket of functionality that is useful, but also easy to implement on top of what's already in the standard. In addition, there are problems with trying to add this: an item() method is hard, because it's missing in some libraries and missing methods cannot be worked around in array-api-compat. If we did this, a function would be the way to go - but since that's not present in any library, it'd be new - hence more work, and likely to meet resistance from array library maintainers.

Outcome:

  1. Create the array-api-extra package where this kind of function can live, and add it there (probably as as_pyscalar or a similarly descriptive name, not as item)
  2. Only reconsider adding it to the standard itself in the future if most/all array libraries have already added that function.

On a very fundamental level, I believe .item() makes no sense on DataFrame-like objects (pandas.DataFrame, polars.DataFrame, pyarrow.Table, etc.) because these are designed to represent heterogeneous data types.

From a mathematical PoV, item() acts on array-like data with homogeneous type, as a representation of the natural isomorphism V → K, when V is a 1-dimensional vector space over K.

Is this usage guaranteed?

If so, should it be added somewhere to the specification? I looked for it here.

FWIW I also like the item method since it's all I've ever needed and it's simpler than tolist. I wonder if it should be on the array namespace rather than the array: (def item(x: Array, /) -> complex | bool) since it can be implemented using the array's public interface. (This is a common test in OO design for what should be a method versus a bare function.)

Yes, __float__ and so on are guaranteed (modulo the "lazy" note). See https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__float__.html#array_api.array.__float__. Though Ralf's helper should also include an if x.ndim != 1 or x.size != 1: raise ValueError check.