data-apis/array-api

Using a neutral format to provide lossless interfaces between multidimensional data tools

loco-philippe opened this issue · 5 comments

Do you think that the work described below can be associated with the discussions carried out by data-apis?

I have proposed a neutral format for describing and sharing multidimensional data (see the Jupyter notebook, GitHub repository, and PyPI package).

Its use allows reversible interfaces (round trip without loss) between tools.
The examples discussed are as follows:

  • Xarray (dataset, DataArray)
  • Astropy (NDData)
  • scipp (dataset)
  • JSON format

The notebook shows, for example, that a scipp dataset can be losslessly converted to an Xarray dataset or to JSON format.

It also supports lightweight, easily exchanged structures that carry only metadata, with URIs pointing to data stored in independent environments.
Data typing is based on the semantic types defined by the NTV format.

The package is built on numpy.ndarray.
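To make the round-trip claim concrete, here is a minimal sketch that uses only NumPy and the standard `json` module. It is not the NTV-NumPy API and does not follow the NTV format: the helpers `to_neutral` and `from_neutral` and the field names (`dtype`, `shape`, `dims`, `coords`, `data`) are hypothetical, chosen only to illustrate what "conversion without loss" means for a labeled array.

```python
import json

import numpy as np


def to_neutral(values, dims, coords):
    """Encode a labeled array as a JSON string (hypothetical layout, not NTV)."""
    values = np.asarray(values)
    payload = {
        "dtype": str(values.dtype),   # kept so the element type survives the trip
        "shape": list(values.shape),  # kept so the flat data can be reshaped
        "dims": list(dims),           # dimension (axis) names
        "coords": {name: list(labels) for name, labels in coords.items()},
        "data": values.ravel().tolist(),
    }
    return json.dumps(payload)


def from_neutral(text):
    """Decode the JSON string produced by to_neutral."""
    payload = json.loads(text)
    values = np.array(payload["data"], dtype=payload["dtype"])
    return values.reshape(payload["shape"]), payload["dims"], payload["coords"]


temps = np.array([[20.5, 21.0, 19.8], [18.2, 18.9, 17.5]])
dims = ["city", "day"]
coords = {"city": ["paris", "lyon"], "day": [1, 2, 3]}

encoded = to_neutral(temps, dims, coords)
values, dims_back, coords_back = from_neutral(encoded)
```

In this sketch the round trip is lossless when the decoded values, dtype, dimension names, and coordinate labels all equal the originals. A real neutral format also has to cover units, masks, uncertainties, and non-JSON-native types, which this example ignores.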

A second version will integrate tabular representations (the NTV-TAB format and the [NTV-pandas](https://github.com/loco-philippe/ntv-pandas) package) and the associated interfaces (for example, pandas).

The first (alpha) version of the package will be extended according to the use cases that are raised.

Thank you in advance for your feedback (GitHub issues and [discussions](https://github.com/loco-philippe/ntv-numpy/discussions) are enabled)!

Note: this proposal has also been shared with the affected tools (as issues).

@loco-philippe Thank you for reaching out. The NTV format initiative you propose is certainly interesting work; however, I don't think we're likely to take it up at this stage. As a standardization body, we primarily focus on well-established art within the Python ecosystem.

I think your best bet, for the time being, is to continue to engage individual communities (e.g., NumPy, pandas, Xarray, PyTorch, etc), as you are already doing. If the NTV format achieves widespread adoption, it could eventually become a standardization candidate and something in which we'd engage. But given the project's early stages, I think we are a ways out from that.

Also, I/O is the first topic mentioned as explicitly out-of-scope: https://data-apis.org/array-api/latest/purpose_and_scope.html#out-of-scope. We also haven't considered things like Zarr, Parquet & co. So while work on data formats is in general of interest to the community, I think it's not the best fit for this standard.

If this is about in-memory data exchange, the features needed by Xarray & co. that go beyond what DLPack offers (e.g., labeled axes) aren't part of the standard.
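As an illustration of that boundary, the sketch below exchanges an array through DLPack using NumPy alone (`np.from_dlpack`, available since NumPy 1.22). The `labels` dictionary is a hypothetical stand-in for the kind of metadata a labeled-array library keeps; it is not part of any standard.

```python
import numpy as np

source = np.arange(6, dtype=np.float64).reshape(2, 3)
labels = {"dims": ("city", "day")}  # hypothetical metadata kept beside the array

# DLPack exchanges the buffer together with its shape, dtype and strides.
# There is no slot in the protocol for axis names or coordinates.
received = np.from_dlpack(source)

# The consumer gets a zero-copy view on the same memory.
shares_memory = np.shares_memory(source, received)
```

The consumer ends up with the same data, shape, and dtype, but `labels` has to travel by some other route, which is the gap the comment above points to.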

@rgommers, @kgryte, thank you for taking the time to respond.

In fact, the proposed topic only concerns the structure of multidimensional and tabular data.

When we compare the data models of the main tools, we observe differences that make the interfaces more complex.

The concepts to which I refer complement those defined at the level of Array-API (dtype, ndim, shape, size) and DataFrame-API (column):

  • ndarray
  • variable / labeled array
  • variable properties (unit, quantities, relationship and distance between two variables)
  • dataset (dataarray, datagroup, container)
  • dataset-dimension (axis)
  • dataset-coordinate (multi-dimension)
  • dataset-variable (and extensions: mask, variance, uncertainty, bins, alignment)
  • dataset-relationship between dataset-variables
  • dataset / dataframe equivalence
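The last item, dataset / dataframe equivalence, can be illustrated with a small sketch. It assumes a single variable defined over two dimensions; the names `coords`, `temps`, `rows`, and `rebuilt` are hypothetical, and no particular library's data model is implied.

```python
from itertools import product

import numpy as np

coords = {"city": ["paris", "lyon"], "day": [1, 2, 3]}
temps = np.array([[20.5, 21.0, 19.8], [18.2, 18.9, 17.5]])

# Dataset -> table: one row per combination of coordinate labels.
# product() iterates with the last dimension varying fastest, which
# matches the C-order traversal of temps.ravel().
rows = [
    (*labels, float(value))
    for labels, value in zip(product(*coords.values()), temps.ravel())
]

# Table -> dataset: the dimension lengths give back the shape.
shape = tuple(len(labels) for labels in coords.values())
rebuilt = np.array([row[-1] for row in rows]).reshape(shape)
```

Each row holds the coordinate labels followed by the value, so the table and the array carry the same information as long as every combination of labels is present.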

It seems to me that we could converge on common concepts that do not call existing implementations into question and that would facilitate exchanges. The tool I developed shows that this convergence is possible and that it yields more complete interfaces than the existing ones.

My question was rather whether work is underway on these notions (which seem to me to be within the scope of data-apis) and, if not, whether you think this could be of interest to data-apis.

Have a nice day

> My question was rather whether work is underway on these notions (which seem to me to be within the scope of data-apis) and, if not, whether you think this could be of interest to data-apis.

There is not. All this seems out of scope for the array API standard. I agree it could in principle fit under the Data APIs umbrella, but it's clearly separate from plain arrays/tensors.

Given that the questions are answered, I'll go ahead and close this issue as "interesting, but out of scope for this project". Thanks @loco-philippe for the interest.