Outline
asmeurer opened this issue · 7 comments
Here's the outline from the talk proposal (I've also uploaded it here https://github.com/data-apis/scipy-2023-presentation/blob/main/outline.md)
So the first question is whether there's anything we should add for the paper.
- A motivating example, adding array API standard usage to a real-world scientific data analysis script so it runs with CuPy and PyTorch in addition to NumPy.
- History of the Data APIs Consortium and array API specification.
- The scope and general design principles of the specification.
- Current status of implementations:
- Two versions of the standard have been released, 2021.12 and 2022.12.
- The standard includes all important core array functionality and extensions for linear algebra and Fast Fourier Transforms.
- NumPy and CuPy have complete reference implementations in submodules (numpy.array_api and cupy.array_api).
- NumPy, CuPy, and PyTorch have near-full compliance and have plans to approach full compliance.
- array-api-compat is a wrapper library designed to be vendored by consuming libraries like scikit-learn that makes NumPy, CuPy, and PyTorch use a uniform API.
- The array-api-tests package is a rigorous and complete test suite for the array API that can be used to determine where a library follows the specification and where it deviates.
- Future work
- Achieve full array API compliance in NumPy as part of NumPy 2.0.
- Focus on improving adoption by consuming libraries, such as SciPy and scikit-learn.
- Reporting website that lists array API compliance by library.
- Work is being done to create a similar standard for dataframe libraries. This work has already produced a common dataframe interchange API.
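The motivating-example bullet above could be illustrated with a short sketch. This is a hypothetical snippet, not the actual script from the talk; the function takes the array namespace `xp` explicitly so it runs with plain NumPy here, whereas with the array-api-compat package one would infer it via `array_namespace(x)`.

```python
import numpy as np

def standardize(x, xp):
    # `xp` is the array API namespace (numpy, cupy, torch, ...).
    # With array-api-compat one would instead write: xp = array_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)

x = np.asarray([1.0, 2.0, 3.0, 4.0])
print(standardize(x, np))  # the same call works unchanged with CuPy or PyTorch arrays
```

Because the function only uses names the standard specifies, swapping `np` for `cupy` or `torch` is all that's needed to run it on another library.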
One question is how much of the talk and/or paper should be spent discussing the dataframe work. The most significant work that's been done so far is on the array side, but I want to make sure that dataframes also get a fair share of the discussion.
I added this to the agenda for tomorrow's dataframe call - let's see what everyone there thinks.
I've fleshed out the outline here https://github.com/data-apis/scipy-2023-presentation/blob/main/outline.md. I plan to use that as the basis for the paper. Let me know if you have any thoughts or suggestions.
That looks pretty good! For the paper I think that indeed you want to focus on what has already been done and is complete. And in the presentation, spend a bit more time on dataframes, and what's next for SciPy, scikit-learn & beyond, and how people can help or adopt.
Thanks for leading the effort, Aaron! The outline looks good to me.
Two minor nits just FYI 🙂
> Execution semantics are out of scope. This includes single-threaded vs. parallel execution, task scheduling and synchronization, eager vs. delayed evaluation, performance characteristics of a particular implementation of the standard, and other such topics.
I am sure Dask/cuNumerics would argue that `__dlpack__` assumes the array fits on a single node and is not distributed. We don't yet have any zero-copy exchange protocol for distributed arrays. Given the page limit we probably don't want to mention such details, so I'm just raising it here for the record.
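For context, a minimal sketch of the single-node exchange that `__dlpack__` provides, using NumPy's `from_dlpack` (available in recent NumPy versions). In practice the producer and consumer are usually different libraries, e.g. a PyTorch consumer ingesting a NumPy array:

```python
import numpy as np

x = np.arange(4)
# Zero-copy hand-off through the DLPack protocol; the consumer calls
# x.__dlpack__() under the hood. This only works for an array resident
# on a single device, which is the limitation noted above.
y = np.from_dlpack(x)
print(y)
```

A distributed array has no single buffer to hand off this way, which is why the protocol doesn't apply to libraries like Dask.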
> Standardization of these dtypes is out of scope: bfloat16, complex, extended precision floating point, datetime, string, object and void dtypes.
"complex" shouldn't be listed as out-of-scope.
I agree with Ralf on focusing on what we have already done. For future work, we should avoid promising any plan that has even a slight chance of being abandoned. We can just vaguely say that we want to listen to community feedback and increase the API surface as needed, etc.
Also, when mentioning API compliance, we probably want to go with an inclusive tone. Some libraries like Dask cannot support `__dlpack__` by nature, for example, but we still want to encourage libraries to adopt as much as they can, without worrying about looking bad compared to other libraries. It shouldn't be framed as a competition, at least not in our paper.
I copied some of these things from https://data-apis.org/array-api/latest/purpose_and_scope.html. We should go through that page and update it as some of the things written there are out of date.
> Also, when mentioning API compliance, we probably want to go with an inclusive tone. Some libraries like Dask cannot do dlpack by nature, for example, but we still want to encourage libraries to adopt as much as they can, without worrying about looking bad compared to other libraries. It shouldn't be framed as a competition, at least not on our paper.
I can add a sentence or two about this. There's also PyTorch, which doesn't have unsigned integer types other than uint8. Hopefully we can also discuss Athan's compliance tracking site and mention this there.