data-apis/dataframe-api

Rename entrypoint to `__consortium_api__`?

Closed this issue · 7 comments

If #308 goes in, then the return value of `Column.get_value` will change: it will no longer be a Python scalar, but a `Scalar`.

This means I'll have to update the tests in pandas/Polars:

https://github.com/pandas-dev/pandas/blob/f777e67d2b29cda5b835d30c855b633269f5e8e8/pandas/tests/test_downstream.py#L340-L344

I'll change it to something much simpler that realistically will never break, like asserting something about result.name
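
Something along these lines, say (a sketch of what the simplified pandas check could look like; the exact object and `api_version` are placeholders):

```python
# Hypothetical simplified downstream check: assert only on the column's name,
# which shouldn't break even when get_value's return type changes to Scalar.
import pandas as pd

ser = pd.Series([1, 2, 3], name="a")
result = ser.__column_consortium_standard__(api_version="2023.11-beta")
assert result.name == "a"
```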

If I'm going to have to change things upstream, I'd like to take the chance to rename the entrypoint.

`__dataframe_consortium_standard__` is just... long. Originally we'd suggested `__dataframe_standard__`, but Brock correctly pointed out that that name has normative connotations.
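
For concreteness, a rough before/after at the call site (illustrative only; the renamed method is hypothetical and taken from this issue's title):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Today: the long entrypoint name (requires dataframe-api-compat to be installed).
df_std = df.__dataframe_consortium_standard__(api_version="2023.11-beta")

# With the rename proposed in this issue's title (hypothetical; does not exist today).
df_std = df.__consortium_api__(api_version="2023.11-beta")
```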

We're starting to get positive responses (see koaning/scikit-lego#597, skrub-data/skrub#786), so the time to make changes is running out

My hope is that this would then be the last upstream update we'd need. The rest we can handle here / in dataframe-api-compat.

Slightly dreading starting the conversation though, and the downside is that the minimum pandas version supported by the standard would have to rise to 2.2

An alternative could be that in dataframe-api-compat I just make a decorator, so people can write df-agnostic functions like this:

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any

from dataframe_api_compat import dataframe_api

if TYPE_CHECKING:
    # Typing-only import of the standard's DataFrame protocol (assumed path).
    from dataframe_api import DataFrame


@dataframe_api(api_version='2023.11-beta')
def my_dataframe_agnostic_function(df: DataFrame) -> Any:
    for column_name in df.column_names:
        new_column = df.col(column_name)
        new_column = (new_column - new_column.mean()) / new_column.std()
        df = df.assign(new_column.rename(f'{column_name}_scaled'))

    return df.dataframe
```
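
Usage would then be (a sketch, assuming the hypothetical decorator above converts the native input to a standard-compliant DataFrame and otherwise leaves the return value alone):

```python
import pandas as pd
import polars as pl

data = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}

# The same function accepts any supported native dataframe; `.dataframe`
# inside the function hands back the underlying native object.
pd_result = my_dataframe_agnostic_function(pd.DataFrame(data))
pl_result = my_dataframe_agnostic_function(pl.DataFrame(data))
```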

Then we don't need to bother pandas, and this looks pretty clean anyway

Folks may not want to take on the dataframe-api-compat package as a dependency, even given that it's small, pure Python, and vendorable.

I have no objections to the name change other than it may be a bit confusing when working across arrays, dataframes, and other future types that may have efforts to standardize APIs.

We should probably also have our spec include this dunder method as part of the DataFrame, Column, and maybe Scalar classes?

It's already mentioned here:

The signatures should be (note: docstring is optional):
```python
def __dataframe_consortium_standard__(self, *, api_version: str) -> Any:
    ...

def __column_consortium_standard__(self, *, api_version: str) -> Any:
    ...
```
`api_version` is a string representing the version of the dataframe API specification to be returned, in `'YYYY.MM'` form, for example, `'2023.04'`. If the given version is invalid or not implemented for the given module, an error should be raised. It is suggested to use the earliest API version required for maximum compatibility.
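
For reference, implementing that entrypoint on a native dataframe class could look roughly like this (a sketch; `convert_to_standard_compliant_dataframe` is assumed to be the conversion helper the compat layer provides, and the exact import path may differ):

```python
from typing import Any


class SomeNativeDataFrame:
    """Stand-in for a native dataframe class, e.g. pandas.DataFrame."""

    def __dataframe_consortium_standard__(self, *, api_version: str) -> Any:
        # Assumed helper/path; the real hook would live in a compat layer
        # such as dataframe-api-compat.
        from dataframe_api_compat.pandas_standard import (
            convert_to_standard_compliant_dataframe,
        )

        return convert_to_standard_compliant_dataframe(self, api_version=api_version)
```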

I don't think DataFrame / Column / Scalar need it; this is just the entry-point for going from "non-necessarily-standard-compliant" to "standard-compliant".

If you have a DataFrame as defined in our spec, it's already standard-compliant, and you'd have no need to call `__dataframe_consortium_standard__` on it.

If I get an arbitrary dataframe as input and I want to confirm it's standard-compliant, how do I do that today? In my mind the easiest way would be to have standard-compliant classes implement `__dataframe_consortium_standard__` so that it returns `self`.
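
i.e. something along these lines (a sketch of that suggestion, not spec text):

```python
class CompliantDataFrame:
    """Sketch: a class that already implements the standard."""

    def __dataframe_consortium_standard__(self, *, api_version: str) -> "CompliantDataFrame":
        # Already compliant, so just return self (api_version validation
        # omitted for brevity).
        return self
```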

There's `__dataframe_namespace__` for that.

That returns the namespace and not a compliant dataframe object. So the code would end up looking like:

```python
def get_compliant_dataframe(df):
    if hasattr(df, "__dataframe_namespace__"):
        return df
    else:
        return df.__dataframe_consortium_standard__(...)
```

It feels a bit clunky but I guess it's not too bad?

yeah, and as Ralf said, in the end, people will probably just write their own helper functions
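
e.g. something like this, with the API version pinned (a sketch; the version string is just an example):

```python
API_VERSION = "2023.11-beta"  # example: whichever spec version the caller targets


def get_compliant_dataframe(df):
    """Return a standard-compliant dataframe for either kind of input."""
    if hasattr(df, "__dataframe_namespace__"):
        # Already standard-compliant.
        return df
    return df.__dataframe_consortium_standard__(api_version=API_VERSION)
```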

might as well close then, this isn't too bad