data-apis/dataframe-api

Require `api_version` argument in `__dataframe_standard__` rather than `__dataframe_namespace__`?

MarcoGorelli opened this issue · 0 comments

Reminder: what these magic methods are

Currently, the way to convert a non-compliant dataframe to a compliant one is by calling df.__dataframe_standard__:

Libraries which implement the Standard in a separate namespace
are required to provide the following methods:
- ``__dataframe_standard__``: used for converting a non-compliant dataframe to a compliant one;
- ``__column_standard__``: used for converting a non-compliant column to a compliant one.

Once you've got a compliant dataframe, you can get the namespace with __dataframe_namespace__:

    def __dataframe_namespace__(
        self, /, *, api_version: str | None = None
    ) -> Any:
        """
        Returns an object that has all the dataframe API functions on it.

        Parameters
        ----------
        api_version: Optional[str]
            String representing the version of the dataframe API specification
            to be returned, in ``'YYYY.MM'`` form, for example, ``'2023.04'``.
            If it is ``None``, it should return the namespace corresponding to
            latest version of the dataframe API specification. If the given
            version is invalid or not implemented for the given module, an
            error should be raised. Default: ``None``.
        """

Why __dataframe_standard__ needs api_version

Take the following example:

def remove_outliers(df, column):
    # Get a Standard-compliant dataframe.
    df_standard = df.__dataframe_standard__(api_version="2023.07")
    # Use methods from the Standard specification.
    col = df_standard.get_column_by_name(column)
    z_score = (col - col.mean()) / col.std()
    df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3))
    # Return the result as a dataframe from the original library.
    return df_standard_filtered.dataframe

I'm not using __dataframe_namespace__ here, so the only way I have of asking for a specific api_version of the Standard is via __dataframe_standard__.
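For illustration, here's a rough sketch of what accepting api_version at the entry point could look like on the library side. `NativeFrame`, `CompliantFrame` and `SUPPORTED_VERSIONS` are hypothetical names I made up for this example, not part of the Standard or of any library. The "None means latest" and "raise on unknown version" behaviour is an assumption borrowed from the __dataframe_namespace__ docstring above.

```python
from __future__ import annotations

# Hypothetical: the spec versions this toy library implements, oldest first.
SUPPORTED_VERSIONS = ("2023.04", "2023.07")


class CompliantFrame:
    """Hypothetical Standard-compliant wrapper around a native dataframe."""

    def __init__(self, native: NativeFrame, api_version: str) -> None:
        self.dataframe = native  # the original, non-compliant object
        self.api_version = api_version  # the version the caller asked for


class NativeFrame:
    """Hypothetical non-compliant dataframe from some library."""

    def __dataframe_standard__(
        self, *, api_version: str | None = None
    ) -> CompliantFrame:
        # Assumption: None selects the latest supported version.
        if api_version is None:
            api_version = SUPPORTED_VERSIONS[-1]
        # Assumption: unknown or unimplemented versions raise an error.
        if api_version not in SUPPORTED_VERSIONS:
            raise ValueError(f"Unsupported api_version: {api_version!r}")
        return CompliantFrame(self, api_version)


df_standard = NativeFrame().__dataframe_standard__(api_version="2023.07")
```

With something like this, a function such as remove_outliers pins the version once, at conversion time, and never needs to touch the namespace.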

Why __dataframe_namespace__ probably doesn't need api_version

Say I do:

df_standard = df.__dataframe_standard__(api_version="2023.07")
namespace = df_standard.__dataframe_namespace__()

Then it seems natural that the namespace returned would be the "2023.07" one, so api_version doesn't need repeating in __dataframe_namespace__.