skrub-data/skrub

`drop_nulls` and `is_null` have a different behavior for polars and pandas

Closed this issue · 1 comments

Problem Description

Since polars and pandas handle null values differently, we need to choose which way to go when dispatching functions dealing with missing values.

Currently, we do not have the same behavior for both in the case where NaNs are present. For instance, our pandas implementation of drop_nulls drops everything i.e. "True" null values (None) and NaNs, while our polars implementation only drops "True" nulls.

Snippet to reproduce:

import polars as pl
import pandas as pd
import numpy as np
from skrub._dataframe import _common as ns

data = {"a": ["this", "is", None], "b": [1, np.nan, 3.]}
df_pd = pd.DataFrame(data)
df_pl = pl.DataFrame(data)

ns.drop_nulls(df_pd) # has 1 row
ns.drop_nulls(df_pl) # has 2 rows

We have a similar behavior with is_null which flags every null for pandas and only "True" nulls for polars.

This is part of a larger discussion when backends behave differently. This is something that will happen with other dispatched functions, and I'm wondering if we want to enforce both dispatched functions to yield results from the "less expressive" framework i.e. pandas in our case (or something else if/when we support other backends someday). If we do, we lose some enhancements proposed by certain backends, but if we don't, we must account for these different behaviors in the codebase.

WDYT ?

Feature Description

We could force them having the same behavior with something like

@is_null.specialize("polars")
def _is_null_polars(col):
    if is_float(col):
        return col.is_null() | col.is_nan()
    else:
        return col.is_null()

and

@drop_nulls.specialize("polars")
def _drop_nulls_polars(col):
    return col.drop_nulls().drop_nans()

Alternative Solutions

No response

Additional Context

No response

Thanks, that's a good point! As we discussed IRL, let us treat NaNs as nulls everywhere, as is done in fill_null -- the fact that it hasn't been done for the others is an overlook.
Regarding the larger discussion, I'm not sure we can say something general maybe those are decisions we'll need to make on a case-by-case basis