data-apis/dataframe-api

How to note that some methods aren't available for lazy dataframes?

MarcoGorelli opened this issue · 27 comments

Here are some examples which don't work with polars lazyframes:

  • DataFrame.shape (the length isn't, and can't be, known)
  • Column.__len__ (same as above)
  • comparisons (e.g. df1 == df2, but there may be a way around this)
  • anything column-related (as there's no polars.LazyColumn - but there may be a way around this)
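
For instance, a minimal repro (assuming a recent polars version - the exact exception types may differ between releases):

import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3]})

lf.shape  # raises: AttributeError, LazyFrame has no 'shape'
len(lf)   # raises: TypeError, LazyFrame has no length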

I'd like to think we can work around points 3 and 4, but not 1 and 2 - I'm not really sure how those could work.

I know we want everything to be independent of the execution engine, but how do we plan for these to work for lazy engines?

There was a discussion in the Array API repo a while back that was somewhat similar. It was about things like __bool__ and __float__ and how to deal with these for lazy array libraries. data-apis/array-api#642 (comment) concluded that it was OK for a computation to be forced if a user accessed those.

From the point of view of someone using the dataframe API to write a tool that consumes dataframes (lazy or eager) you don't want to have to distinguish in your code whether a dataframe is lazy or not. It makes me think that if a user wants to know the shape or len, then they want to know the shape/len now. So a lazy dataframe would have to go away and evaluate enough to be able to answer that query. A bit like with bool(some_array), I want to know the answer now, not at some point later.

@MarcoGorelli what does "don't work" mean? Does it force evaluation, raise an exception, or something else? I think either choice is valid - e.g., Dask will force evaluation when needed, while the comment @betatim linked to had an implementation that needed to be 100% lazy and hence had to raise.

Also, I think DataFrame.shape() is not a good example - I don't see a reason why that cannot remain lazy. The reason bool(), int() et al. couldn't stay lazy is because the Python language itself will not allow returning anything other than actual bool/int/etc. instances. For .shape(), it's perfectly okay to return a lazy object.

xref gh-195 for other lazy-implementation-specific questions

Apologies, by "doesn't work" I meant that it raises an exception

I think I'd be -1 on forcing computation for .shape (or anything else). Right now, if you have a polars.LazyFrame, then everything is guaranteed to be practically zero-cost up until you hit .collect(). If some methods can trigger potentially very expensive computations, then that guarantee is broken

So the alternatives would be:

  • raise for .shape
  • "For .shape(), it's perfectly okay to return a lazy object."

For the latter, what would the lazy object be? A lazy tuple? Such a thing doesn't exist in polars. If it's something which exists only in the standard, then presumably there needs to be some way to materialise it?

Anyway, I think this highlights that there's a need for discussion on this topic. But that's what we're here for! Fortunately it's on the agenda for tonight's call

I think my general stance is:

  • no method should force computation for lazy engines
  • some methods may raise if used on lazy objects, and we should document this

For the latter, what would the lazy object be? A lazy tuple? Such a thing doesn't exist in polars.

It's not hard to build though, so I don't think there's a fundamental problem there. Polars can add it if desired.

If it's something which exists only in the standard, then presumably there needs to be some way to materialise it?

The standard does not have anything that's specified as being lazy. It only has a tuple return type, with the understanding that it may be any other object that duck types as a tuple.

Materialization is library-specific. I'd expect that if a Polars dataframe has a .collect(), then any other lazy objects that may exist in Polars would also have a .collect() (but if it's something else, then that's equally fine).
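
To illustrate, here's a minimal sketch of such a duck-typed lazy shape (LazyShape and its .collect are hypothetical, not existing polars or standard API; it assumes a recent polars where pl.len() exists):

import polars as pl
from collections.abc import Sequence

class LazyShape(Sequence):
    # Hypothetical: behaves like a 2-tuple, but defers computing the
    # row count until explicitly materialized via .collect().
    def __init__(self, lf: pl.LazyFrame) -> None:
        self._lf = lf

    def collect(self) -> tuple[int, int]:
        # Only the row count requires computation; the width is already known.
        n_rows = self._lf.select(pl.len()).collect().item()
        return (n_rows, len(self._lf.columns))

    def __getitem__(self, i):
        return self.collect()[i]

    def __len__(self) -> int:
        return 2

The point being: indexing into it (or calling .collect()) is what triggers computation; until then it's free to pass around.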

Sure - and if polars didn't want to add LazyFrame.shape, we'd be OK with just raising NotImplementedError in its standard-compliant implementation?

Alternatively, for things like shape and length which are data-dependent, we could allow returning something like NULL or -1 to indicate that the value is unknown?

For object equality comparisons, I'm not sure they're even a part of the standard yet? But in general, for APIs that return Scalars that could possibly be used for control flow, what would the experience be for a lazy implementation if only an explicit API can force materialization? I.e. what would be the experience of doing something like if df.get_column_by_name('a').any(): ...?

For object equality comparisons, I'm not sure they're even a part of the standard yet?

there is DataFrame.__eq__:

def __eq__(self, other: DataFrame | Scalar) -> DataFrame:  # type: ignore[override]
    """
    Compare for equality.

    Nulls should follow Kleene Logic.

    Parameters
    ----------
    other : DataFrame or Scalar
        If DataFrame, must have same length and matching columns.
        "Scalar" here is defined implicitly by what scalar types are allowed
        for the operation by the underlying dtypes.

    Returns
    -------
    DataFrame
    """
    ...

But in general, for APIs that return Scalars that could possibly be used for control flow, what would the experience be for a lazy implementation if only an explicit API can force materialization? I.e. what would be the experience of doing something like if df.get_column_by_name('a').any(): ...?

Yeah it's tricky. I really think this needs discussion and design

Staying with polars as an example - indeed, what would df.get_column_by_name('a').any() return?

  • in the eager case, it returns a Python scalar
  • in the lazy case, we could say that it returns some lazy scalar

But crucially, there doesn't exist a lazy scalar in polars, and I don't think there would be much appetite for adding one.

We could work around this by returning a LazyScalar. And we could give it a .collect method, for compatibility with the rest of polars. But LazyScalar isn't part of the polars docs, so users wouldn't know what to do with it. And if no .collect method is part of the Standard either, then concretely what could users do with such a LazyScalar?

This is why I'm suggesting that some methods raise NotImplementedError for lazy implementations, and that we should document for which ones that would be appropriate.
For example, we could say that DataFrame.sum must always be supported by any standard-compliant implementation, but that DataFrame.shape is allowed to raise for lazy engines
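
For instance, a hypothetical sketch of such a standard-compliant wrapper (PolarsStandardFrame is illustrative, not a real class):

import polars as pl

class PolarsStandardFrame:
    # Hypothetical wrapper around an eager or lazy polars frame.
    def __init__(self, df: pl.DataFrame | pl.LazyFrame) -> None:
        self._df = df

    def sum(self) -> "PolarsStandardFrame":
        # Always supported: stays lazy if the underlying frame is lazy.
        return PolarsStandardFrame(self._df.sum())

    @property
    def shape(self) -> tuple[int, int]:
        # Documented as allowed to raise for lazy engines.
        if isinstance(self._df, pl.LazyFrame):
            raise NotImplementedError("shape is not supported for lazy dataframes")
        return self._df.shape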

Comment for reference:

Lazyframes don't have a formalized shape. I don't think they should support [equality comparison]

pola-rs/polars#10274 (comment)

For the latter, what would the lazy object be? A lazy tuple?

This opens up a whole can of worms if we want to have lazy versions of future-materializable items. Lazy scalars? Does df.lazy().select('a')[0] return a materializable?

In polars that would raise (as it would in the dataframe standard - which, just for reference, is the repo this issue is in, just in case you came here by accident 😄 And if not, welcome, good to have you!)

Today's call touched on a lot of topics, including forced materialisation in operations used in control flow, user experience, and methods like collect

There wasn't a concrete conclusion (other than that this needs more discussion), but we did say we want to punt on it and get it working for eager libraries

I'd suggest that for now we at least document which methods might have implementation-dependent behaviour for lazy engines, so that if you want to write library-agnostic code, then as long as:

  • you know your inputs are going to be eager (say, because that's what you document that you support);
  • OR you only use the methods which are not marked as implementation-dependent for lazy engines.

you know your code will run fine

Something I did want to share, with regards to forced materialisation, was this post by Liam Brannigan: https://www.linkedin.com/feed/update/urn:li:activity:7079836009755996160?utm_source=share&utm_medium=member_desktop

One way to do this in Polars and Pandas is the .upsample method. This method inserts the missing time steps.

However, upsample isn't part of the lazy API in Polars and so we have to exit lazy mode to do upsample in eager mode.

But...we can do this another way where we stay in lazy mode and can still stream large datasets.

We do this by first defining a DataFrame with a time column that has all the time steps. We then do a left join with the gappy dataframe to get the data. With the rows added we can then handle the missing values - in this case by interpolating the values column.

The good news is that by re-thinking how we would do this in Pandas we can stay lazy and use streaming for large datasets.

By not implementing something in lazy mode, polars forced him to rethink his approach, and to write more efficient code.

If, conversely, polars had materialised the data under-the-hood for him, then he would have ended up writing less-efficient-than-possible code
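
For concreteness, here's a minimal sketch of that join-based approach (assuming a recent polars - pl.datetime_range may be pl.date_range in older releases - with illustrative data and column names):

import polars as pl
from datetime import datetime

# Hypothetical gappy data: the 01:00 and 02:00 time steps are missing.
gappy = pl.LazyFrame({
    "time": [datetime(2023, 1, 1, 0), datetime(2023, 1, 1, 3)],
    "value": [1.0, 4.0],
})

# A frame containing every time step, used as the left side of the join.
full = (
    pl.datetime_range(
        datetime(2023, 1, 1, 0), datetime(2023, 1, 1, 3), "1h", eager=True
    )
    .alias("time")
    .to_frame()
    .lazy()
)

# Left join to insert the missing rows, then interpolate - all while lazy.
result = (
    full.join(gappy, on="time", how="left")
    .with_columns(pl.col("value").interpolate())
    .collect()
)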

Also, I think DataFrame.shape() is not a good example - I don't see a reason why that cannot remain lazy. The reason bool(), int() et al. couldn't stay lazy is because the Python language itself will not allow returning anything other than actual bool/int/etc. instances. For .shape(), it's perfectly okay to return a lazy object.

I agree that Python "forced our hand" for bool and co. However, I'd still have preferred the current outcome even if Python hadn't forced us. The reason being that I think being able to write array/dataframe consuming code without having to differentiate lazy/eager (in control flow) is a big bonus. If you can get it.

Maybe shape() isn't a good example - maybe I'm not creative enough to think of use-cases for it (or similar things) where you would access it and not immediately want to use the actual value. You can and should discuss whether or not a user should be accessing it, because it causes computation, but users be users. Sometimes they want to do things even if they are costly.

First, thanks all for the productive discussions, both here and on yesterday's call 🙌 I'll add some thoughts, and summarise/re-iterate my position.

Update on equality comparisons in polars-lazy: they're explicitly forbidden and now raise an exception: pola-rs/polars#10274
Same considerations would apply to .shape / column comparisons and reductions / etc.
It's a very deliberate design decision not to support these in polars-lazy; it's not just something which hasn't been done yet. We can't easily work around these by returning our own lazy objects (though counterexamples are welcome - I'm ready to admit I might be wrong here)

So, where do we go with the polars implementation of the standard? Possible solutions which come to mind are:

    1. force computation for operations intentionally not supported by polars.LazyFrame
    2. add a .collect method to the standard, which could be a no-op for eager libraries
    3. document which parts of the standard may not be supported for lazy engines

To expand on why I'm currently against option 1:

First, it goes against Polars' very-well-thought-out design
Second, it's a performance footgun. Not allowing some lazy operations can "force" users to rethink their queries and write them in a more performant manner.
Third, it would break predictability of the standard: if I write a function function_which_transforms_dataframe(df: DataFrame) -> DataFrame using the standard, then I'll have no way of knowing whether the return value will be of the same type as the input. I might start with pl.LazyFrame but end up with pl.DataFrame.

On the other hand, both options 2 and 3 seem like a clear improvement over the current status quo, so I'd welcome them

I'm open to changing my mind though, happy to admit that I might be wrong

From the point of view of someone using the dataframe API to write a tool that consumes dataframes (lazy or eager) you don't want to have to distinguish in your code whether a dataframe is lazy or not.

Part of the Polars design philosophy is that you shouldn't accidentally trigger expensive computations. Calling .shape (or any method requiring materialization) on an eager dataframe is trivially cheap. Doing the same on a lazy dataframe could be extremely costly.

That's where .collect comes in - a way to explicitly opt-in to materialization and paying the associated cost.

Any consumer - be it a user or a library - should definitely care to distinguish whether a dataframe is lazy or not. If the functionality you are writing requires a materialized dataframe, I would suggest not accepting lazy dataframes as input.

Calling .shape (or any method requiring materialization) on an eager dataframe is trivially cheap.

Pointing out that the same is true for the width of a LazyFrame: perhaps we should allow that?

Thanks both for your inputs!
For anyone who hasn't met them yet - Stijn works on Polars full-time, and Marshall is a regular contributor / power user

Regarding number of columns - yes, and that's already allowed:

  • in polars: len(some_lazy_df.columns)
  • in the consortium's standard: len(some_lazy_df.get_column_names())

users be users. Sometimes they want to do things even if they are costly

Yup, totally agree - I'm all for allowing users to do costly things, so long as they intentionally opt-in to them

At the dataframe summit, I chatted about this with some people from other libraries. It looks like Dask actually raises here, rather than triggering computation:

In [1]: import pandas as pd
   ...: import dask.dataframe as dd
   ...:
   ...: pdf = pd.DataFrame({"x": [1, 2, 3], "y": 1})
   ...: df = dd.from_pandas(pdf, npartitions=2)
   ...: if (df.x.mean() > 0):
   ...:     # do something
   ...:     pass
   ...: else:
   ...:     # do something else
   ...:     pass
   ...:
---------------------------------------------------------------------------

TypeError: Trying to convert dd.Scalar<gt-2f5a..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.

They also advise calling .compute()
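
The explicit opt-in version of the above would look something like this (a sketch, reusing df from the session above):

if df.x.mean().compute() > 0:
    # .compute() materializes the mean, so the comparison is a plain Python bool
    pass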

So, for compatibility with both dask.dataframe and polars (lazy), I'm renewing my suggestion of not forcing computation on bool(scalar) and instead adding DataFrame.collect to the Standard

Maybe shape() isn't a good example, maybe I'm not creative enough to think of use cases for it (or similar things) where you would access it and not immediately want to use the actual value.

We are working on building a lazy data frame library based on ONNX in parallel to the lazy array library mentioned in data-apis/array-api#642. The point of ONNX is to create and serialize a computational graph ahead of time; i.e. before any eager values are at all available. It is therefore impossible for our use case to ever eagerly compute anything even if we wanted to.

Using our library, it would be reasonable (albeit a bit contrived) to create a computational graph (and export it to an ONNX model) that ultimately outputs the shape of a data frame. The same would apply to other scalar values such as df.x.mean() > 0 from the above example.

The essence of this comment is that we need the abstraction of lazy/duck-typed shape-tuples and scalars in the standard in order to cover our use case. I think many of the points raised in data-apis/array-api#642 carry over to this discussion, including the point that bool(df.x.mean() > 0) should be allowed to raise in standard-compliant implementations.
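
In such an implementation, the raising behaviour could be as simple as (LazyScalar is hypothetical):

class LazyScalar:
    # Hypothetical node in a computational graph: it has no concrete value
    # until the graph is executed elsewhere (e.g. by an ONNX runtime).
    def __bool__(self) -> bool:
        raise TypeError(
            "This scalar is lazily evaluated and has no concrete value; "
            "it cannot be used in control flow such as 'if' statements."
        )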

Thanks all for discussions

Can we agree that the dataframe api should be kind of like a "zero-cost abstraction"? In the sense that if you can write something using the standard api, then it should be just as performant as if you'd used the original library.

Because I don't think the current design of the standard api allows that. Here's an example I have in mind:

def my_agnostic_plotting_function(df):
    df = df.__dataframe_consortium_standard__()
    df = (
        df.big_computation_1()
        .big_computation_2()
        .big_computation_3()
    )
    plot_df(df)

If I was writing a library which accepts a polars DataFrame or LazyFrame, I would write it as follows:

def my_polars_plotting_function(df):
    if isinstance(df, pl.DataFrame):
        df = df.lazy()
    df = (
        df.big_computation_1()
        .big_computation_2()
        .big_computation_3()
        .collect()
    )
    plot_df(df)

which would be more performant. This is because it would still use query optimisation for the chaining of big_computation_1, big_computation_2, and big_computation_3.
If we have neither DataFrame.lazy nor DataFrame.collect in the standard, then I can't write my code as efficiently as I could have if I'd used polars directly - and I think that'd be a pity

On the other hand, if DataFrame.lazy and DataFrame.collect were available, I could just have written

def my_agnostic_function(df):
    df = df.__dataframe_consortium_standard__()
    df = (
        df.lazy()
        .big_computation_1()
        .big_computation_2()
        .big_computation_3()
        .collect()
    )
    plot_df(df)

I discussed collect with @betatim at EuroScipy, who said he'd be OK with it being added and being a no-op for eager engines, so he wouldn't need to special-case his code. Tim - please do correct me if I'm misquoting you or not accurately representing your position, thanks 🙏
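
A sketch of what that no-op could look like for an eager backend (EagerStandardFrame is hypothetical):

class EagerStandardFrame:
    # Hypothetical standard-compliant wrapper around an eager library.
    def __init__(self, df) -> None:
        self._df = df

    def lazy(self) -> "EagerStandardFrame":
        return self  # no-op: eager engines evaluate immediately anyway

    def collect(self) -> "EagerStandardFrame":
        return self  # no-op: the data is already materialized

That way the df.lazy()...collect() pattern above runs unchanged on eager engines.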

I think as a dataframe consuming library (like scikit-learn) it would be annoying to have to write different code for lazy and eager dataframes. So the thing I care about is being able to write code that works with both. I have no (educated) opinion of whether this means an eager dataframe should behave like a lazy one (aka you call .collect() and it is a no-op) or the other way around.

For my education: If all dataframes behave as if they were lazy, why do we need the df.lazy() in the example?

Thanks for explaining

In that case, because the user might pass an eager dataframe to your library. If you wanted to do several computations on it, it would be more efficient to do them lazily. So you either:

  • only allow lazy dataframes to be passed to your function (but we have no way of specifying that in the standard)
  • do something like .lazy / .collect in your code
  • write your code in a less-than-ideal manner

The other discussion point was what to do about lazy columns / column reductions. I have a suggestion here which aims to address that: #229

In that case, because the user might pass an eager dataframe to your library.

My expectation would be that df_ = df.__dataframe_consortium_standard__() means df_ is a lazy dataframe (or something that behaves like one), no matter what the user passed in. If the mantra is "treat them all as lazy", it seems unnecessary to have:

df = df.__dataframe_consortium_standard__()
df = df.lazy()

I'd always put these two lines together, so might as well combine them
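
A sketch of what that combined entry point could do (StandardFrame is a hypothetical wrapper class):

def __dataframe_consortium_standard__(self):
    # Hypothetical: always hand the consumer a lazy(-behaving) frame,
    # no matter whether an eager or a lazy dataframe was passed in.
    return StandardFrame(self).lazy()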

Hmm, I do like the sound of that! And namespace.dataframe_from_sequence should probably be lazy too

We may have a way forwards - or at least, a proposal. See #249

I'll just raise in the pandas/polars implementations, no big deal - alternative suggestions are welcome