e10v/tea-tasting

Help moving away from the pandas backend

Closed this issue · 8 comments

๐Ÿ‘‹๐Ÿป Hi there!

I lead the Ibis project, and one of my colleagues @ncclementi pointed out your use of the Pandas backend.

Is there any chance we can convince you to move away from it or help clarify anything that would make it easier to swap in support for something else like DuckDB?

e10v commented

Hi!

For the main use cases, tea-tasting supports different data backends including both DuckDB and Pandas. But there are indeed two cases when I use the Pandas backend specifically:

  1. Generating example datasets. The result can be either a Pandas DataFrame or an Ibis Table with the Pandas backend.
  2. Downloading data from a user source into Python. Here I don't actually use the Pandas backend but rather serialize the data as a Pandas DataFrame.

The main reason I do this is that most users are still more familiar with Pandas:

  1. If they want to play with an example dataset, they can use the Pandas API, which is more familiar to them.
  2. tea-tasting allows users to define custom metrics. If a metric is based on granular data, they need to define a method that accepts granular data and returns the analysis results. Again, since users are familiar with the Pandas API, it is easier for them to define such a method.

Another reason is that the example dataset is generated using NumPy, and it's just easy to create a Pandas DataFrame from NumPy arrays :) But I guess this problem can be solved.

It would help if you explained the reason why you'd like me to move from the Pandas backend. Do you plan to drop its support in Ibis?

I also would like to understand if you'd like me to move from Pandas only when I'm using Ibis Table or you want me to replace serialization to Pandas DataFrame as well.

Thank you!

> It would help if you explained the reason why you'd like me to move from the Pandas backend. Do you plan to drop its support in Ibis?

Yep, we've got a blog post on it that we haven't yet put out on the airwaves. The rationale is explained there.

The dask and pandas backends are deprecated in 9.4 (usable, but you'll get a warning), and we plan to remove them entirely in 10.0.

TL;DR: the pandas backend is strictly worse in terms of functionality and hugely worse in terms of performance versus our other backends that support executing queries against Pandas DataFrames (duckdb, polars, and datafusion).

> I also would like to understand if you'd like me to move from Pandas only when I'm using Ibis Table or you want me to replace serialization to Pandas DataFrame as well.

Pandas DataFrames are still supported as input and output (what you're calling serialization), that's not going away.

What's going away is the ability to do things like this:

import ibis

con = ibis.pandas.connect(...)  # this will be removed entirely in 10.0
con.execute(ibis_expr)  # ibis_expr stands for any Ibis expression

Instead, you should simply call ibis.memtable on your pandas DataFrames and use that with whatever backend you want.

Can you point me to where you're using the pandas backend explicitly?

I'll submit a PR that removes the pandas backend use and we can discuss there!

e10v commented

> Yep, we've got a blog post on it that we haven't yet put out on the airwaves. The rationale is explained there.

Oh, thank you, I missed it. I guess I have no choice :) Thank you for letting me know!

> Pandas DataFrames are still supported as input and output (what you're calling serialization), that's not going away.

Good to know. For the moment, I will continue to use it for output.

> Can you point me to where you're using the pandas backend explicitly?

I see you've already found it. But the PR doesn't pass some tests as DuckDB is not in the dependencies yet.

I also want to support Pandas as an input. So, I will have to change the code in other modules. If you don't mind, I will handle it myself. It's just easier for me as I know what I need to change.

> If you don't mind, I will handle it myself. It's just easier for me as I know what I need to change.

Roger that! I'll close out my PR.

> I also want to support Pandas as an input.

Yep, hopefully it was clear from my message, but I'll reiterate it: supporting pandas as an input and as an output is definitely supported in Ibis as well. Neither of those is going away.

e10v commented

> Yep, hopefully it was clear from my message, but I'll reiterate it: supporting pandas as an input and as an output is definitely supported in Ibis as well. Neither of those is going away.

Yes, it was clear, thank you.

e10v commented

Done: #90

It will be available in the next release.

Found some bugs in tests as well 🙈