ibis-project/ibis

Poor performance when writing deltalake

djouallah opened this issue · 9 comments

What happened?

I am comparing the performance of some Python engines for an ETL scenario, and Ibis performance is very problematic. I attached a reproducible example; the data is stored in Fabric OneLake and the code is run in a single-node Spark notebook:
https://github.com/djouallah/Light_ETL_Challenge/blob/main/Light_ETL_Challenge.ipynb

What version of ibis are you using?

ibis_framework-9.1.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

@djouallah thanks for reporting this. we are looking into it and will get back to you shortly.

Changed the title to reflect the actual problem here. The difference in performance between duckdb native and Ibis is reducible to the to_delta call.

The to_delta method uses record batches to reduce the likelihood that the write call runs out of RAM. Passing record batches directly from duckdb appears to be anywhere from 2-5x slower, depending on which engine is passed to write_deltalake (engine="pyarrow" tends to be faster than engine="rust").

I'll open up an issue on the DuckDB tracker about this.

It is not. I changed the code to use an Arrow table and it is still 2x slower than duckdb.

I don't see this locally.

image

Did you run this multiple times and take the minimum duration?
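For reference, here is a minimal sketch of what "take the minimum duration" could look like, using only the standard library. `min_duration` and `sample_workload` are hypothetical names for illustration, not anything from the notebook; the workload is a stand-in for whatever ETL step is being measured.

```python
import timeit


def min_duration(fn, repeat=5):
    """Call fn() `repeat` times and return the fastest wall-clock time in seconds.

    The minimum is less sensitive to noise (cold caches, other processes)
    than a single run or the mean.
    """
    # number=1 means each timing covers exactly one call of fn
    timings = timeit.repeat(fn, repeat=repeat, number=1)
    return min(timings)


def sample_workload():
    # Hypothetical stand-in for the step being benchmarked
    return sum(i * i for i in range(100_000))


best = min_duration(sample_workload)
print(f"best of 5 runs: {best:.4f}s")
```

Swapping `sample_workload` for a function that wraps the DuckDB or Ibis pipeline would give comparable numbers for both, assuming each run starts from the same state (for example, writing to a fresh output path).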

You should also change that loop over ibis columns to happen in a single call

DUNIT = DUNIT.cast({col_name: "double" for col_name in df_cols})

instead of

for col_name in df_cols:
    DUNIT = DUNIT.cast({col_name: "double"})

There's some overhead to the latter, but it should be in the range of 10-15%, not 100% 😅

I suspect the spark runtime is doing something here !!

I don't think Spark is involved at all in the comparison between DuckDB and Ibis, unless I'm missing something.

Turns out the issue is with ipython. Run it with runtime version 1.3 and the performance is nearly the same as duckdb.

image

https://learn.microsoft.com/en-us/fabric/data-engineering/runtime-1-3

Going to close this out, thanks all!