Poor performance when writing deltalake
djouallah opened this issue · 9 comments
What happened?
I am comparing the performance of some Python engines for an ETL scenario, and Ibis's performance is very problematic. I attached a reproducible example; the data is stored in Fabric OneLake and the code runs in a single-node Spark notebook.
https://github.com/djouallah/Light_ETL_Challenge/blob/main/Light_ETL_Challenge.ipynb
What version of ibis are you using?
ibis_framework-9.1.0
What backend(s) are you using, if any?
DuckDB
Relevant log output
No response
Code of Conduct
- I agree to follow this project's Code of Conduct
@djouallah thanks for reporting this. We are looking into it and will get back to you shortly.
Changed the title to reflect the actual problem here. The difference in performance between DuckDB native and Ibis is reducible to the `to_delta` call.

The `to_delta` method uses record batches to reduce the likelihood that the write call runs out of RAM. Passing record batches directly from DuckDB appears to be anywhere from 2-5x slower, depending on which engine is passed to `write_deltalake` (`engine="pyarrow"` tends to be faster than `engine="rust"`).
I'll open up an issue on the DuckDB tracker about this.
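For reference, a minimal sketch of the two write paths being compared; the query, the target paths, and the `engine` choices are placeholders rather than the notebook's actual code, and method names may vary slightly across duckdb/deltalake versions:

```python
import duckdb
from deltalake import write_deltalake

con = duckdb.connect()

# Streaming path: hand write_deltalake a RecordBatchReader, which is
# roughly what ibis's to_delta does to keep memory usage bounded.
con.execute("SELECT * FROM range(1000000)")
reader = con.fetch_record_batch()
write_deltalake("./delta_streamed", reader, mode="overwrite", engine="pyarrow")

# Materialized path: pull the full result into an in-memory pyarrow Table
# first, then hand the whole table to write_deltalake.
tbl = con.sql("SELECT * FROM range(1000000)").arrow()
write_deltalake("./delta_materialized", tbl, mode="overwrite", engine="rust")
```

Timing the two paths (and swapping the `engine` argument) is what produces the 2-5x gap described above.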
It is not. I changed the code to use an Arrow table and it is still 2x slower than DuckDB.
You should also change that loop over Ibis columns to happen in a single call:

```python
DUNIT = DUNIT.cast({col_name: "double" for col_name in df_cols})
```

instead of

```python
for col_name in df_cols:
    DUNIT = DUNIT.cast({col_name: "double"})
```

There's some overhead to the latter, but it should be in the range of 10-15%, not 100%.
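As a self-contained illustration, a small sketch where the table and column names are made up, standing in for `DUNIT` and `df_cols` from the notebook:

```python
import ibis

# Toy string-typed table standing in for the notebook's DUNIT table.
t = ibis.memtable({"a": ["1", "2", "3"], "b": ["4.0", "5.0", "6.0"]})
df_cols = ["a", "b"]

# Build one cast expression covering every target column, rather than
# rebuilding the table once per column inside a Python loop.
t = t.cast({col_name: "double" for col_name in df_cols})
```

Both forms produce the same result; the single call just avoids constructing an intermediate table expression for each column.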
I suspect the Spark runtime is doing something here!
I don't think Spark is involved at all in the comparison between DuckDB and Ibis, unless I'm missing something.
Turns out the issue is with IPython; run it with runtime version 1.3 and the performance is nearly the same as DuckDB.
https://learn.microsoft.com/en-us/fabric/data-engineering/runtime-1-3
Going to close this out, thanks all!