Poor performance when writing deltalake
djouallah opened this issue · 9 comments
What happened?
I am comparing the performance of some Python engines for an ETL scenario, and Ibis's performance is very problematic. I attached a reproducible example; the data is stored in Fabric OneLake and the code runs in a single-node Spark notebook.
https://github.com/djouallah/Light_ETL_Challenge/blob/main/Light_ETL_Challenge.ipynb
What version of ibis are you using?
ibis_framework-9.1.0
What backend(s) are you using, if any?
DuckDB
Relevant log output
No response
Code of Conduct
- I agree to follow this project's Code of Conduct
@djouallah thanks for reporting this. We are looking into it and will get back to you shortly.
Changed the title to reflect the actual problem here. The difference in performance between DuckDB native and Ibis is reducible to the `to_delta` call.

The `to_delta` method uses record batches to reduce the likelihood that the write call runs out of RAM. Passing record batches directly from DuckDB appears to be anywhere from 2-5x slower, depending on which engine is passed to `write_deltalake` (`engine="pyarrow"` tends to be faster than `engine="rust"`).
I'll open up an issue on the DuckDB tracker about this.
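For reference, a minimal sketch of the two write paths being compared; the query, the target paths, and the `engine` choices are placeholders rather than the notebook's actual code, and method names may vary slightly across duckdb/deltalake versions:

```python
import duckdb
from deltalake import write_deltalake

con = duckdb.connect()

# Streaming path: hand write_deltalake a RecordBatchReader, which is
# roughly what ibis's to_delta does to keep memory usage bounded.
con.execute("SELECT * FROM range(1000000)")
reader = con.fetch_record_batch()
write_deltalake("./delta_streamed", reader, mode="overwrite", engine="pyarrow")

# Materialized path: pull the full result into an in-memory pyarrow Table
# first, then hand the whole table to write_deltalake.
tbl = con.sql("SELECT * FROM range(1000000)").arrow()
write_deltalake("./delta_materialized", tbl, mode="overwrite", engine="rust")
```

Timing the two paths (and swapping the `engine` argument) is what produces the 2-5x gap described above.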
It is not. I changed the code to use an Arrow table and it is still 2x slower than DuckDB.
You should also change that loop over Ibis columns to happen in a single call:

```python
DUNIT = DUNIT.cast({col_name: "double" for col_name in df_cols})
```

instead of

```python
for col_name in df_cols:
    DUNIT = DUNIT.cast({col_name: "double"})
```

There's some overhead to the latter, but it should be in the range of 10-15%, not 100%.
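As a self-contained illustration, a small sketch where the table and column names are made up, standing in for `DUNIT` and `df_cols` from the notebook:

```python
import ibis

# Toy string-typed table standing in for the notebook's DUNIT table.
t = ibis.memtable({"a": ["1", "2", "3"], "b": ["4.0", "5.0", "6.0"]})
df_cols = ["a", "b"]

# Build one cast expression covering every target column, rather than
# rebuilding the table once per column inside a Python loop.
t = t.cast({col_name: "double" for col_name in df_cols})
```

Both forms produce the same result; the single call just avoids constructing an intermediate table expression for each column.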
I suspect the Spark runtime is doing something here!
I don't think Spark is involved at all in the comparison between DuckDB and Ibis, unless I'm missing something.
Turns out the issue is with IPython; run it with runtime version 1.3 and the performance is nearly the same as DuckDB.
https://learn.microsoft.com/en-us/fabric/data-engineering/runtime-1-3
Going to close this out, thanks all!