duckdb/duckdb_iceberg

Question: how to speed up iceberg reads

michaelzwong opened this issue · 1 comment

I'm currently reading from my glue-cataloged iceberg table using the following:

duckdb.sql(
    f"""
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region = 'us-west-2';
    SET s3_access_key_id = '{settings.AWS_ACCESS_KEY_ID}';
    SET s3_secret_access_key = '{settings.AWS_SECRET_ACCESS_KEY}';
    INSTALL iceberg;
    LOAD iceberg;
    """
)
res = duckdb.execute(
    "SELECT * FROM iceberg_scan('s3://foopath') LIMIT 100"
)

The execution is very slow compared to reading the .parquet files at the same path directly (e.g. 2 minutes vs. 2 seconds):

res = duckdb.execute(
    "SELECT * FROM parquet_scan('s3://foopath/*.parquet') LIMIT 100"
)

I'd like to know what I'm doing wrong, or whether someone has a solution.

Hi,

I would first suggest executing the query with EXPLAIN ANALYZE and posting the results here.
The cause might be issue #2, where more Parquet files are scanned than necessary.