pyconhk-2024

https://duckdb.org/docs/extensions/tpch.html

Installtion

Install duckdb:

brew install duckdb
duckdb

export JAVA_HOME=/usr/libexec/java_home -v 1.8 # Need to run this on M2

run it with Docker to avoid Java installation on M2. docker run -it -v "$(pwd)":/opt/spark/app bitnami/spark:3.3 /bin/bash

Create the TPCH benchmark:

INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 1);

You can change the scaling factor to a larger number for heavier workload.

Why ibis?

If you are Python User:
If you are SQL User, write once, run everywhere.
Efficient testing (Spark in production, duckdb for local test)
Simply more efficient, and fast

Datalake, Composable Open Data Stack

Storage decoupled (Spark, DuckDB, pandas) can all read Parquet.
Queries can be translated between different engines.

Reference

Other benefits: