This is a Docker Compose environment to quickly get up and running with Spark and a local Iceberg catalog. It uses a Postgres database as a JDBC catalog.
Note: If you don't have Docker installed, you can head over to the [Get Docker](https://docs.docker.com/get-docker/) page for installation instructions.
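Under the hood, the Spark container is configured with an Iceberg catalog backed by the Postgres database. As a rough illustration, the wiring in `spark-defaults.conf` typically looks like the sketch below. The property keys come from Iceberg's Spark and JDBC catalog documentation, but the catalog name (`demo`), credentials, and paths are illustrative assumptions, not copied from this repo's config.

```properties
# Illustrative sketch -- catalog name, credentials, and paths are assumptions.
spark.sql.extensions                  org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo                org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.catalog-impl   org.apache.iceberg.jdbc.JdbcCatalog
spark.sql.catalog.demo.uri            jdbc:postgresql://postgres:5432/demo_catalog
spark.sql.catalog.demo.jdbc.user      admin
spark.sql.catalog.demo.jdbc.password  password
spark.sql.catalog.demo.warehouse      /home/iceberg/warehouse
```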
Start up the notebook server by running the following:

```sh
docker-compose up
```
The notebook server will then be available at http://localhost:8888
While the notebook server is running, you can use any of the following commands if you prefer `spark-shell`, `spark-sql`, or `pyspark`:
```sh
docker exec -it spark-iceberg spark-shell
docker exec -it spark-iceberg spark-sql
docker exec -it spark-iceberg pyspark
```
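For example, once inside `pyspark` you can run a quick smoke test against the Iceberg catalog. This is a minimal sketch; the catalog name `demo` and namespace `nyc` are assumptions, so substitute whatever catalog name this environment actually configures.

```python
# Minimal smoke test from a pyspark shell (the `spark` session already exists).
# NOTE: the catalog name `demo` and namespace `nyc` are assumptions.
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.nyc")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.nyc.taxis (
        vendor_id BIGINT,
        trip_distance DOUBLE
    ) USING iceberg
""")
spark.sql("INSERT INTO demo.nyc.taxis VALUES (1, 2.5), (2, 7.1)")
spark.sql("SELECT * FROM demo.nyc.taxis").show()
```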
To stop everything, just run `docker-compose down`.
To reset the catalog and data, remove the `postgres` and `warehouse` directories:
```sh
docker-compose down && docker-compose kill && rm -rf ./postgres && rm -rf ./warehouse
```
The prebuilt Spark image is uploaded to Docker Hub. For convenience, the image tag defaults to `latest`.
If you have an older version of the image, you might need to remove it to upgrade:

```sh
docker image rm tabulario/spark-iceberg && docker-compose pull
```
To use the Dockerfile in this repo directly (as opposed to pulling the hosted `tabulario/spark-iceberg` image), change the following line in any of the docker-compose files:
```diff
- image: tabulario/spark-iceberg
+ build: spark/
```
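After that change, the relevant part of the compose file would look roughly like the following; the service name and surrounding structure are assumptions about this repo's compose files.

```yaml
services:
  spark-iceberg:
    build: spark/  # build the image from the local Dockerfile instead of pulling
```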
To deploy changes to the hosted Docker image `tabulario/spark-iceberg`, run the following (requires access to the tabulario Docker Hub account):
```sh
cd spark
docker buildx build -t tabulario/spark-iceberg --platform=linux/amd64,linux/arm64 . --push
```
For more information on getting started with Iceberg, check out the Getting Started guide in the official docs.
The repository for the Docker image is located on [Docker Hub](https://hub.docker.com/r/tabulario/spark-iceberg).