This is Lakekeeper: A Rust-native implementation of the Apache Iceberg REST Catalog specification based on apache/iceberg-rust.
If you have questions, feature requests or just want a chat, we are hanging around in Discord!
The catalog is evolving quickly. Internal Rust APIs in particular are not stable and subject to change. External REST APIs are kept as stable as possible; the /catalog API in particular is stable as of today.
Our next milestones are:
- Release 0.4.0 in early October, including anticipated features such as GCP support, soft-deletion, security improvements and caching
- Release 0.5.0 at the end of October / early November, including authorization, a UI and various management APIs. From this release onwards, we also consider the management APIs stable.
- Following release 0.5.0, we will focus on documentation and improved usability
The Iceberg Catalog REST interface has become the standard for catalogs in open Lakehouses. It natively enables multi-table commits, server-side deconflicting and much more. It is figuratively the tip of the Iceberg.
We started this implementation because we were missing customizability, support for on-premise deployments and other features that are important to us in existing Iceberg Catalogs. Some of the focus areas of this implementation are:
- Customizable: Our implementation is meant to be extended. We expose the Database implementation, Secrets, Authorization, EventPublishing and ContractValidation as interfaces (Traits). This allows you to tap into any access management system of your company or stream change events to any system you like - simply by implementing a handful of methods. Please find more details in the Customization Guide.
- Change Events: Built-in support for emitting change events (CloudEvents), which enables you to react to any change that happens to your tables.
- Change Approval: Changes can also be prohibited by external systems. This can be used to prevent changes to tables that would invalidate Data Contracts, Quality SLOs etc. Simply integrate your own change approval via our `ContractVerification` trait.
- Multi-Tenant capable: A single deployment of our catalog can serve multiple projects - all with a single entrypoint. All Iceberg and Warehouse configurations are completely separated between Warehouses.
- Written in Rust: Single 30 MB all-in-one binary - no JVM or Python env required.
- Storage Access Management: Built-in S3-Signing that enables support for self-hosted as well as AWS S3 WITHOUT sharing S3 credentials with clients. We are also working on vended credentials!
- Well-Tested: Integration-tested with `spark` and `pyiceberg` (support for S3 with this catalog starting with pyiceberg 0.7.0).
- Highly Available & Horizontally Scalable: There is no local state - the catalog can be scaled horizontally and updated without downtime.
- OpenID provider integration: Use your own identity provider to secure access to the APIs - just set `ICEBERG_REST__OPENID_PROVIDER_URI` and you are good to go.
- Fine Grained Access (FGA) (Coming soon): Simple role-based access control is not enough for many rapidly evolving Data & Analytics initiatives. We are leveraging OpenFGA, based on Google's Zanzibar paper, to implement authorization. If your company already has a different system in place, you can integrate with it by implementing a handful of methods in the `AuthZHandler` trait.
Below is an overview of currently supported features. Please also check the Issues if you are missing something.
A Docker Container is available on quay.io.
We have prepared a self-contained docker-compose file to demonstrate the usage of spark
with our catalog:
git clone https://github.com/hansetag/iceberg-catalog.git
cd iceberg-catalog/examples/self-contained
docker compose up
Then open your browser and head to `localhost:8888`.
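If you prefer to sanity-check the catalog itself from the command line, you can query the standard Iceberg REST config endpoint. This is just a sketch: the host, port and warehouse name below are assumptions and may differ from (or not be exposed by) the example compose file.

```bash
# Query the Iceberg REST config endpoint of the catalog.
# Host/port and the warehouse name are placeholders - adjust to your setup.
curl "http://localhost:8181/catalog/v1/config?warehouse=demo-warehouse"
```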
For more information on deployment, please check the User Guide.
Details on how to configure the storage profiles can be found in the Storage Guide.
Catalog backend:

Backend | Status | Comment |
---|---|---|
Postgres | | |
MongoDB | | |
Secrets backend:

Backend | Status | Comment |
---|---|---|
Postgres | | |
kv2 (hcp-vault) | | userpass auth |
Event store backend:

Backend | Status | Comment |
---|---|---|
Nats | | |
Kafka | | |
Management operations:

Operation | Status | Description |
---|---|---|
Warehouse Management | | Create / Update / Delete a Warehouse |
AuthZ | | Manage access to warehouses, namespaces and tables |
More to come! | | |
The iceberg-rest server can host multiple independent warehouses, which are grouped into projects. The overall structure looks like this:
<project-1-uuid>/
├─ foo-warehouse
├─ bar-warehouse
<project-2-uuid>/
├─ foo-warehouse
├─ bas-warehouse
All warehouses use isolated namespaces and can be configured in clients by specifying `warehouse` as `'<project-uuid>/<warehouse-name>'`. Warehouse names inside projects must be unique. We recommend using human-readable names for warehouses.
If you do not need the hierarchy level of projects, set the `ICEBERG_REST__DEFAULT_PROJECT_ID` environment variable to the project you want to use. For single-project deployments we recommend using the NULL UUID (`00000000-0000-0000-0000-000000000000`) as the project-id. Users then just specify `warehouse` as `<warehouse-name>` when connecting.
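As a sketch, connecting Spark to a warehouse could look like the following. Host, port, project UUID, warehouse name and the Iceberg Spark runtime version are placeholders and must be adjusted to your deployment:

```bash
# Start spark-sql against a warehouse of this catalog.
# All values below (host/port, project UUID, warehouse name, package version)
# are placeholders, not guaranteed defaults.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.0 \
  --conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.demo.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
  --conf spark.sql.catalog.demo.uri=http://localhost:8181/catalog \
  --conf spark.sql.catalog.demo.warehouse=00000000-0000-0000-0000-000000000000/foo-warehouse
```

With `ICEBERG_REST__DEFAULT_PROJECT_ID` set, the `warehouse` value shortens to just `foo-warehouse`.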
When a table or view is dropped, it is not immediately deleted from the catalog. Instead, it is marked as dropped and a job for its cleanup is scheduled. The table, including its data if `purgeRequested=True`, is then deleted after the configured `ICEBERG_TABULAR_EXPIRATION_DELAY_SECONDS` (default: 7 days) have passed. This allows recovering tables that have been dropped by accident.
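For illustration, a drop with a purge request via the REST API could look like the sketch below. The `purgeRequested` query parameter is part of the Iceberg REST specification; the host, warehouse prefix, namespace and table name are placeholders.

```bash
# Soft-delete a table and request that its data is purged once the
# expiration delay has passed. Host, prefix, namespace and table name
# are placeholders for your deployment.
curl -X DELETE \
  "http://localhost:8181/catalog/v1/<warehouse-prefix>/namespaces/my_namespace/tables/my_table?purgeRequested=true"
```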
The basic setup of the Catalog is configured via environment variables. As this catalog supports a multi-tenant setup, each catalog ("warehouse") also comes with its own configuration options, including its Storage Configuration. The documentation of the Management-API for warehouses is hosted at the unprotected `/swagger-ui` endpoint.
The following options are global and apply to all warehouses:
Currently, the catalog uses two task queues: one to ultimately delete soft-deleted tabulars and another to purge tabulars which have been deleted with the `purgeRequested=True` query parameter. The task queues are configured as follows:
Variable | Example | Description |
---|---|---|
`ICEBERG_REST__QUEUE_CONFIG__MAX_RETRIES` | 5 | Number of retries before a task is considered failed. Default: 5 |
`ICEBERG_REST__QUEUE_CONFIG__MAX_AGE` | 3600 | Number of seconds before a task is considered stale and could be picked up by another worker. Default: 3600 |
`ICEBERG_REST__QUEUE_CONFIG__POLL_INTERVAL` | 10 | Number of seconds between polling for new tasks. Default: 10 |
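A minimal sketch of overriding the queue settings via environment variables (the values shown simply restate the defaults from the table above):

```bash
# Task queue tuning - these exports restate the documented defaults.
export ICEBERG_REST__QUEUE_CONFIG__MAX_RETRIES=5
export ICEBERG_REST__QUEUE_CONFIG__MAX_AGE=3600
export ICEBERG_REST__QUEUE_CONFIG__POLL_INTERVAL=10
```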
The queues are currently implemented using the `sqlx` Postgres backend. If you want to use a different backend, you need to implement the `TaskQueue` trait.
Configuration parameters if Postgres is used as a backend: you may either provide connection strings or use the `PG_*` environment variables; connection strings take precedence:
Variable | Example | Description |
---|---|---|
`ICEBERG_REST__PG_DATABASE_URL_READ` | `postgres://postgres:password@localhost:5432/iceberg` | Postgres database connection string used for reading |
`ICEBERG_REST__PG_DATABASE_URL_WRITE` | `postgres://postgres:password@localhost:5432/iceberg` | Postgres database connection string used for writing |
`ICEBERG_REST__PG_ENCRYPTION_KEY` | `<This is unsafe, please set a proper key>` | If `ICEBERG_REST__SECRET_BACKEND=postgres`, this key is used to encrypt secrets. It is required to change this for production deployments. |
`ICEBERG_REST__PG_READ_POOL_CONNECTIONS` | `10` | Number of connections in the read pool |
`ICEBERG_REST__PG_WRITE_POOL_CONNECTIONS` | `5` | Number of connections in the write pool |
`ICEBERG_REST__PG_HOST_R` | `localhost` | Hostname for read operations |
`ICEBERG_REST__PG_HOST_W` | `localhost` | Hostname for write operations |
`ICEBERG_REST__PG_PORT` | `5432` | Port number |
`ICEBERG_REST__PG_USER` | `postgres` | Username for authentication |
`ICEBERG_REST__PG_PASSWORD` | `password` | Password for authentication |
`ICEBERG_REST__PG_DATABASE` | `iceberg` | Database name |
`ICEBERG_REST__PG_SSL_MODE` | `require` | SSL mode (disable, allow, prefer, require) |
`ICEBERG_REST__PG_SSL_ROOT_CERT` | `/path/to/root/cert` | Path to SSL root certificate |
`ICEBERG_REST__PG_ENABLE_STATEMENT_LOGGING` | `true` | Enable SQL statement logging |
`ICEBERG_REST__PG_TEST_BEFORE_ACQUIRE` | `true` | Test connections before acquiring from the pool |
`ICEBERG_REST__PG_CONNECTION_MAX_LIFETIME` | `1800` | Maximum lifetime of connections in seconds |
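A minimal sketch of a Postgres configuration using connection strings (all values are placeholders; generating the encryption key with `openssl` is just one option, not a requirement of the catalog):

```bash
# Minimal Postgres configuration via connection strings.
# Connection strings take precedence over the individual PG_* variables.
export ICEBERG_REST__PG_DATABASE_URL_READ="postgres://postgres:password@localhost:5432/iceberg"
export ICEBERG_REST__PG_DATABASE_URL_WRITE="postgres://postgres:password@localhost:5432/iceberg"
# Only needed if ICEBERG_REST__SECRET_BACKEND=postgres - never keep the default key in production.
export ICEBERG_REST__PG_ENCRYPTION_KEY="$(openssl rand -base64 32)"
```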
Configuration parameters if a KV2 compatible storage is used as a backend. Currently, we only support the `userpass` authentication method. You may provide the envs as single values like `ICEBERG_REST__KV2__URL=http://vault.local` etc. or as a compound value like:
`ICEBERG_REST__KV2='{url="http://localhost:1234", user="test", password="test", secret_mount="secret"}'`
Variable | Example | Description |
---|---|---|
`ICEBERG_REST__KV2__URL` | `https://vault.local` | URL of the KV2 backend |
`ICEBERG_REST__KV2__USER` | `admin` | Username to authenticate against the KV2 backend |
`ICEBERG_REST__KV2__PASSWORD` | `password` | Password to authenticate against the KV2 backend |
`ICEBERG_REST__KV2__SECRET_MOUNT` | `kv/data/iceberg` | Path to the secret mount in the KV2 backend |
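The same configuration expressed in the single-value style (values are placeholders taken from the examples above):

```bash
# KV2 (HashiCorp Vault) secret backend using userpass authentication.
# Depending on your deployment you may also need to select the secret
# backend itself (ICEBERG_REST__SECRET_BACKEND) - check the global options.
export ICEBERG_REST__KV2__URL="https://vault.local"
export ICEBERG_REST__KV2__USER="admin"
export ICEBERG_REST__KV2__PASSWORD="password"
export ICEBERG_REST__KV2__SECRET_MOUNT="kv/data/iceberg"
```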
If you want the server to publish events to a NATS server, set the following environment variables:
Variable | Example | Description |
---|---|---|
`ICEBERG_REST__NATS_ADDRESS` | `nats://localhost:4222` | The URL of the NATS server to connect to |
`ICEBERG_REST__NATS_TOPIC` | `iceberg` | The subject to publish events to |
`ICEBERG_REST__NATS_USER` | `test-user` | User to authenticate against NATS, requires `ICEBERG_REST__NATS_PASSWORD` |
`ICEBERG_REST__NATS_PASSWORD` | `test-password` | Password to authenticate against NATS, requires `ICEBERG_REST__NATS_USER` |
`ICEBERG_REST__NATS_CREDS_FILE` | `/path/to/file.creds` | Path to a file containing NATS credentials |
`ICEBERG_REST__NATS_TOKEN` | `xyz` | NATS token to authenticate against the server |
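A sketch of a user/password NATS configuration (values are placeholders; use either user/password, a credentials file, or a token as documented above):

```bash
# Publish change events to NATS using user/password authentication.
export ICEBERG_REST__NATS_ADDRESS="nats://localhost:4222"
export ICEBERG_REST__NATS_TOPIC="iceberg"
export ICEBERG_REST__NATS_USER="test-user"
export ICEBERG_REST__NATS_PASSWORD="test-password"
```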
If you want to limit access to the API, set `ICEBERG_REST__OPENID_PROVIDER_URI` to the URI of your OpenID Connect Provider. The catalog will then verify access tokens against this provider. The provider must serve the `.well-known/openid-configuration` endpoint under `${ICEBERG_REST__OPENID_PROVIDER_URI}/.well-known/openid-configuration`, and the openid-configuration needs to have the `jwks_uri` and `issuer` defined.
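A quick way to check whether your provider exposes the required fields (this is only a sanity-check sketch; the realm URL is a placeholder and `jq` is used purely for readability):

```bash
# Fetch the provider's discovery document and show the two fields the
# catalog relies on. Replace the URI with your own provider/realm.
curl -s "https://keycloak.local/realms/my-realm/.well-known/openid-configuration" \
  | jq '{issuer, jwks_uri}'
```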
If `ICEBERG_REST__OPENID_PROVIDER_URI` is set, every request needs to have an authorization header, e.g.
curl {your-catalog-url}/catalog/v1/transactions/commit -X POST -H "authorization: Bearer {your-token-here}" -H "content-type: application/json" -d ...
Variable | Example | Description |
---|---|---|
`ICEBERG_REST__OPENID_PROVIDER_URI` | `https://keycloak.local/realms/{your-realm}` | OpenID Provider URL. With Keycloak this is the URL pointing to your realm; for an Azure App Registration it would be something like `https://login.microsoftonline.com/{your-tenant-id-here}/v2.0/`. If this variable is not set, endpoints are not secured. |
Licensed under the Apache License, Version 2.0