Serverless data lake HTAP engine powered by Arrow × DuckDB × Iceberg
Kickoff meetup: Rovinj, Croatia, March 29-30, 2023
This is a proposal for an open source project sponsored by STOIC. Its purpose is to make it easier to run DuckDB on serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) for executing read | write queries against objects managed by an Object Store (Amazon S3, Azure Blob Storage, Google Cloud Storage) and tables managed by a Lakehouse (Apache Iceberg, Apache Hudi, Delta Lake).
If you are using DuckDB client-side with any client application, adding PuffinDB will let you:
- Collaborate on the same Iceberg tables with other users
- Write back to an Iceberg table with ACID transactional integrity
- Handle datasets that are too large for your client
- Accelerate queries that run too slowly on your client
- Integrate with external data sources (Cf. Edge-Driven Data Integration)
- Accelerate the downloading of large tables to your client
- Schedule fetching and client-side caching of remote datasets
- Cache tables and run computations at the edge (Amazon CloudFront × Lambda@Edge)
All it takes is an AWS account and a few clicks on the AWS Marketplace (PuffinDB is free and runs on your VPC).
PuffinDB is an initiative of STOIC, and not DuckDB Labs or the DuckDB Foundation.
DuckDB and the DuckDB logo are trademarks of the DuckDB Foundation.
STOIC is a Silver Member of the DuckDB Foundation.
- Nothing beats SQL because nothing can beat maths
- Arrow × DuckDB × Iceberg are game changers
- Edge-Driven Data Integration is the way forward
- Clientless + Serverless = Goodness
- Serverless architecture
- Supporting both read and write queries (HTAP)
- Implemented in Node.js (to be upgraded to Bun) and Rust
- Powered by Arrow × DuckDB × Iceberg
- Powered by Redis (using Amazon ElastiCache for Redis) for superfast shuffles
- Integrated with Apache Iceberg, Apache Hudi, and Delta Lake
- Deployed on AWS first, then Microsoft Azure and Google Cloud
- Invoked through an HTTP endpoint served by Amazon API Gateway
- Deployed as two AWS Lambda functions
- Integrated with Amazon Athena
- Packaged as an AWS CloudFormation template (using Terraform)
- Released as a free AWS Marketplace product
- Running on your Amazon VPC
- Licensed under MIT License
- Distributed SQL query engine powered by DuckDB
- Distributed SQL query planner powered by DuckDB
- Distributed SQL query execution coordinated by Redis (using Amazon ElastiCache for Redis)
- Read queries executed by DuckDB (on AWS Lambda)
- Write queries against Object Store objects executed by DuckDB
- Write queries against Lakehouse tables executed by Amazon Athena
- Built-in Malloy to SQL translator
- Built-in PRQL to SQL translator
- Built-in SQL dialect converter
- Built-in SQL parser | stringifier
- Sub-500ms table scanning API (fetch table partitions from filter predicates) running on standalone function
- Concurrent support for multiple table formats (Apache Iceberg, Apache Hudi, and Delta Lake)
- Concurrent support for multiple Lakehouse instances
- Native support for all Lakehouse Catalogs (AWS Glue Data Catalog, Amazon DynamoDB, and Amazon RDS)
- Support for authentication and authorization
- Support for synchronous and asynchronous invocations
- Support for cascading remote invocations with SELECT THROUGH syntax
- Joins across heterogeneous tables using different table formats
- Joins across tables managed by different Lakehouse instances
- Small filtered partitions cached on AWS Lambda function
- Query results returned as HTTP response, serialized on Object Store, or streamed through Apache Arrow
- Query results cached on Object Store (Amazon S3) and CDN (Amazon CloudFront)
- Query logs recorded as JSON values in Redis cluster (using Amazon ElastiCache for Redis)
- Transparent support for all file formats supported by DuckDB and the Lakehouse
- Transparent support for all table lifecycle features offered by the Lakehouse
- Planned support for deployment on Amazon EC2 and AWS Fargate
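The engine routing described in the list above (read queries and Object Store writes executed by DuckDB, Lakehouse writes executed by Amazon Athena) can be sketched as follows. This is only an illustration under stated assumptions: the type names, the `QueryRequest` shape, and the keyword-based write detection are hypothetical and do not come from PuffinDB itself. A real implementation would presumably rely on the built-in SQL parser rather than on the first keyword.

```typescript
// Hypothetical types for illustration; none of these names are PuffinDB APIs.
type Engine = "duckdb" | "athena";
type TargetKind = "object-store" | "lakehouse";

interface QueryRequest {
  sql: string;        // SQL text submitted by the client
  target: TargetKind; // what the query reads from or writes to
}

// Naive assumption: a statement is a write if it starts with one of these keywords.
// Statements such as "WITH ... INSERT" would be misclassified by this check.
const WRITE_KEYWORDS = new Set([
  "INSERT", "UPDATE", "DELETE", "MERGE", "COPY", "CREATE", "DROP", "ALTER",
]);

function isWriteQuery(sql: string): boolean {
  const firstWord = sql.trim().split(/\s+/)[0]?.toUpperCase() ?? "";
  return WRITE_KEYWORDS.has(firstWord);
}

// Routing rule taken from the feature list:
//   read queries                -> DuckDB
//   writes to Object Store      -> DuckDB
//   writes to Lakehouse tables  -> Amazon Athena
function routeQuery(request: QueryRequest): Engine {
  if (!isWriteQuery(request.sql)) {
    return "duckdb";
  }
  return request.target === "lakehouse" ? "athena" : "duckdb";
}
```

For example, `routeQuery({ sql: "INSERT INTO t VALUES (1)", target: "lakehouse" })` selects Athena, while the same statement with `target: "object-store"` stays on DuckDB.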
PuffinDB will support four complementary deployment options:
- Node.js module deeply integrated within your own tool or application
- AWS Lambda deployed within your own cloud platform
- AWS CloudFormation template deployed within your own VPC
- AWS Marketplace product added to your own cloud environment
- Developer-first: no nonsense, zero friction
- Minimalist architecture: fewer dependencies are better
- Lowest latency: every millisecond counts
- Elastic design: from kilobytes to petabytes
- Less is goodness: clientless & serverless
Please check our Frequently Asked Questions.
Please check our Roadmap.
This project was initiated and is currently funded by STOIC.
Please check our sponsors page for sponsorship opportunities.
This project leverages several DuckDB features implemented by DuckDB Labs and funded by STOIC:
- Support for Apache Arrow streaming when using Node.js deployment (released)
- Support for user-defined functions when using Node.js deployment (released)
- Support for map-reduced queries with binary map results using new COMBINE function (released)
- Support for import of Hive partitions (released)
- Support for partitioned exports with COPY ... TO ... PARTITION_BY (released)
- Support for SQL query parsing | stringifying through standard query API (under development)
- Support for Azure Blob Storage (development starting soon)
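To make the map-reduce item above more concrete, here is a minimal sketch of the general pattern it refers to: each worker computes a partial aggregate state over its own partition, and the states are then combined and finalized. It is written in plain TypeScript and does not call DuckDB; the `AvgState` type and the function names are hypothetical and are not the signature of DuckDB's COMBINE function.

```typescript
// Partial state for computing an average: enough to merge results exactly.
interface AvgState {
  sum: number;
  count: number;
}

// "Map" step: each worker reduces its own partition to a small state.
function mapPartition(values: number[]): AvgState {
  let sum = 0;
  for (const value of values) {
    sum += value;
  }
  return { sum, count: values.length };
}

// "Combine" step: merging two states is associative, so order does not matter.
function combineStates(a: AvgState, b: AvgState): AvgState {
  return { sum: a.sum + b.sum, count: a.count + b.count };
}

// "Finalize" step: turn the merged state into the final answer.
function finalizeAvg(state: AvgState): number | null {
  return state.count === 0 ? null : state.sum / state.count;
}

// Three partitions, as if processed by three separate functions.
const partitionedValues = [[1, 2, 3], [4, 5], [6]];
const partialStates = partitionedValues.map(mapPartition);
const mergedState = partialStates.reduce(combineStates, { sum: 0, count: 0 });
const overallAverage = finalizeAvg(mergedState);
```

Averaging the per-partition averages would give a wrong result when partitions differ in size, which is why the intermediate state carries both the sum and the count.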
We are also considering funding the following projects:
- Expose core methods for distributed query planner
- Support for SELECT * THROUGH 'https://myPuffinDB.com/' FROM remoteTable syntax (Cf. EDDI)
- Support for FIXED fixed-length character strings (Cf. #3)
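The proposed SELECT THROUGH syntax listed above is not specified in detail here, so the following is only a sketch of one way a client could interpret it: split the statement into the remote endpoint and the query to forward to it. The `RemoteInvocation` type, the `splitThrough` function, and the regular expression are assumptions made for illustration, and the pattern only handles the simple single-statement form shown above.

```typescript
// Hypothetical result shape: where to send the query, and what to send.
interface RemoteInvocation {
  endpoint: string;
  sql: string;
}

// Matches: SELECT <projection> THROUGH '<endpoint>' FROM <rest of query>
const THROUGH_PATTERN =
  /^\s*SELECT\s+([\s\S]+?)\s+THROUGH\s+'([^']+)'\s+(FROM\s[\s\S]+)$/i;

// Returns null when the statement has no THROUGH clause,
// meaning it can be executed locally as-is.
function splitThrough(sql: string): RemoteInvocation | null {
  const match = THROUGH_PATTERN.exec(sql);
  if (match === null) {
    return null;
  }
  const projection = match[1] ?? "";
  const endpoint = match[2] ?? "";
  const rest = (match[3] ?? "").trim();
  // The forwarded query is the original statement with the THROUGH clause removed.
  return { endpoint, sql: `SELECT ${projection} ${rest}` };
}

const invocation = splitThrough(
  "SELECT * THROUGH 'https://myPuffinDB.com/' FROM remoteTable"
);
```

Here `invocation` carries the endpoint `https://myPuffinDB.com/` and the forwarded query `SELECT * FROM remoteTable`. How the forwarded query is actually transported and authenticated is outside the scope of this sketch.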
This project was initially inspired by this excellent article from Alon Agmon.
Most discussions about this project are currently taking place on the @ghalimi Twitter account.
For a lower-frequency alternative, please follow @PuffinDB.
PuffinDB should not be confused with the Puffin file format.
Be stoic, be kind, be cool. Like a puffin...