PuffinDB 🐧

Serverless data lake HTAP engine powered by Arrow × DuckDB × Iceberg

Kickoff meetup: Rovinj, Croatia, March 29-30, 2023

Introduction

This is a proposal for an open source project sponsored by STOIC. Its purpose is to make it easier to run DuckDB on serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) for executing read | write queries against objects managed by an Object Store (Amazon S3, Azure Blob Storage, Google Cloud Storage) and tables managed by a Lakehouse (Apache Iceberg, Apache Hudi, Delta Lake).

If you are using DuckDB client-side with any client application, adding PuffinDB will let you:

Collaborate on the same Iceberg tables with other users
Write back to an Iceberg table with ACID transactional integrity
Handle datasets that are too large for your client
Accelerate queries that run too slowly on your client
Integrate with external data sources (Cf. Edge-Driven Data Integration)
Accelerate the downloading of large tables to your client
Schedule fetching and client-side caching of remote datasets
Cache tables and run computations at the edge (Amazon CloudFront × Lambda@Edge)

All it takes is an AWS account and a few clicks on the AWS Marketplace (PuffinDB is free and runs on your VPC).

PuffinDB is an initiative of STOIC, and not DuckDB Labs or the DuckDB Foundation.

DuckDB and the DuckDB logo are trademarks of the DuckDB Foundation.

STOIC is a Silver Member of the DuckDB Foundation.

Beliefs

Nothing beats SQL because nothing can beat maths
Arrow × DuckDB × Iceberg are game changers
Edge-Driven Data Integration is the way forward
Clientless + Serverless = Goodness

Outline

Serverless architecture
Supporting both read and write queries (HTAP)
Implemented in Node.js (to be upgraded to Bun) and Rust
Powered by Arrow × DuckDB × Iceberg
Powered by Redis (using Amazon ElastiCache for Redis) for superfast shuffles
Integrated with Apache Iceberg, Apache Hudi, and Delta Lake
Deployed on AWS first, then Microsoft Azure and Google Cloud
Invoked through an HTTP endpoint served by Amazon API Gateway
Deployed as two AWS Lambda functions
Integrated with Amazon Athena
Packaged as an AWS CloudFormation template (using Terraform)
Released as a free AWS Marketplace product
Running on your Amazon VPC
Licensed under the MIT License
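
The outline states that PuffinDB is invoked through an HTTP endpoint served by Amazon API Gateway, but this README does not define the request format. The sketch below only illustrates what building such a request could look like. The endpoint URL, the `/query` path, and the `sql` and `mode` payload fields are all assumptions made for illustration, not the project's actual API.

```typescript
// Hypothetical request shape: the README does not define the real API.
type InvocationMode = "sync" | "async";

interface QueryRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) an HTTP request carrying a SQL query.
function buildQueryRequest(
  endpoint: string,
  sql: string,
  mode: InvocationMode = "sync",
): QueryRequest {
  // Strip trailing slashes so the path is joined predictably.
  const base = endpoint.replace(/\/+$/, "");
  return {
    url: `${base}/query`, // assumed path
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ sql, mode }), // assumed payload fields
  };
}

const request = buildQueryRequest(
  "https://example.execute-api.us-east-1.amazonaws.com/prod/", // placeholder URL
  "SELECT COUNT(*) FROM orders",
);
console.log(request.url);
```

On a runtime with a global `fetch`, the resulting object could be sent with `fetch(request.url, { method: request.method, headers: request.headers, body: request.body })`.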

Features

Distributed SQL query engine powered by DuckDB
Distributed SQL query planner powered by DuckDB
Distributed SQL query execution coordinated by Redis (using Amazon ElastiCache for Redis)
Read queries executed by DuckDB (on AWS Lambda)
Write queries against Object Store objects executed by DuckDB
Write queries against Lakehouse tables executed by Amazon Athena
Built-in Malloy to SQL translator
Built-in PRQL to SQL translator
Built-in SQL dialect converter
Built-in SQL parser | stringifier
Sub-500ms table scanning API (fetch table partitions from filter predicates) running on a standalone function
Concurrent support for multiple table formats (Apache Iceberg, Apache Hudi, and Delta Lake)
Concurrent support for multiple Lakehouse instances
Native support for all Lakehouse Catalogs (AWS Glue Data Catalog, Amazon DynamoDB, and Amazon RDS)
Support for authentication and authorization
Support for synchronous and asynchronous invocations
Support for cascading remote invocations with SELECT THROUGH syntax
Joins across heterogeneous tables using different table formats
Joins across tables managed by different Lakehouse instances
Small filtered partitions cached on the AWS Lambda function
Query results returned as HTTP response, serialized on Object Store, or streamed through Apache Arrow
Query results cached on Object Store (Amazon S3) and CDN (Amazon CloudFront)
Query logs recorded as JSON values in a Redis cluster (using Amazon ElastiCache for Redis)
Transparent support for all file formats supported by DuckDB and the Lakehouse
Transparent support for all table lifecycle features offered by the Lakehouse
Planned support for deployment on Amazon EC2 and AWS Fargate
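
The table scanning API listed above fetches table partitions from filter predicates. The general idea behind this kind of partition pruning can be sketched as follows. This is a minimal illustration, not PuffinDB's implementation: the `PartitionStats` and `RangePredicate` types, the single numeric column, and the S3 paths are hypothetical, and real table formats keep richer statistics.

```typescript
// Hypothetical partition metadata: one min/max range per partition for a
// single numeric column.
interface PartitionStats {
  path: string;
  min: number;
  max: number;
}

// A closed-range predicate, e.g. `col BETWEEN lower AND upper`.
interface RangePredicate {
  lower: number;
  upper: number;
}

// Keep only partitions whose [min, max] range overlaps the predicate range.
// Partitions that cannot contain matching rows are skipped without being read.
function prunePartitions(
  partitions: PartitionStats[],
  predicate: RangePredicate,
): string[] {
  return partitions
    .filter((p) => p.max >= predicate.lower && p.min <= predicate.upper)
    .map((p) => p.path);
}

const partitions: PartitionStats[] = [
  { path: "s3://bucket/orders/year=2021/part-0.parquet", min: 1, max: 100 },
  { path: "s3://bucket/orders/year=2022/part-0.parquet", min: 101, max: 200 },
  { path: "s3://bucket/orders/year=2023/part-0.parquet", min: 201, max: 300 },
];

// Only the 2022 and 2023 partitions overlap the range [150, 250].
const selected = prunePartitions(partitions, { lower: 150, upper: 250 });
console.log(selected);
```

Because the check only touches metadata, it can run on a small standalone function before any data file is opened.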

Deployment

PuffinDB will support four complementary deployment options:

Node.js module deeply integrated within your own tool or application
AWS Lambda deployed within your own cloud platform
AWS CloudFormation template deployed within your own VPC
AWS Marketplace product added to your own cloud environment

Philosophy

Developer-first — no nonsense, zero friction
Minimalist architecture — fewer dependencies are better
Lowest latency — every millisecond counts
Elastic design — from kilobytes to petabytes
Less is goodness — clientless & serverless

FAQ

Please check our Frequently Asked Questions.

Roadmap

Please check our Roadmap.

Credits

This project leverages several DuckDB features implemented by DuckDB Labs and funded by STOIC:

Support for Apache Arrow streaming when using Node.js deployment (released)
Support for user-defined functions when using Node.js deployment (released)
Support for map-reduced queries with binary map results using new COMBINE function (released)
Support for import of Hive partitions (released)
Support for partitioned exports with COPY ... TO ... PARTITION_BY (released)
Support for SQL query parsing | stringifying through standard query API (under development)
Support for Azure Blob Storage (development starting soon)
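
The map-reduced queries mentioned above rely on workers producing partial results that are later combined. The sketch below shows that pattern for an average, in plain TypeScript. It is not DuckDB's `COMBINE` function, whose signature is not given in this README. The `PartialAvg` type and the function names are hypothetical.

```typescript
// Partial aggregate produced by one worker over its share of the rows.
interface PartialAvg {
  sum: number;
  count: number;
}

// Map step: each worker reduces its own rows to a small partial state.
function mapPartition(values: number[]): PartialAvg {
  return {
    sum: values.reduce((acc, v) => acc + v, 0),
    count: values.length,
  };
}

// Reduce step: partial states are merged. Merging is associative, so the
// order in which workers report does not matter.
function combine(a: PartialAvg, b: PartialAvg): PartialAvg {
  return { sum: a.sum + b.sum, count: a.count + b.count };
}

// Finalize: turn the merged state into the query result.
function finalize(state: PartialAvg): number | null {
  return state.count === 0 ? null : state.sum / state.count;
}

const workerInputs: number[][] = [[1, 2, 3], [4, 5], [6]];
const merged = workerInputs
  .map(mapPartition)
  .reduce(combine, { sum: 0, count: 0 });
const average = finalize(merged);
console.log(average); // 3.5
```

Shipping the `(sum, count)` pair rather than a per-worker average is what keeps the final result exact when workers process different numbers of rows.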

We are also considering funding the following projects:

Expose core methods for distributed query planner
Support for SELECT * THROUGH 'https://myPuffinDB.com/' FROM remoteTable syntax (Cf. EDDI)
Support for FIXED fixed-length character strings (Cf. #3)
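
To make the proposed `SELECT ... THROUGH` syntax more concrete, the sketch below splits such a statement into the remote endpoint and a plain SQL statement. It assumes that the statement without its `THROUGH` clause is what would be forwarded to the endpoint. The README does not specify these semantics, and the `splitThrough` helper and its naive pattern are hypothetical.

```typescript
// Result of splitting a statement written in the proposed syntax.
interface RemoteQuery {
  endpoint: string; // URL given after THROUGH
  sql: string; // statement with the THROUGH clause removed
}

// Matches: SELECT <columns> THROUGH '<url>' FROM <rest>
// Naive pattern for illustration only: it does not handle nested queries,
// comments, or quotes inside the column list.
const THROUGH_PATTERN =
  /^\s*SELECT\s+([\s\S]+?)\s+THROUGH\s+'([^']+)'\s+(FROM\s+[\s\S]+)$/i;

function splitThrough(statement: string): RemoteQuery | null {
  const match = THROUGH_PATTERN.exec(statement);
  if (match === null) {
    return null; // plain statement: nothing to forward
  }
  const columns = match[1] ?? "";
  const endpoint = match[2] ?? "";
  const rest = match[3] ?? "";
  return { endpoint, sql: `SELECT ${columns} ${rest}` };
}

const remote = splitThrough(
  "SELECT * THROUGH 'https://myPuffinDB.com/' FROM remoteTable",
);
console.log(remote);
```

A production version would use a real SQL parser rather than a regular expression.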

This project was initially inspired by this excellent article by Alon Agmon.

Discussions

Most discussions about this project are currently taking place on the @ghalimi Twitter account.

For a lower-frequency alternative, please follow @PuffinDB.

Notes

PuffinDB should not be confused with the Puffin file format.

Be stoic, be kind, be cool. Like a puffin...

mxmzdlv/puffin