Serverless data lake HTAP engine powered by Arrow × DuckDB × Iceberg

MIT License

PuffinDB 🐧

Kickoff meetup: Rovinj, Croatia, March 29-30, 2023

Introduction

This is a proposal for an open source project sponsored by STOIC. Its purpose is to make it easier to run DuckDB on serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) for executing read | write queries against objects managed by an Object Store (Amazon S3, Azure Blob Storage, Google Cloud Storage) and tables managed by a Lakehouse (Apache Iceberg, Apache Hudi, Delta Lake).

If you are using DuckDB client-side with any client application, adding PuffinDB will let you:

  • Collaborate on the same Iceberg tables with other users
  • Write back to an Iceberg table with ACID transactional integrity
  • Handle datasets that are too large for your client
  • Accelerate queries that run too slowly on your client
  • Integrate with external data sources (Cf. Edge-Driven Data Integration)
  • Accelerate the downloading of large tables to your client
  • Schedule fetching and client-side caching of remote datasets
  • Cache tables and run computations at the edge (Amazon CloudFront × Lambda@Edge)

All it takes is an AWS account and a few clicks on the AWS Marketplace (PuffinDB is free and runs in your VPC).

PuffinDB is an initiative of STOIC, and not DuckDB Labs or the DuckDB Foundation.

DuckDB and the DuckDB logo are trademarks of the DuckDB Foundation.

STOIC is a Silver Member of the DuckDB Foundation.

Beliefs

Outline

Features

Deployment

PuffinDB will support four complementary deployment options:

Philosophy

  • Developer-first — no-nonsense, zero friction
  • Minimalist architecture — fewer dependencies is better
  • Lowest latency — every millisecond counts
  • Elastic design — from kilobytes to petabytes
  • Less is goodness — clientless & serverless

FAQ

Please check our Frequently Asked Questions.

Roadmap

Please check our Roadmap.

Sponsors

This project was initiated and is currently funded by STOIC.

Please check our sponsors page for sponsorship opportunities.

Credits

This project leverages several DuckDB features implemented by DuckDB Labs and funded by STOIC:

  • Support for Apache Arrow streaming when using Node.js deployment (released)
  • Support for user-defined functions when using Node.js deployment (released)
  • Support for map-reduced queries with binary map results using new COMBINE function (released)
  • Support for import of Hive partitions (released)
  • Support for partitioned exports with COPY ... TO ... PARTITION_BY (released)
  • Support for SQL query parsing | stringifying through standard query API (under development)
  • Support for Azure Blob Storage (development starting soon)
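
To illustrate the idea behind map-reduced queries with binary map results, here is a minimal sketch in plain Python. It does not use DuckDB or its COMBINE function: the `STATE` layout and the `map_partial_avg`, `combine`, and `finalize` helpers are hypothetical names chosen for this example. Each "worker" reduces its own partition to a small binary partial state, and a coordinator merges those states without touching the raw rows again.

```python
import struct
from typing import Iterable, Optional

# Hypothetical binary layout for a partial AVG state:
# little-endian float64 running sum followed by uint64 row count.
STATE = struct.Struct("<dQ")


def map_partial_avg(values: Iterable[float]) -> bytes:
    """Map step: one worker reduces its partition to a binary partial state."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return STATE.pack(total, count)


def combine(states: Iterable[bytes]) -> bytes:
    """Reduce step: merge partial states without revisiting the raw rows."""
    total, count = 0.0, 0
    for s in states:
        t, c = STATE.unpack(s)
        total += t
        count += c
    return STATE.pack(total, count)


def finalize(state: bytes) -> Optional[float]:
    """Turn the merged state into the final average (None for zero rows)."""
    total, count = STATE.unpack(state)
    return total / count if count else None


# Three partitions, as if processed by three separate serverless functions.
partitions = [[1.0, 2.0, 3.0], [10.0], []]
partials = [map_partial_avg(p) for p in partitions]
result = finalize(combine(partials))
print(result)
```

Averaging the per-partition averages would give a wrong answer here, which is why the partial state carries both the sum and the count. The same pattern applies to any aggregate whose intermediate state can be merged associatively.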

We are also considering funding the following projects:

  • Expose core methods for distributed query planner
  • Support for SELECT * THROUGH 'https://myPuffinDB.com/' FROM remoteTable syntax (Cf. EDDI)
  • Support for FIXED fixed-length character strings (Cf. #3)

This project was initially inspired by this excellent article from Alon Agmon.

Discussions

Most discussions about this project are currently taking place on the @ghalimi Twitter account.

For a lower-frequency alternative, please follow @PuffinDB.

Notes

PuffinDB should not be confused with the Puffin file format.

Be stoic, be kind, be cool. Like a puffin...

Sutoiku, Inc.