/awesome-olap

A curated list of awesome Online Analytical Processing databases, frameworks, ressources and other awesomeness.

MIT LicenseMIT

Awesome OLAP Awesome

A curated list of awesome open-source Online Analytical Processing databases, frameworks, ressources, tools and other awesomeness, for data engineers.

OLAP Databases

Real-time analytics

The following columnar databases use a shared-nothing architecture and provide a sub-second response time. DDL, DML and DCL are operated via SQL. These databases also support tiering for long-term cold storage.

Search engines

  • Elasticsearch - Search and analytics engine based on Apache Lucene.
  • OpenSearch - Apache 2.0 fork of Elasticsearch.
  • Quickwit - Search engine on top of object storage, using shared-everything architecture.
  • Meilisearch - Open source search engine, aims to be a ready-to-go solution.
  • Typesense - Оpen-source, typo-tolerant search engine optimized for instant search-as-you-type experiences and developer productivity.

NewSQL

  • Citus - PostgreSQL compatible distributed table.
  • TiDB - MySQL compatible SQL database that supports hybrid transactional and analytical processing workloads.

Timeseries

Managed cloud services

Data lake

The data lake approach (or "lakehouse") is a semi-structured schema that sit on top of object storage in the cloud.

It is composed of a few layers (from lower to higher level): codec, file format, table format + metastore, and the ingestion/query layer.

File formats and serialization

These formats are popular for shared-everything databases, using object storage as persistence layer. The data is organized in row or column, with strict schema definition. These files are immutable and offer partial reads (only headers, metadata, data page, etc). Mutation require a new upload. Most formats support nested schema, codecs, compression and data encryption. Index can be added to file metadata for faster processing.

A single file can weight between tens of MB to a few GB. Lot of small files require more merge operation. Larger files can be costly to update.

  • Apache Arrow Columnar Format - Columnar format for in-memory Apache Arrow processing.
  • Apache Avro - Row-oriented serialization for data streaming purpose.
  • Apache ORC - Column-oriented serialization for data storage purpose. Part of Hadoop platform.
  • Apache Parquet - Column-oriented serialization for data storage purpose.
  • Apache Thrift - Row-oriented serialization for RPC purpose.
  • Google Protobuf - Row-oriented serialization for RPC purpose.
  • Schema Registry - Centralized repository for validating row-oriented events. Part of Kafka and Confluent platform.

Open table formats

Open table formats are abstraction layer on top of Avro/Parquet files, with support for ACID transaction, CDC, partitioning, mixed streaming/batching processing, schema evolution and mutation. Schema and statistics are stored in a metastore, data is persisted locally or in a remote/cloud object storage.

Open tables are a cost-effective datawarehouse for petabyte scale.

Comparison:

πŸ‘† Warning: pre-2022 articles should be considered as out-of-date, as open table formats are evolving quickly.

Metastore

Object Storage

Codecs, encoding and compression

Brokers and distributed messaging

Ingestion and querying

Stream processing

Process a set of data in real-time (or near-real-time), as it is being generated.

Batch processing

Process periodically a large amount of data in a single batch.

In-memory processing

Non real-time SQL queries executed against a large database can be processed locally. This method might not fit into memory or lead to very long job duration.

  • Apache Arrow - Low-level in-memory data processing. Zero-copy data manipulation for any language, via gRPC/IPC interfaces.
  • Apache Arrow Datafusion - High level SQL interface for Apache Arrow.
  • delta-rs - Standalone DeltaLake driver for Python and Rust. Do not depend on Spark.
  • Delta Standalone - Standalone DeltaLake driver for Java and Scala. Do not depend on Spark.
  • DuckDB - In-process SQL query engine for processing Parquet files. Built on top of Apache Arrow.
  • Pandas - Python data analysis and manipulation tool.
  • clickhouse-local - Lightweight CLI version of Clickhouse for running SQL queries against CSV, JSON, Parquet, etc files.

Distributed SQL processing

These SQL engines distribute SQL queries processing of very large database on a cluster. Support of ANSI SQL.

  • Apache Spark SQL - Distributed SQL query engine that sit on top of Spark.
  • ksql - SQL interface for Kafka.
  • PrestoDB - Distributed SQL query engine.
  • Trino - Distributed SQL query engine. Fork of PrestoDB.

Scheduler

These tools allow to orchestrate, schedule and monitor repetitive data transformations, in a workflow manner.

ETL, ELT and reverse ETL

The popular acronym for Extracting, Transforming and Loading data. ELT performs data transformations directly within the data warehouse. Reverse ETL is the process of copying data from your datawarehouse to external tools or SaaS.

  • Airbyte - ELT.
  • Census - Reverse ETL.
  • RudderStack - Customer Data Platform. Pipeline between a tracking plan, event transformation, and destination tools (datawarehouse or SaaS).

Datasets

Benchmark

Readings

Papers

Architecture

Data modeling

Index

Vector similarity search

Vectorized query processing

Querying

Transactions

Consensus

Challenging platforms

Blogs to follow

More

People to follow

// TODO

Events

// TODO

Communities

// TODO

🀝 Contributing

Contributions of any kind welcome, just follow the guidelines!

πŸ‘€ Contributors

Contributors

πŸ’« Show your support

Give a ⭐️ if this project helped you!

GitHub Sponsors

πŸ“ License

Copyright Β© 2023 Samuel Berthe.

This project is MIT licensed.