This Awesome List aims to provide an overview of open-source projects related to data engineering. This is a community effort: please contribute and send your pull requests to grow this list! For a list including non-OSS tools, see this amazing Awesome List.
-
Apache Spark - A unified analytics engine for large-scale data processing.
-
Apache Superset - A modern, enterprise-ready business intelligence web application.
-
Metabase - An easy way for everyone in your company to ask questions and learn from data.
-
Redash - All the tools to unlock your data.
-
Apache Calcite - SQL parser, building blocks for datastores.
-
Apache Cassandra - Open Source distributed wide column store, NoSQL database.
-
Apache Druid - A high performance real-time analytics database.
-
Apache HBase - Open Source non-relational distributed database.
-
Apache Pinot - A realtime distributed OLAP datastore.
-
ClickHouse - Open Source distributed column-oriented DBMS.
-
InfluxDB - Purpose-Built Open Source Time Series Database.
-
Postgres - The World’s Most Advanced Open Source Relational Database.
-
MinIO - A high-performance, distributed object storage system compatible with AWS S3.
-
Amundsen - A metadata catalogue.
-
Apache Atlas - Data governance and metadata framework for Hadoop.
-
DataHub - A Generalized Metadata Search & Discovery Tool.
-
Metacat - Unified metadata exploration API service.
-
Teiid - A relational abstraction of different information sources.
-
Presto - Distributed SQL Query Engine for Big Data.
-
Apache Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
-
Apache Avro - A data serialization system.
-
Apache Parquet - A columnar storage format.
-
Apache ORC - Another columnar storage format.
-
Apache Thrift - Data type and service interface definitions and code generator.
-
Cap’n Proto - A data interchange format and capability-based RPC system.
-
FlatBuffers - An efficient cross-platform serialization library for C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust.
-
Protocol Buffers - Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data.
-
MessagePack - An efficient binary serialization format. Like JSON, it lets you exchange data among multiple languages.
-
Apache Camel - Easily integrate various systems consuming or producing data.
-
Kafka Connect - Reusable framework to move data in and out of Apache Kafka.
-
Logstash - Open Source server-side data processing pipeline.
-
Apache ActiveMQ - Flexible & Powerful Multi-Protocol Messaging.
-
Apache Kafka - A distributed commit log with messaging capabilities.
-
Apache Pulsar - A distributed pub-sub messaging system.
-
Liiklus - An event gateway that provides reactive gRPC/RSocket access to Kafka-like systems.
-
Nakadi - A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues.
-
NATS - A simple, secure and high performance messaging system.
-
RabbitMQ - A message broker.
-
Waltz - A quorum-based distributed write-ahead log for replicating transactions.
-
ZeroMQ - An open-source universal, high-performance messaging library.
-
CloudEvents - A specification for describing event data in a common way.
-
Apache Beam - Implement batch and streaming data processing jobs that run on any execution engine.
-
Apache Flink - Stateful computations over data streams.
-
Apache Kafka Streams - A client library for building applications and microservices, where the input and output data are stored in Kafka.
-
Apache Samza - A distributed stream processing framework.
-
Apache Spark Structured Streaming - A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
-
Apache Storm - A distributed realtime computation system.
-
Great Expectations - Helps data teams eliminate pipeline debt through data testing.
-
Awesome Workflow Engines - A curated list of awesome open source workflow engines.
-
Apache Airflow - A platform created by the community to programmatically author, schedule and monitor workflows.
-
Prefect - A workflow management system designed for modern infrastructure.
-
Apache NiFi - Supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
only overview contents, no specific tools
-
NOSQL Database Management Systems - List of NoSQL database management systems.
-
DB-Engines - Knowledge base of relational and NoSQL database management systems.
Not quite sure yet where to put these