Awesome Data Engineering

A curated list of data engineering tools for software developers

List of content

Databases
Ingestion
File System
Serialization format
Stream Processing
Batch Processing
Charts and Dashboards
Workflow
Datasets
Monitoring
Docker

Databases

Relational
- RQLite Replicated SQLite using the Raft consensus protocol
- MySQL The world's most popular open source database.
  - TiDB TiDB is a distributed NewSQL database compatible with MySQL protocol
  - Percona XtraBackup Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®
  - mysql_utils Pinterest MySQL Management Tools
- MariaDB An enhanced, drop-in replacement for MySQL.
- PostgreSQL The world's most advanced open source database.
- Amazon RDS Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud.
- Crate.IO Scalable SQL database with the NoSQL goodies.
Key-Value
- Redis An open source, BSD licensed, advanced key-value cache and store.
- Riak A distributed database designed to deliver maximum data availability by distributing data across multiple servers.
- AWS DynamoDB A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.
- HyperDex HyperDex is a scalable, searchable key-value store
- SSDB A high performance NoSQL database supporting many data structures, an alternative to Redis
- Kyoto Tycoon Kyoto Tycoon is a lightweight network server on top of the Kyoto Cabinet key-value database, built for high performance and concurrency
- IonDB A key-value store for microcontroller and IoT applications
Column
- Cassandra The right choice when you need scalability and high availability without compromising performance.
  - Cassandra Calculator This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.
  - CCM A script to easily create and destroy an Apache Cassandra cluster on localhost
  - ScyllaDB NoSQL data store using the seastar framework, compatible with Apache Cassandra http://www.scylladb.com/
- HBase The Hadoop database, a distributed, scalable, big data store.
- Infobright Column oriented, open-source analytic database provides both speed and efficiency.
- AWS Redshift A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
- FiloDB (https://github.com/tuplejump/FiloDB) Distributed. Columnar. Versioned. Streaming. SQL.
- HPE Vertica Distributed, MPP columnar database with extensive analytics SQL.
Document
- MongoDB An open-source, document database designed for ease of development and scaling.
  - Percona Server for MongoDB Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.
  - MemDB Distributed Transactional In-Memory Database (based on MongoDB)
- Elasticsearch Search & Analyze Data in Real Time.
- Couchbase The highest performing NoSQL distributed database.
- RethinkDB The open-source database for the realtime web.
Graph
- Neo4j The world’s leading graph database.
- OrientDB 2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.
- ArangoDB A distributed free and open-source database with a flexible data model for documents, graphs, and key-values.
- Titan A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
- FlockDB A distributed, fault-tolerant graph database by Twitter.
Distributed
- Datomic The fully transactional, cloud-ready, distributed database.
- Apache Geode An open source, distributed, in-memory database for scale-out applications.
- Gaffer A large-scale graph database
Timeseries
- InfluxDB Scalable datastore for metrics, events, and real-time analytics.
- OpenTSDB A scalable, distributed Time Series Database.
- KairosDB Fast scalable time series database.
- Heroic A scalable time series database based on Cassandra and Elasticsearch, by Spotify
- Druid Column oriented distributed data store ideal for powering interactive applications
- Riak-TS Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data
- Akumuli Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
- Rhombus A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
- Dalmatiner DB Fast distributed metrics database
- Blueflood A distributed system designed to ingest and process time series data
- Timely Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.
Other
- Tarantool Tarantool is an in-memory database and application server.
- GreenPlum The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes.
- Cayley An open-source graph database from Google.
- SnappyData SnappyData: OLTP + OLAP database built on Apache Spark

Data Ingestion

Kafka Publish-subscribe messaging rethought as a distributed commit log.
- Camus LinkedIn's Kafka to HDFS pipeline.
- BottledWater Change data capture from PostgreSQL into Kafka
- kafkat Simplified command-line administration for Kafka brokers
- kafkacat Generic command line non-JVM Apache Kafka producer and consumer
- pg-kafka A PostgreSQL extension to produce messages to Apache Kafka
- librdkafka The Apache Kafka C/C++ library
- kafka-docker Kafka in Docker
- kafka-manager A tool for managing Apache Kafka
- kafka-node Node.js client for Apache Kafka 0.8
- Secor Pinterest's Kafka to S3 distributed consumer
- Kafka-logger Kafka-winston logger for Node.js from Uber
AWS Kinesis A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
RabbitMQ Robust messaging for applications.
FluentD An open source data collector for unified logging layer.
Embulk An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
Apache Sqoop A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Heka Data Acquisition and Processing Made Easy
Gobblin Universal data ingestion framework for Hadoop from LinkedIn

File System

HDFS
- Snakebite A pure python HDFS client
AWS S3
- smart_open Utils for streaming large files (S3, HDFS, gzip, bz2)
Tachyon Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
CEPH Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability
OrangeFS Orange File System is a branch of the Parallel Virtual File System
SnackFS SnackFS is our bite-sized, lightweight HDFS compatible FileSystem built over Cassandra
GlusterFS Gluster Filesystem
XtreemFS fault-tolerant distributed file system for all storage needs
SeaweedFS Seaweed-FS is a simple and highly scalable distributed file system. It has two objectives: to store billions of files and to serve the files fast. Instead of supporting full POSIX file system semantics, Seaweed-FS chooses to implement only a key~file mapping. By analogy with "NoSQL", you can call it "NoFS".
S3QL S3QL is a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.
LizardFS LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.

Serialization format

Apache Avro Apache Avro™ is a data serialization system
Apache Parquet Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
- Snappy A fast compressor/decompressor. Used with Parquet
- PigZ A parallel implementation of gzip for modern multi-processor, multi-core machines
Apache ORC The smallest, fastest columnar storage for Hadoop workloads
Apache Thrift The Apache Thrift software framework, for scalable cross-language services development
ProtoBuf Protocol Buffers - Google's data interchange format
SequenceFile SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats
Kryo Kryo is a fast and efficient object graph serialization framework for Java

Stream Processing

Spark Streaming Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
Apache Flink Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
Apache Storm Apache Storm is a free and open source distributed realtime computation system
Apache Samza Apache Samza is a distributed stream processing framework
Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data
VoltDB
PipelineDB The Streaming SQL Database https://www.pipelinedb.com
Spring Cloud Dataflow Streaming and tasks execution between Spring Boot apps

Batch Processing

Hadoop MapReduce Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner
Spark
- Spark Packages A community index of packages for Apache Spark
- Deep Spark Connecting Apache Spark with different data stores
- Spark RDD API Examples by Zhen He
- Livy Livy, the REST Spark Server
AWS EMR A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
Flink An open source platform for scalable batch and stream data processing.
Tez An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.

Batch ML
- H2O Fast scalable machine learning API for smarter applications.
- Mahout An environment for quickly creating scalable performant machine learning applications.
- Spark MLlib Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
Batch Graph
- GraphLab Create A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.
- Giraph An iterative graph processing system built for high scalability.
- Spark GraphX Apache Spark's API for graphs and graph-parallel computation.
Batch SQL
- Presto A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
- Hive Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
  - Hivemall Scalable machine learning library for Hive/Hadoop.
  - PyHive Python interface to Hive and Presto.
- Drill Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.

Charts and Dashboards

Highcharts A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
ZingChart Fast JavaScript charts for any data set.
C3.js D3-based reusable chart library.
D3.js A JavaScript library for manipulating documents based on data.
- D3Plus D3's simpler, easier to use cousin. Mostly predefined templates that you can just plug data into.
SmoothieCharts A JavaScript Charting Library for Streaming Data.
PyXley Python helpers for building dashboards using Flask and React
Plotly Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python
Apache Superset Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

Workflow

Luigi Luigi is a Python module that helps you build complex pipelines of batch jobs.
- CronQ An application cron-like system. Used with Luigi
Cascading Java based application development platform.
Airflow Airflow is a system to programmatically author, schedule and monitor data pipelines.
Azkaban Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.
Oozie Oozie is a workflow scheduler system to manage Apache Hadoop jobs
Pinball DAG based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.

ELK Elasticsearch Logstash Kibana

docker-logstash A highly configurable logstash (1.4.4) docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).
elasticsearch-jdbc JDBC importer for Elasticsearch
ZomboDB Postgres Extension that allows creating an index backed by Elasticsearch

Docker

Gockerize Package golang service into minimal docker containers
Flocker Easily manage Docker containers & their data
Rancher RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers
Kontena Application Containers for Masses
Weave Weaving Docker containers into applications http://www.weave.works/
Zodiac A lightweight tool for easy deployment and rollback of dockerized applications
cAdvisor Analyzes resource usage and performance characteristics of running containers
Micro S3 persistence Docker microservice for saving/restoring volume data to S3
Dockup Docker image to backup/restore your Docker container volumes to AWS S3
Rocker-compose Docker composition tool with idempotency features for deploying apps composed of multiple containers.
Nomad Nomad is a cluster manager, designed for both long lived services and short lived batch processing workloads
ImageLayers Visualize Docker images and the layers that compose them

Datasets

Realtime

Instagram Realtime Real-time photo updates provide your application with instant notifications of new photos as they are posted on Instagram.
Twitter Realtime The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data.
Firebase Realtime Airport delays, Parking, Cryptocurrencies, Earthquakes, Transit, Weather
Eventsim Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
Reddit Real-time data is available including comments, submissions and links posted to reddit

Data Dumps

GitHub Archive GitHub's public timeline since 2011, updated every hour
Common Crawl Open source repository of web crawl data
Wikipedia Wikipedia's complete copy of all wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.

Monitoring

Prometheus

Prometheus.io An open-source service monitoring system and time series database
HAProxy Exporter Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption

Community

Forums

/r/dataengineering News, tips and background on Data Engineering
/r/etl Subreddit focused on ETL

Conferences

DataEngConf DataEngConf is the first technical conference that bridges the gap between data scientists, data engineers and data analysts.

Cheers to The Data Engineering Ecosystem: An Interactive Map

Inspired by the awesome list. Created by Insight Data Engineering fellows.

License

To the extent possible under law, Igor Barinov has waived all copyright and related or neighboring rights to this work.
