Awesome Big Data
A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.
Your contributions are always welcome!
- Awesome Big Data
- RDBMS
- Frameworks
- Distributed Programming
- Distributed Filesystem
- Distributed Index
- Document Data Model
- Key Map Data Model
- Key-value Data Model
- Graph Data Model
- Columnar Databases
- NewSQL Databases
- Time-Series Databases
- SQL-like processing
- Data Ingestion
- Service Programming
- Scheduling
- Machine Learning
- Benchmarking
- Security
- System Deployment
- Applications
- Search engine and framework
- MySQL forks and evolutions
- PostgreSQL forks and evolutions
- Memcached forks and evolutions
- Embedded Databases
- Business Intelligence
- Data Visualization
- Internet of things and sensor data
- Interesting Readings
- Interesting Papers
- Videos
- Books
- Other Awesome Lists
RDBMS
- MySQL - The world's most popular open source database.
- PostgreSQL - The world's most advanced open source database.
- Oracle Database - object-relational database management system.
- Teradata - high-performance MPP data warehouse platform.
Frameworks
- Bistro - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.
- IBM Streams - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.).
- Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
- Tigon - High Throughput Real-time Stream Processing Framework.
- Pachyderm - a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.
- Polyaxon - A platform for reproducible and scalable machine learning and deep learning.
- Smooks - An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.
Distributed Programming
- AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
- AMPLab SIMR - run Spark on Hadoop MapReduce v1.
- Apache APEX - a unified, enterprise platform for big data stream and batch processing.
- Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
- Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
- Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
- Apache Flink - high-performance runtime, and automatic program optimization.
- Apache Gearpump - real-time big data streaming engine based on Akka.
- Apache Gora - framework for in-memory data model and persistence.
- Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
- Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Apache Pig - high level language to express data analysis programs for Hadoop.
- Apache REEF - retainable evaluator execution framework to simplify and unify the lower layers of big data systems.
- Apache S4 - framework for stream processing, implementation of S4.
- Apache Spark - framework for in-memory cluster computing.
- Apache Spark Streaming - framework for stream processing, part of Spark.
- Apache Storm - framework for stream processing by Twitter, also runs on YARN.
- Apache Samza - stream processing framework, based on Kafka and YARN.
- Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
- Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
- Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
- Cascalog - data processing and querying library.
- Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
- Concurrent Cascading - framework for data management/analytics on Hadoop.
- Damballa Parkour - MapReduce library for Clojure.
- Datasalt Pangool - alternative MapReduce paradigm.
- DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
- Facebook Corona - Hadoop enhancement which removes single point of failure.
- Facebook Peregrine - Map Reduce framework.
- Facebook Scuba - distributed in-memory datastore.
- Google Dataflow - create data pipelines to ingest, transform and analyze data.
- Google MapReduce - map reduce framework.
- Google MillWheel - fault tolerant stream processing framework.
- IBM Streams - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
- JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
- Kite - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
- Metamarkets Druid - framework for real-time analysis of large datasets.
- Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
- Nokia Disco - MapReduce framework developed by Nokia.
- Onyx - Distributed computation for the cloud.
- Pinterest Pinlater - asynchronous job execution system.
- Pydoop - Python MapReduce and HDFS API for Hadoop.
- Ray - A fast and simple framework for building and running distributed applications.
- Rackerlabs Blueflood - multi-tenant distributed metric processing system.
- Skale - High performance distributed data processing in NodeJS.
- Stratosphere - general purpose cluster computing framework.
- Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
- streamsx.topology - Libraries to enable building IBM Streams applications in Java, Python or Scala.
- Tuktu - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
- Twitter Heron - a realtime, distributed, fault-tolerant stream processing engine from Twitter replacing Storm.
- Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
- Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
- Twitter TSAR - TimeSeries AggregatoR by Twitter.
- Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
Distributed Filesystem
- Ambry - a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.
- Apache HDFS - a way to store large files across multiple machines.
- Apache Kudu - Hadoop's storage layer to enable fast analytics on fast data.
- BeeGFS - formerly FhGFS, parallel distributed file system.
- Ceph Filesystem - software storage platform.
- Disco DDFS - distributed filesystem.
- Facebook Haystack - object storage system.
- Google GFS - distributed filesystem.
- Google Megastore - scalable, highly available storage.
- GridGain - GGFS, Hadoop compliant in-memory file system.
- Lustre file system - high-performance distributed filesystem.
- Microsoft Azure Data Lake Store - HDFS-compatible storage in Azure cloud.
- Quantcast File System QFS - open-source distributed file system.
- Red Hat GlusterFS - scale-out network-attached storage file system.
- Seaweed-FS - simple and highly scalable distributed file system.
- Alluxio - reliable file sharing at memory speed across cluster frameworks.
- Tahoe-LAFS - decentralized cloud storage system.
- Baidu File System - distributed filesystem.
Distributed Index
- Pilosa - Open source distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.
Document Data Model
- Actian Versant - commercial object-oriented database management system.
- Crate Data - an open source massively scalable data store. It requires zero administration.
- Facebook Apollo - Facebook's Paxos-like NoSQL database.
- jumboDB - document oriented datastore over Hadoop.
- LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
- MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
- Microsoft Azure DocumentDB - NoSQL cloud database service with protocol support for MongoDB.
- MongoDB - Document-oriented database system.
- RavenDB - A transactional, open-source Document Database.
- RethinkDB - document database that supports queries like table joins and group by.
Key Map Data Model
Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as "columns").
Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory. Rather than storing data the traditional way, where all column values for a given key are stored next to each other, "row by row", these systems store all values of a given column next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.
The former group is referred to as "key map data model" here. The line between these and the Key-value Data Model stores is fairly blurry.
The latter, being more about the storage format than about the data model, is listed under Columnar Databases.
You can read more about this distinction on Prof. Daniel Abadi's blog: Distinguishing two major types of Column Stores.
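To make the distinction concrete, here is a minimal sketch in plain Python. It is not taken from any of the systems listed here: the sample records, the function names and the `profile` column family are all hypothetical and chosen only for illustration. It shows the same data arranged "row by row", column by column, and as a key-map structure.

```python
# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "name": "ada", "age": 36},
    {"id": 2, "name": "bob", "age": 41},
]


def to_row_layout(rows):
    """Row-by-row storage: all column values for one key sit next to each other."""
    return [tuple(row.values()) for row in rows]


def to_column_layout(rows):
    """Column-by-column storage: all values of one column sit next to each other."""
    columns = {}
    for row in rows:
        for column, value in row.items():
            columns.setdefault(column, []).append(value)
    return columns


def to_key_map(rows, family="profile"):
    """Key-map data model: key -> column family -> map of column -> value."""
    return {
        row["id"]: {family: {c: v for c, v in row.items() if c != "id"}}
        for row in rows
    }


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)
key_map = to_key_map(records)

# Reading one whole record is a single lookup in the row layout...
print(row_layout[0])
# ...while reading every value of one column is a single lookup in the column layout.
print(column_layout["age"])
# In the key-map model, a key leads to a named map of column/value pairs.
print(key_map[2]["profile"])
```

The first two functions differ only in physical arrangement, which is why the second group is listed by storage format. The third changes the shape of the data itself, which is why it is treated as a data model.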
- Apache Accumulo - distributed key/value store, built on Hadoop.
- Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
- Apache HBase - column-oriented distributed datastore, inspired by BigTable.
- Baidu Tera - an Internet-scale database, inspired by BigTable.
- Facebook HydraBase - evolution of HBase made by Facebook.
- Google BigTable - column-oriented distributed datastore.
- Google Cloud Datastore - a fully managed, schemaless database for storing non-relational data over BigTable.
- Hypertable - column-oriented distributed datastore, inspired by BigTable.
- InfiniDB - accessed through a MySQL interface and uses massively parallel processing to parallelize queries.
- Tephra - Transactions for HBase.
- Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
- ScyllaDB - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.
Key-value Data Model
- Aerospike - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
- Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
- Badger - a fast, simple, efficient, and persistent key-value store written natively in Go.
- Bolt - an embedded key-value database for Go.
- BTDB - Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more.
- BuntDB - a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
- Edis - a protocol-compatible server replacement for Redis.
- ElephantDB - Distributed database specialized in exporting data from Hadoop.
- EventStore - distributed time series database.
- GhostDB - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
- Graviton - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
- GridDB - suitable for sensor data stored in a timeseries.
- HyperDex - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
- Ignite - an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.
- LinkedIn Krati - a simple persistent data store with very low latency and high throughput.
- Linkedin Voldemort - distributed key/value storage system.
- Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
- Redis - in memory key value datastore.
- Riak - a decentralized datastore.
- Storehaus - library to work with asynchronous key value stores, by Twitter.
- SummitDB - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
- Tarantool - an efficient NoSQL database and a Lua application server.
- TiKV - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
- Tile38 - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON.
- TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.
Graph Data Model
- AgensGraph - a new generation multi-model graph database for the modern complex data environment.
- Apache Giraph - implementation of Pregel, based on Hadoop.
- Apache Spark Bagel - implementation of Pregel, part of Spark.
- ArangoDB - multi model distributed database.
- DGraph - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.
- EliasDB - a lightweight graph based database that does not require any third-party libraries.
- Facebook TAO - the distributed data store that is widely used at Facebook to store and serve the social graph.
- GCHQ Gaffer - a framework by GCHQ that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
- Google Cayley - open-source graph database.
- Google Pregel - graph processing framework.
- GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
- GraphX - resilient Distributed Graph System on Spark.
- Gremlin - graph traversal language.
- Infovore - RDF-centric Map/Reduce framework.
- Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
- JanusGraph - open-source, distributed graph database with multiple options for storage backends (Bigtable, HBase, Cassandra, etc.) and indexing backends (Elasticsearch, Solr, Lucene).
- MapGraph - Massively Parallel Graph processing on GPUs.
- Microsoft Graph Engine - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
- Neo4j - graph database written entirely in Java.
- OrientDB - document and graph database.
- Phoebus - framework for large scale graph processing.
- Titan - distributed graph database, built over Cassandra.
- Twitter FlockDB - distributed graph database.
- NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.
Columnar Databases
Note: please read the note in the Key Map Data Model section.
- Columnar Storage - an explanation of what columnar storage is and when you might want it.
- Actian Vector - column-oriented analytic database.
- ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.
- EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
- MonetDB - column store database.
- Parquet - columnar storage format for Hadoop.
- Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
- Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
- SQream DB - A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.
- Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.
- Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
- IndexR - an open-source columnar storage format for fast, real-time analytics on big data.
- LocustDB - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
NewSQL Databases
- Actian Ingres - commercially supported, open-source SQL relational database management system.
- ActorDB - a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
- Amazon RedShift - data warehouse service, based on PostgreSQL.
- BayesDB - statistic oriented SQL database.
- Bedrock - a simple, modular, networked and distributed transaction layer built atop SQLite.
- CitusDB - scales out PostgreSQL through sharding and replication.
- Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
- Comdb2 - a clustered RDBMS built on optimistic concurrency control techniques.
- Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
- FoundationDB - distributed database, inspired by F1.
- Google F1 - distributed SQL database built on Spanner.
- Google Spanner - globally distributed semi-relational database.
- H-Store - an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
- Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
- HandlerSocket - NoSQL plugin for MySQL/MariaDB.
- InfiniSQL - infinitely scalable RDBMS.
- KarelDB - a relational database backed by Apache Kafka.
- Map-D - GPU in-memory database, big data analysis and visualization platform.
- MemSQL - in memory SQL database with optimized columnar storage on flash.
- NuoDB - SQL/ACID compliant distributed database.
- Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
- Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
- SAP HANA - an in-memory, column-oriented, relational database management system.
- SenseiDB - distributed, realtime, semi-structured database.
- Sky - database used for flexible, high performance analysis of behavioral data.
- SymmetricDS - open source software for both file and database synchronization.
- TiDB - a distributed SQL database, inspired by the design of Google F1.
- VoltDB - claims to be fastest in-memory database.
- yugabyteDB - open source, high-performance, distributed SQL database compatible with PostgreSQL.
Time-Series Databases
- Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
- Chronix - a time series storage built to store time series highly compressed and for fast access times.
- Cube - uses MongoDB to store time series data.
- Heroic - a scalable time series database based on Cassandra and Elasticsearch.
- InfluxDB - a time series database with optimised IO and queries, supports pgsql and influx wire protocols.
- QuestDB - high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.
- IronDB - scalable, general-purpose time series database.
- Kairosdb - similar to OpenTSDB but allows for Cassandra.
- M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.
- Newts - a time series database based on Apache Cassandra.
- TDengine - a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data.
- OpenTSDB - distributed time series database on top of HBase.
- Prometheus - a time series database and service monitoring system.
- Beringei - Facebook's in-memory time-series database.
- TrailDB - an efficient tool for storing and querying series of events.
- Druid - Column oriented distributed data store ideal for powering interactive applications.
- Riak-TS - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
- Akumuli - a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
- Rhombus - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
- Dalmatiner DB - Fast distributed metrics database.
- Blueflood - A distributed system designed to ingest and process time series data.
- Timely - a time series database application that provides secure access to time series data based on Accumulo and Grafana.
- SiriDB - Highly-scalable, robust and fast, open source time series database with cluster functionality.
- Thanos - a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
- VictoriaMetrics - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included.
SQL-like processing
- Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
- Apache Drill - framework for interactive analysis, inspired by Dremel.
- Apache HCatalog - table and storage management layer for Hadoop.
- Apache Hive - SQL-like data warehouse system for Hadoop.
- Apache Calcite - framework that allows efficient translation of queries involving heterogeneous and federated data.
- Apache Phoenix - SQL skin over HBase.
- Aster Database - SQL-like analytic processing for MapReduce.
- Cloudera Impala - framework for interactive analysis, inspired by Dremel.
- Concurrent Lingual - SQL-like query language for Cascading.
- Datasalt Splout SQL - full SQL query engine for big datasets.
- Dremio - an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.
- Facebook PrestoDB - distributed SQL query engine.
- Google BigQuery - framework for interactive analysis, implementation of Dremel.
- Materialize - a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
- Invantive SQL - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
- PipelineDB - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
- Pivotal HDB - SQL-like data warehouse system for Hadoop.
- RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
- Spark Catalyst - a Query Optimization Framework for Spark and Shark.
- SparkSQL - Manipulating Structured Data Using Spark.
- Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
- Stinger - interactive query for Hive.
- Tajo - distributed data warehouse system on Hadoop.
- Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
Data Ingestion
- redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
- Amazon Kinesis - real-time processing of streaming data at massive scale.
- Amazon Web Services Glue - serverless fully managed extract, transform, and load (ETL) service.
- Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
- Apache Chukwa - data collection system.
- Apache Flume - service to manage large amounts of log data.
- Apache Kafka - distributed publish-subscribe messaging system.
- Apache NiFi - an integrated data logistics platform for automating the movement of data between disparate systems.
- Apache Pulsar - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
- Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
- Embulk - open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
- Facebook Scribe - streamed log data aggregator.
- Fluentd - tool to collect events and logs.
- Gazette - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
- Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
- Heka - open source stream processing software system.
- HIHO - framework for connecting disparate data sources with Hadoop.
- Kestrel - distributed message queue system.
- LinkedIn Databus - stream of change capture events for a database.
- LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
- LinkedIn White Elephant - log aggregator and dashboard.
- Logstash - a tool for managing events and logs.
- Netflix Suro - log aggregator like Storm and Samza, based on Chukwa.
- Pinterest Secor - a service implementing Kafka log persistence.
- Linkedin Gobblin - LinkedIn's universal data ingestion framework.
- Skizze - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
- StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
- Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
- RudderStack - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
- Zilla - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.
Service Programming
- Akka Toolkit - runtime for distributed, and fault tolerant event-driven applications on the JVM.
- Apache Avro - data serialization system.
- Apache Curator - Java libraries for Apache ZooKeeper.
- Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
- Apache Thrift - framework to build binary protocols.
- Apache Zookeeper - centralized service for process management.
- Google Chubby - a lock service for loosely-coupled distributed systems.
- Hydrosphere Mist - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
- Linkedin Norbert - cluster manager.
- Mara - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow.
- OpenMPI - message passing framework.
- Serf - decentralized solution for service discovery and orchestration.
- Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
- Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
- Twitter Elephant Bird - libraries for working with LZOP-compressed data.
- Twitter Finagle - asynchronous network stack for the JVM.
Scheduling
- Apache Airflow - a platform to programmatically author, schedule and monitor workflows.
- Apache Aurora - a service scheduler that runs on top of Apache Mesos.
- Apache Falcon - data management framework.
- Apache Oozie - workflow job scheduler.
- Azure Data Factory - cloud-based pipeline orchestration for on-prem, cloud and HDInsight.
- Chronos - distributed and fault-tolerant scheduler.
- Cronicle - Distributed, easy to install, NodeJS based, task scheduler.
- Dagster - a data orchestrator for machine learning, analytics, and ETL.
- Linkedin Azkaban - batch workflow job scheduler.
- Schedoscope - Scala DSL for agile scheduling of Hadoop jobs.
- Sparrow - scheduling platform.
Machine Learning
- ๐ Azure ML Studio - Cloud-based AzureML, R, Python Machine Learning platform
8004โญ
974๐ด
brain) - Neural networks in JavaScript.1790โญ
411๐ด
Oryx) - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.- Concurrent Pattern - machine learning library for Cascading.
10692โญ
2077๐ด
convnetjs) - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.?โญ
?๐ด
DataVec) - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.- Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
384โญ
62๐ด
Decider) - Flexible and Extensible Machine Learning in Ruby.- ENCOG - machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
- etcML - text classification with machine learning.
360โญ
61๐ด
Etsy Conjecture) - scalable Machine Learning in Scalding.5092โญ
909๐ด
Feast) - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.- ๐ GraphLab Create - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
?โญ
?๐ด
H2O) - statistical, machine learning and math runtime with Hadoop. R and Python.2042โญ
236๐ด
Karate Club) - An unsupervised machine learning library for graph structured data. Python60273โญ
19530๐ด
Keras) - An intuitive neural net API inspired by Torch that runs atop Theano and Tensorflow.1โญ
18๐ด
Lambdo) - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.687โญ
53๐ด
Little Ball of Fur) - A subsampling library for graph structured data. Python- Mahout - An Apache-backed machine learning library for Hadoop.
- MLbase - distributed machine learning libraries for the BDAS stack.
- MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
- ML Workspace - All-in-one web-based IDE specialized for machine learning and data science.
- MOA - MOA performs big data stream mining in real time, and large scale machine learning.
- MonkeyLearn - Text mining made easy. Extract and classify data from text.
- ND4J - A matrix library for the JVM. Numpy for Java.
- nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
- PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
- PyTorch Geometric Temporal - a temporal extension library for PyTorch Geometric.
- RL4J - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.
- SAMOA - distributed streaming machine learning framework.
- scikit-learn - scikit-learn: machine learning in Python.
- Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
- Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
- Sibyl - System for Large Scale Machine Learning at Google.
- TensorFlow - Library from Google for machine learning using data flow graphs.
- Theano - A Python-focused machine learning library supported by the University of Montreal.
- Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
- Velox - System for serving machine learning predictions.
- Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
- WEKA - suite of machine learning software.
- BidMach - CPU and GPU-accelerated Machine Learning Library.
Benchmarking
- Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.
- Berkeley SWIM Benchmark - real-world big data workload benchmark.
- Intel HiBench - a Hadoop benchmark suite.
- PUMA Benchmarking - benchmark suite for MapReduce applications.
- Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.
- Deeplearning4j Benchmarks
- UCSB - extended Yahoo Cloud Serving Benchmark for NoSQL databases.
Security
- Apache Ranger - Central security admin & fine-grained authorization for Hadoop
- Apache Eagle - real time monitoring solution
- Apache Knox Gateway - single point of secure access for Hadoop clusters.
- Apache Sentry - security module for data stored in Hadoop.
- BDA - The vulnerability detector for Hadoop and Spark
System Deployment
- Apache Ambari - operational framework for Hadoop management.
- Apache Bigtop - system deployment framework for the Hadoop ecosystem.
- Apache Helix - cluster management framework.
- Apache Mesos - cluster manager.
- Apache Slider - is a YARN application to deploy existing distributed applications on YARN.
- Apache Whirr - set of libraries for running cloud services.
- Apache YARN - Cluster manager.
- Brooklyn - library that simplifies application deployment and management.
- Buildoop - Similar to Apache BigTop based on Groovy language.
- Cloudera HUE - web application for interacting with Hadoop.
- Facebook Prism - multi-datacenter replication system.
- Google Borg - job scheduling and monitoring system.
- Google Omega - job scheduling and monitoring system.
- Hortonworks HOYA - application that can deploy HBase cluster on YARN.
- Kubernetes - a system for automating deployment, scaling, and management of containerized applications.
- Marathon - Mesos framework for long-running services.
- Linkis - Linkis helps easily connect to various back-end computation/storage engines.
Applications
- 411 - a web application for alert management resulting from scheduled searches into Elasticsearch.
- Adobe spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
- Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
- Apache Nutch - open source web crawler.
- Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.
- Apache Tika - content analysis toolkit.
- Argus - Time series monitoring and alerting platform.
- AthenaX - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
- Atlas - a backend for managing dimensional time series data.
- Countly - open source mobile and web analytics platform, based on Node.js & MongoDB.
- Domino - Run, scale, share, and deploy models without any infrastructure.
- Eclipse BIRT - Eclipse-based reporting system.
- ElastAlert - ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
- Eventhub - open source event analytics platform.
- HASH - open source simulation and visualization platform.
- Hermes - asynchronous message broker built on top of Kafka.
- Hunk - Splunk analytics for Hadoop.
- Imhotep - Large scale analytics platform by Indeed.
- Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
- Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
- MADlib - data-processing library of an RDBMS to analyze data.
- Kapacitor - an open source framework for processing, monitoring, and alerting on time series data.
- Kylin - open source Distributed Analytics Engine from eBay.
- PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
- Rakam - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
- Qubole - auto-scaling Hadoop cluster, built-in data connectors.
- SnappyData - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
- Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
- SparkR - R frontend for Spark.
- Splunk - analyzer for machine-generated data.
- Sumo Logic - cloud based analyzer for machine-generated data.
- Substation - Substation is a cloud native data pipeline and transformation toolkit written in Go.
- Talend - unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig.
Search engine and framework
- Apache Lucene - Search engine library.
- Apache Solr - Search platform for Apache Lucene.
- Elassandra - is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
- ElasticSearch - Search and analytics engine based on Apache Lucene.
- Enigma.io - Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
- Google Caffeine - continuous indexing system.
- Google Percolator - continuous indexing system.
- HBase Coprocessor - implementation of Percolator, part of HBase.
- Lily HBase Indexer - quickly and easily search for any content stored in HBase.
- LinkedIn Bobo - is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.
- LinkedIn Cleo - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
- LinkedIn Galene - search architecture at LinkedIn.
- LinkedIn Zoie - is a realtime search/indexing system written in Java.
- MG4J - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.
- Sphinx Search Server - fulltext search engine.
- Vespa - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
- Facebook Faiss - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
- Annoy - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
- Weaviate - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
MySQL forks and evolutions
- Amazon RDS - MySQL databases in Amazon's cloud.
- Drizzle - evolution of MySQL 6.0.
- Google Cloud SQL - MySQL databases in Google's cloud.
- MariaDB - enhanced, drop-in replacement for MySQL.
- MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
- Percona Server - enhanced, drop-in replacement for MySQL.
- ProxySQL - High Performance Proxy for MySQL.
- TokuDB - TokuDB is a storage engine for MySQL and MariaDB.
- WebScaleSQL - is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
PostgreSQL forks and evolutions
- HadoopDB - hybrid of MapReduce and DBMS.
- IBM Netezza - high-performance data warehouse appliances.
- Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
- RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
- Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
- Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.
- TimescaleDB - An open-source time-series database optimized for fast ingest and complex queries.
- PipelineDB - The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
Memcached forks and evolutions
- Facebook McDipper - key/value cache for flash storage.
- Facebook Memcached - fork of Memcache.
- Twemproxy - A fast, light-weight proxy for memcached and redis.
- Twitter Fatcache - key/value cache for flash storage.
- Twitter Twemcache - fork of Memcache.
Embedded Databases
- Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
- BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
- HanoiDB - Erlang LSM BTree Storage.
- LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
- LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
- RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.
Business Intelligence
- BIME Analytics - business intelligence platform in the cloud.
- Blazer - business intelligence made simple.
- Chartio - lean business intelligence platform to visualize and explore your data.
- Count - notebook-based analytics and visualisation platform using SQL or drag-and-drop.
- datapine - self-service business intelligence tool in the cloud.
- Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
- GoodData - platform for data products and embedded analytics.
- Jaspersoft - powerful business intelligence suite.
- Jedox Palo - customisable Business Intelligence platform.
- Jethrodata - Interactive Big Data Analytics.
- intermix.io - Performance Monitoring for Amazon Redshift
- Metabase - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
- Microsoft - business intelligence software and platform.
- Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
- Numeracy - Fast, clean SQL client and business intelligence.
- Pentaho - business intelligence platform.
- Qlik - business intelligence and analytics platform.
- Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
- Saiku Analytics - Open source analytics platform.
- Knowage - open source business intelligence platform (formerly SpagoBI).
- SparklineData SNAP - modern BI platform powered by Apache Spark.
- Tableau - business intelligence platform.
- Zoomdata - Big Data Analytics.
Data Visualization
- Airpal - Web UI for PrestoDB.
- AnyChart - fast, simple and flexible JavaScript (HTML5) charting library featuring pure JS API.
- Arbor - graph visualization library using web workers and jQuery.
- Banana - visualize logs and time-stamped data stored in Solr. Port of Kibana.
- Bloomery - Web UI for Impala.
- Bokeh - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
- C3 - D3-based reusable chart library
- CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
- chartd - responsive, retina-compatible charts with just an img tag.
- Chart.js - open source HTML5 Charts visualizations.
- Chartist.js - another open source HTML5 Charts visualization.
- Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
- Cubism - JavaScript library for time series visualization.
- Cytoscape - JavaScript library for visualizing complex networks.
- DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
- D3 - JavaScript library for manipulating documents.
- D3.compose - Compose complex, data-driven visualizations from reusable charts and components.
- D3Plus - A fairly robust set of reusable charts and styles for d3.js.
- Dash - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required.
- Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
- DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
- Echarts - Baidu's enterprise charts.
- Envisionjs - dynamic HTML5 visualization.
- FnordMetric - write SQL queries that return SVG charts rather than tables
- Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
- Freeboard - open source real-time dashboard builder for IOT and other web mashups.
- Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
- Google Charts - simple charting API.
- Grafana - graphite dashboard frontend, editor and graph composer.
- Graphite - scalable Realtime Graphing.
- Highcharts - simple and flexible charting API.
- IPython - provides a rich architecture for interactive computing.
- Kibana - visualize logs and time-stamped data
- Lumify - open source big data analysis and visualization platform
- Matplotlib - plotting with Python.
- Metricsgraphic.js - a library built on top of D3 that is optimized for time-series data
- NVD3 - chart components for d3.js.
- Peity - Progressive SVG bar, line and pie charts.
- Plot.ly - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
- Plotly.js - The open source JavaScript graphing library that powers plotly.
- Recline - simple but powerful library for building data applications in pure JavaScript and HTML.
- Redash - open-source platform to query and visualize data.
- ReCharts - A composable charting library built on React components
- Shiny - a web application framework for R.
- Sigma.js - JavaScript library dedicated to graph drawing.
- Superset - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
- Vega - a visualization grammar.
- Zeppelin - a notebook-style collaborative data analysis.
- Zing Charts - JavaScript charting library for big data.
- DataSphere Studio - one-stop data application development management portal.
Internet of things and sensor data
- Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
- Azure IoT Hub - Cloud-based bi-directional monitoring and messaging hub
- TempoIQ - Cloud-based sensor analytics.
- 2lemetry - Platform for Internet of things.
- Pubnub - Data stream network
- ThingWorx - Rapid development and connection of intelligent systems
- IFTTT - If this then that
- Evrything - Making products smart
- NetLytics - Analytics platform to process network data on Spark.
- Ably - Pub/sub messaging platform for IoT
Interesting Readings
- Big Data Benchmark - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
- NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
- Monitoring Kafka performance - Guide to monitoring Apache Kafka, including native methods for metrics collection.
- Monitoring Hadoop performance - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
- Monitoring Cassandra performance - Guide to monitoring Cassandra, including native methods for metrics collection.
Interesting Papers
2015 - 2016
- 2015 - Facebook - One Trillion Edges: Graph Processing at Facebook-Scale.
2013 - 2014
- 2014 - Stanford - Mining of Massive Datasets.
- 2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
- 2013 - AMPLab - MLbase: A Distributed Machine-learning System.
- 2013 - AMPLab - Shark: SQL and Rich Analytics at Scale.
- 2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
- 2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
- 2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
- 2013 - Metamarkets - Druid: A Real-time Analytical Data Store.
- 2013 - Google - Online, Asynchronous Schema Change in F1.
- 2013 - Google - F1: A Distributed SQL Database That Scales.
- 2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
- 2013 - Facebook - Scuba: Diving into Data at Facebook.
- 2013 - Facebook - Unicorn: A System for Searching the Social Graph.
- 2013 - Facebook - Scaling Memcache at Facebook.
2011 - 2012
- 2012 - Twitter - The Unified Logging Infrastructure for Data Analytics at Twitter.
- 2012 - AMPLab - Blink and It's Done: Interactive Queries on Very Large Data.
- 2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.
- 2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
- 2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
- 2012 - Microsoft - Paxos Made Parallel.
- 2012 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
- 2012 - Google - Processing a trillion cells per mouse click.
- 2012 - Google - Spanner: Google's Globally-Distributed Database.
- 2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
- 2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
- 2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
2001 - 2010
- 2010 - Facebook - Finding a needle in Haystack: Facebook's photo storage.
- 2010 - AMPLab - Spark: Cluster Computing with Working Sets.
- 2010 - Google - Pregel: A System for Large-Scale Graph Processing.
- 2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
- 2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
- 2010 - Yahoo - S4: Distributed Stream Computing Platform.
- 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
- 2008 - AMPLab - Chukwa: A large-scale monitoring system.
- 2007 - Amazon - Dynamo: Amazon's Highly Available Key-value Store.
- 2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
- 2006 - Google - Bigtable: A Distributed Storage System for Structured Data.
- 2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
- 2003 - Google - The Google File System.
Videos
- Spark in Motion - Spark in Motion teaches you how to use Spark for batch and streaming data analytics.
- Machine Learning, Data Science and Deep Learning with Python - LiveVideo tutorial that covers machine learning, Tensorflow, artificial intelligence, and neural networks.
- Data warehouse schema design - dimensional modeling and star schema - Introduction to schema design for data warehouse using the star schema method.
- Elasticsearch 7 and Elastic Stack - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.
Books
Streaming
- Data Science at Scale with Python and Dask - Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data.
- Streaming Data - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
- Storm Applied - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
- Fundamentals of Stream Processing: Application Design, Systems, and Analytics - This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
- Stream Data Processing: A Quality of Service Perspective - Presents a new paradigm suitable for stream and complex event processing.
- Unified Log Processing - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business
- Kafka Streams in Action - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.
- Big Data - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.
- Spark in Action & Spark in Action 2nd Ed. - Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
- Kafka in Action - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
- Fusion in Action - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
- Reactive Data Handling - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads. Free eBook!
- Azure Data Engineering - A book about data engineering in general and the Azure platform specifically
- Grokking Streaming Systems - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they're right for your business. Written to be tool-agnostic, you'll be able to apply what you learn no matter which framework you choose.
Distributed systems
- Distributed Systems for fun and profit - Theory of distributed systems. Includes parts about time and ordering, replication and impossibility results.
Graph Based approach
- Graph-Powered Machine Learning - Alessandro Negro. Combine graph theory and models to improve machine learning projects
Data Visualization
- The beauty of data visualization
- Designing Data Visualizations with Noah Iliinsky
- Hans Rosling's 200 Countries, 200 Years, 4 Minutes
- Ice Bucket Challenge Data Visualization
Other Awesome Lists
- Other awesome lists awesome-awesomeness.
- Even more lists awesome.
- Another list? list.
- WTF! awesome-awesome-awesome.
- Analytics awesome-analytics.
- Public Datasets awesome-public-datasets.
- Graph Classification awesome-graph-classification.
- Network Embedding awesome-network-embedding.
- Community Detection awesome-community-detection.
- Decision Tree Papers awesome-decision-tree-papers.
- Fraud Detection Papers awesome-fraud-detection-papers.
- Gradient Boosting Papers awesome-gradient-boosting-papers.
- Monte Carlo Tree Search Papers awesome-monte-carlo-tree-search-papers.
- Kafka awesome-kafka.
- Google Bigtable.
Source
0xnr/awesome-bigdata