
List of some interesting projects


This is an attempt to list interesting projects.

It is intended for anyone who designs modern large-scale architectures and needs to choose tools, technologies, or frameworks. The purpose is to help with those choices by collecting comparisons, use cases, features, maturity notes, or anything else that supports an informed decision.

TODO: Add links and licenses.


##Distributed Coordination

These are implementations and libraries that help in writing distributed applications which require some form of coordination.

##Infrastructure Management


##File Systems

##Distributed Databases

##Infrastructure Logging/Monitoring

##Infrastructure Helpers

MultiCloud/CrossCloud utilities



##Generalized Data Processing


  • Tez vs Dryad
  • Hadoop vs Spark - Too many differences, no good link.

##Large-scale Distributed ML

##Pub-sub / Messaging

##Data Ingest

##Graph Storing and/or Processing

##SQL Engines

##Stream Processing


##Performance Analysis

##Workflow engines/DAG-executors/Pipelines


##Configuration Management

##Service Discovery





  • Zoie
  • Norbert - cluster manager and networking layer built on top of ZooKeeper.
  • Okapi - Large-scale ML & graph analytics on Giraph
  • Scalding - A Scala API for Cascading
  • Summingbird - Streaming MapReduce with Scalding and Storm
  • Curator - set of Java libraries that make using Apache ZooKeeper much easier
  • Turbine - Low latency high throughput aggregator for real time streams
  • DataFu - Collection of MapReduce libraries
  • Twill (previously known as Weave) - library for writing YARN applications



  • Nutch - web crawler
  • Ambari - Hadoop Deployment + Management
  • Bigtop - Hadoop Packaging
  • Skuld
  • Camus - LinkedIn's Kafka to HDFS pipeline.
  • Kiji - collect, analyze and serve data in real time on Apache Hadoop and HBase