Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.

Hadoop

  • Apache Hadoop - Apache Hadoop
  • Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop
  • SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
  • GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
  • Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
  • dumbo - Python module that allows you to easily write and run Hadoop programs.
  • hadoopy - Python MapReduce library written in Cython.
  • mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
  • pydoop - Pydoop is a package that provides a Python API for Hadoop.
  • hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
  • White Elephant - Hadoop log aggregator and dashboard
  • Kiji Project
  • Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
  • Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
  • Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
  • Apache Ignite - Distributed in-memory platform

YARN

  • Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
  • Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
  • mpich2-yarn - Running MPICH2 on Yarn

NoSQL

Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.

  • Apache HBase - Apache HBase
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • happybase - A developer-friendly Python library to interact with Apache HBase.
  • Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
  • Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase.
  • hindex - Secondary Index for HBase
  • Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
  • OpenTSDB - The Scalable Time Series Database
  • Apache Cassandra

SQL on Hadoop

  • Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • Apache HAWQ (incubating) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop.
  • Lingual - SQL interface for Cascading (MR/Tez job generator)
  • Cloudera Impala
  • Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
  • Apache Tajo - Data warehouse system for Apache Hadoop
  • Apache Drill - Schema-free SQL Query Engine
  • Apache Trafodion

Data Management

  • Apache Calcite - A Dynamic Data Management Framework
  • Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies

Workflow, Lifecycle and Governance

  • Apache Oozie - Apache Oozie
  • Azkaban
  • Apache Falcon - Data management and processing platform
  • Apache NiFi - A dataflow system
  • Apache Airflow - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines.
  • Luigi - Python package that helps you build complex pipelines of batch jobs

Data Ingestion and Integration

DSL

  • Apache Pig - Apache Pig
  • Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
  • vahara - Machine learning and natural language processing with Apache Pig
  • packetpig - Open Source Big Data Security Analytics
  • akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
  • seqpig - Simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop
  • Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
  • PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Libraries and Tools

Realtime Data Processing

Distributed Computing and Programming

Packaging, Provisioning and Monitoring

  • Apache Bigtop - Packaging and tests of the Apache Hadoop ecosystem
  • Apache Ambari - Apache Ambari
  • Ganglia Monitoring System
  • ankush - A big data cluster management tool that creates and manages clusters of different technologies.
  • Apache ZooKeeper - Apache ZooKeeper
  • Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
  • Buildoop - Hadoop Ecosystem Builder
  • Deploop - The Hadoop Deploy System
  • Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
  • inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.

Search

Search Engine Framework

  • Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.

Security

  • Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
  • Apache Sentry - An authorization module for Hadoop
  • Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.

Benchmark

  • Big Data Benchmark
  • HiBench
  • Big-Bench
  • hive-benchmarks
  • hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
  • YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

Machine learning and Big Data analytics

  • Apache Mahout
  • Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning
  • MLlib - MLlib is Apache Spark's scalable machine learning library.
  • R - R is a free software environment for statistical computing and graphics.
  • RHadoop - Includes RHDFS, RHBase, RMR2 and plyrmr
  • RHive - For launching Hive queries from R
  • Apache Lens
  • Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets

Misc.

Resources

Various resources, such as books, websites and articles.

Websites

Useful websites and articles

Presentations

Books

Hadoop and Big Data Events

Other Awesome Lists

Other amazingly awesome lists can be found in the awesome-awesomeness and awesome lists.