
Apache Hadoop Docker image using Ubuntu



Docker Hadoop


A "stackable" Hadoop network with simple setup and teardown!
Explore the docs »

View Demo · Report Bug · Request Feature

About The Project


Hadoop is a distributed computing framework from the Apache Software Foundation, first released in 2006. Hadoop relies on the nodes of a cluster to distribute data and computation via the MapReduce programming model. The main components of Hadoop are:

  • Namenode: The master node, which distributes tasks to the other nodes.

  • Datanode(s): Slaves to the namenode, which carry out the computation.

  • Hadoop Distributed File System (HDFS): A file system which is accessible by all nodes in the cluster (see the example after this list).

  • Yet Another Resource Negotiator (YARN): Manages resources and scheduling for the network.
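
For example, once the cluster is online, a file written from any node is visible to every other node through the hdfs shell (the paths below are illustrative):

hdfs dfs -mkdir -p /user/demo      # create a directory in the shared file system
hdfs dfs -put data.csv /user/demo  # upload a local file into HDFS
hdfs dfs -ls /user/demo            # the same listing is visible from every node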

Additionally, there exist many supplementary software solutions for Hadoop (Look Here), but we will focus on the following:

  • Spark: A multi-language engine which can sit on top of Hadoop or replace it entirely. Mainly used as a Scala or Python tool.
  • Hive: A data warehouse solution with a JDBC interface which allows the user to query data in HDFS using SQL.
  • Pig: A simple shell for running MapReduce jobs on HDFS or locally.

Proposal

Problem

Hadoop can be distributed over either a physical or virtual cluster, but it most commonly requires SSH to transfer data between nodes. SSH must be configured on every node, a step that is often only semi-automated or completely manual, and the cluster must be given a defined set of home nodes. This is delicate work which can take many person-hours to complete.

Solution

Employ Docker to containerize this process.

Why Docker?

Docker utilizes images which can be deployed on almost any OS. An image is built up in layers: instructions are loaded in at the build step, so each container carries only the software the system needs. When the image is deployed it follows the stored recipe, allowing for simple configuration without the fear of dependency issues. The configured containers can then be stopped and started with a single command (docker compose up or docker compose down). Because each node in our system requires Hadoop, we can reuse one Hadoop image with differing startup commands.
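
As a minimal sketch of that reuse (the image name and startup commands are illustrative, not necessarily this repo's actual entrypoints):

docker build -t docker-hadoop .                             # bake Java + Hadoop in once
docker run -d --name namenode docker-hadoop hdfs namenode   # same image, master daemon
docker run -d --name datanode1 docker-hadoop hdfs datanode  # same image, worker daemon
docker stop namenode datanode1                              # teardown; the image stays cached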

Docker Compose simplifies this process even further by allowing YAML-based configuration of multiple nodes at once. Additionally, all deployed nodes are placed on a subnet created at runtime, which means each node can "see" the others and give instructions without the use of SSH. Composing creates a stack which can be monitored and managed via a single interface, and nodes can be scaled up or down depending on user needs. Because the network is virtualized, there is no need to fear breaking it: the stack can simply be rebuilt.
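
The whole lifecycle then collapses to a few commands. In this sketch the datanode service name matches the container list below, but the network name is an assumption, since Compose derives it from the project directory:

docker compose up -d --scale datanode=3       # bring the stack online with three datanodes
docker network inspect docker-hadoop_default  # the runtime subnet every node shares
docker compose down                           # tear the whole stack back down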

(back to top)

Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

Docker (with Docker Compose) and GNU Make are needed to build and run the network. Prior Hadoop experience is helpful (but not needed).

Installation

Clone the repo (that's it!)

git clone https://github.com/ppfenning/docker-hadoop.git

Images:

  1. Docker Hadoop: The base image from which the other images are built. It uses the latest release of Ubuntu and installs only Java 8 and Hadoop 3.3.3. It should be noted that all software versions can be changed via configuration files (a hypothetical example follows this list).
  2. Docker Pig: Installs Pig 0.17.0.
  3. Docker Hive: Installs Hive 3.1.3.
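
For instance, if the versions were exposed as build arguments (a hypothetical mechanism; check the repo's configuration files for the real one), a rebuild against a different release might look like:

docker build --build-arg HADOOP_VERSION=3.3.4 -t docker-hadoop .  # hypothetical version override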

Containers:

  1. Namenode: First node to start. Prerequisite for any other node type. Uses docker-hadoop.
  2. Datanode: Slave to namenode. Intended to be scaled up or down based on workload. Uses docker-hadoop.
  3. Resource Manager: YARN node which tracks tasks from namenode. Uses docker-hadoop.
  4. Node Manager: YARN node which tracks activity and heartbeat of all other nodes. Uses docker-hadoop.
  5. Pignode: Single node which sits atop HDFS. Add-on to the namenode with access to the grunt shell. Uses docker-hadoop-pig.
  6. Hivenode: Single node which sits atop HDFS. Add-on to the namenode with access to the beeline shell. Uses docker-hadoop-hive.
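
Once the stack is up, the add-on shells can be reached with docker exec. The container names here are assumptions; use docker ps to find the real ones:

docker exec -it pignode pig                                       # opens Pig's grunt shell
docker exec -it hivenode beeline -u jdbc:hive2://localhost:10000  # Hive's beeline shell over JDBC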

Deployment:

Create

  1. Change your working directory to the downloaded repo
  2. Make the network:
    • Build local images from scratch: make
    • Pull from prebuilt images: make LOCAL=0


NOTE: This command will build all necessary images and bring the network online. This takes a few minutes on the initial run. However, if you have a stored image, this process takes only a few seconds.

Scale Datanodes

  • Scale Datanodes: make datanode-scaled WORKERS={DESIRED NODE COUNT}
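
For example, to bring the cluster to four datanodes:

make datanode-scaled WORKERS=4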


NOTE: Because docker-compose builds containers in parallel, the scale option fails when port mapping is required. To solve this, use docker run and manually assign the container name and port.
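
A hypothetical version of that workaround (the network name and host port are assumptions, and the startup command mirrors the sketch above rather than the repo's actual entrypoint; 9864 is the datanode's default HTTP port in Hadoop 3):

docker run -d --name datanode4 --network docker-hadoop_default \
  -p 9865:9864 docker-hadoop hdfs datanode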

BUG: I am unsure how to suppress the orphans warning on docker-compose v2+

(back to top)

TODO

  • Hadoop single namenode
  • Namenode with single datanode
  • Namenode with 2 datanodes
  • Namenode with N datanodes (max 6)
  • Resource and Node managers
  • History server
  • Pig terminal node
  • Hive terminal node
  • Scale datanodes to active network
  • Example Compose files to run jobs
  • Terminal endpoints for
    • Namenode
    • Pig
    • Hive
  • Build with spark backend
  • Create CLI rather than just Makefile
  • Dynamic node scaling
  • Hive and Pig as "plug-ins" to namenode
  • Deploy to Kubernetes

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Copyright 2022 Patrick Pfenning

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

(back to top)

Contact

(back to top)

Acknowledgments

A few resources that were helpful and deserve credit:

This project was forked from Big Data Europe's repo. I couldn't have completed this without the base!

Hadoop, Spark, Pig and Hive are all open source Apache solutions!

I learned a ton about Docker in this project...

Helped a ton with property setup

My professor for Big Data Systems

(back to top)