A "stackable" Hadoop network with simple setup and teardown!
Explore the docs » · View Demo · Report Bug · Request Feature
Hadoop is a distributed computing solution from the Apache Software Foundation, first released in 2006. Hadoop spreads data across the nodes of a cluster and processes it with the MapReduce programming model. The main components of Hadoop are:
- Namenode: The master node, which assigns tasks to the other nodes.
- Datanode(s): Slaves to the namenode, which carry out computation.
- Hadoop Distributed File System (HDFS): A file system accessible by every node in the cluster (illustrated below).
- Yet Another Resource Negotiator (YARN): Manages resources and scheduling for the network.
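As a quick illustration of HDFS being shared across the cluster, the standard hdfs CLI can read and write it from any node once the network is up (a minimal sketch; the paths and file names are illustrative, not part of this repo):

hdfs dfs -mkdir -p /user/demo          # create a directory in the shared file system
hdfs dfs -put words.txt /user/demo/    # copy a local file into HDFS
hdfs dfs -ls /user/demo                # every node sees the same listing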
Additionally, many supplementary software solutions exist for Hadoop (Look Here), but we will be focusing on the following:
- Spark: A multi-language engine which can sit on top of Hadoop or replace it entirely. Mainly used from Scala or Python.
- Hive: A data warehouse solution, accessible over JDBC, which allows the user to query data in HDFS using SQL.
- Pig: A simple shell for running MapReduce jobs against HDFS or locally.
Hadoop can be distributed over either a physical or virtual cluster, but it traditionally requires SSH to move data and control messages between nodes. SSH must be configured on every node, a process that is at best semi-automated and often completely manual, and the cluster must agree on a fixed set of known hosts. This is a delicate task which can take many man-hours to complete.
Employ Docker to containerize this process.
Docker utilizes images which can be deployed on almost any OS. Images are built in layers: instructions are loaded into the image at build time, so a container holds only the software the system actually needs. When the image is deployed it follows the stored recipe, allowing for simple configuration without the fear of dependency issues. These configured containers can be stopped and started with a simple docker start or docker stop. Because each node in our system requires Hadoop to be installed, we can reuse a single Hadoop image with differing startup commands.
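A sketch of that one-image, many-roles idea, where only the startup command differs (the image tag here is an assumption; hdfs namenode and hdfs datanode are the standard Hadoop commands that run those daemons in the foreground):

docker run -d --name namenode docker-hadoop hdfs namenode    # same image, acts as master
docker run -d --name datanode1 docker-hadoop hdfs datanode   # same image, acts as worker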
Docker Compose simplifies this process even further by allowing YAML-based configuration of multiple nodes at once. Additionally, all deployed nodes are placed on a subnet created at runtime, which means each node can "see" the others by hostname and can exchange instructions without the use of SSH. Composing creates a stack which can be monitored and managed via a single interface, and nodes can be scaled up or down depending on user needs. Because everything is virtualized, there is no need to fear breaking the network: the stack can simply be rebuilt.
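The whole lifecycle reduces to a few commands (the datanode service name is an assumption; any service defined in the compose file can be scaled the same way):

docker-compose up -d                       # build the stack and bring it online
docker-compose ps                          # monitor every node from one interface
docker-compose up -d --scale datanode=3    # scale a service up or down
docker-compose down                        # tear the whole network back down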
To get a local copy up and running, follow these simple steps.
- docker-compose (installable with pip install docker-compose)
- VirtualBox
- Git
Clone the repo (that's it!)
git clone https://github.com/ppfenning/docker-hadoop.git
- Docker Hadoop: The base image which the other images are built from. It uses the latest release of Ubuntu and installs only Java 8 and Hadoop 3.3.3. Note that all software versions can be changed via configuration files.
- Docker Pig: Installs Pig 0.17.0.
- Docker Hive: Installs Hive 3.1.3.
- Namenode: First node to start; a prerequisite for every other node type. Uses docker-hadoop.
- Datanode: Slave to the namenode, intended to be scaled up or down based on workload. Uses docker-hadoop.
- Resource Manager: YARN node which tracks tasks from the namenode. Uses docker-hadoop.
- Node Manager: YARN node which tracks activity and heartbeat of all other nodes. Uses docker-hadoop.
- Pignode: Single node which sits atop HDFS. Add-on to the namenode with access to the grunt shell (see the example after this list). Uses docker-hadoop-pig.
- Hivenode: Single node which sits atop HDFS. Add-on to the namenode with access to the beeline shell. Uses docker-hadoop-hive.
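Once the stack is online, the add-on shells can be reached with docker exec (the container names below are assumptions based on the node names above):

docker exec -it pignode pig         # opens Pig's grunt shell
docker exec -it hivenode beeline    # opens Hive's JDBC shell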
- Change your working directory to the downloaded repo
- Make the network:
  - Build local images from scratch: make
  - Pull from prebuilt images: make LOCAL=0
NOTE: This command will build all necessary images and bring the network online. The initial run takes a few minutes; if you have a stored image, the process takes only a few seconds.
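To confirm the network came up, the stack can be inspected with standard Docker tooling (the container name filter is an assumption):

docker-compose ps                   # list every node in the stack and its state
docker ps --filter name=namenode    # or check a single container directly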
- Scale Datanodes:
make datanode-scaled WORKERS={DESIRED NODE COUNT}
NOTE: Because docker-compose builds containers in parallel, the scale option fails when port mapping is required. To solve this, use docker build and manually assign the container name and port.
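A manual fallback along those lines might look like the following, once the image exists (the container name, host port, and the datanode's 9864 web UI port are illustrative assumptions, not the repo's actual mapping):

docker run -d --name datanode4 -p 9865:9864 docker-hadoop hdfs datanode   # explicit name and port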
BUG: I am unsure how to suppress the orphans warning on docker-compose v2+.
- Hadoop single namenode
- Namenode with single datanode
- Namenode with 2 datanodes
- Namenode with N datanodes (max 6)
- Resource and Node managers
- History server
- Pig terminal node
- Hive terminal node
- Scale datanodes to active network
- Example Compose files to run jobs
- Terminal endpoints for
- Namenode
- Pig
- Hive
- Build with spark backend
- Create CLI rather than just Makefile
- Dynamic node scaling
- Hive and Pig as "plug-ins" to namenode
- Deploy to Kubernetes
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Copyright 2022 Patrick Pfenning
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
- Patrick Pfenning - Data Science Master's candidate at Wentworth - ppfenning@wit.edu
- Github: ppfenning
- Project Link: docker-hadoop
The following resources were a huge help along the way and deserve credit:
This project was forked from Big Data Europe's repo. I couldn't have completed this without the base!
Hadoop, Spark, Pig and Hive are all open source Apache solutions!
I learned a ton about Docker in this project...
Helped a ton with property setup
My professor for Big Data Systems