Quickly build an arbitrary-size Hadoop cluster, Spark cluster, and Hive based on Docker

1. Project Introduction
2. BDP-Cluster-Docker image Introduction
3. Steps to build a 3-node Cluster and Hive MySQL MetaStore
4. Steps to build an arbitrary size Cluster

##1. Project Introduction

The objective of this project is to help Hadoop, Spark and Hive developers quickly build a cluster of arbitrary size on their local host. This is achieved by using Docker.

This project is based on the kiwenlau/hadoop-cluster-docker project. In addition, I've added Spark and Hive and changed the base OS from Ubuntu-15.04 to CentOS-6.

##2. BDP-Cluster-Docker image Introduction

BDP is short for Big Data Platform.

In this project, I developed 5 docker images: serf-dnsmasq, hive-mysql, hadoop-base, hadoop-master and hadoop-slave.

#####1. serf-dnsmasq

Based on centos:6. serf and dnsmasq are installed to provide DNS service for the Hadoop cluster.

#####2. hive-mysql

For the Hive metastore. Based on serf-dnsmasq. Installed:


#####3. hadoop-base

Based on serf-dnsmasq. Installed:

Apache Hadoop 2.6.4
Apache Hive 1.2.1
Apache Spark 1.5.2

#####4. hadoop-master

Based on hadoop-base. Runs:

Hadoop master node
Spark master node
Hive CLI
Hive HWI
Hive hiveserver2


#####5. hadoop-slave

Based on hadoop-base. Runs:

Hadoop slave node
Spark slave node

##3. Steps to build a 3-node Cluster and Hive MySQL MetaStore

#####a. clone source code

This is a large repo, so cloning could take a while.

git clone https://github.com/JoeWoo/hadoop-spark-hive-cluster-docker

Download the Hadoop, Spark and Hive binary files:

cd hadoop-spark-hive-cluster-docker/hadoop-base/files

curl -Lso hadoop-2.6.4.tar.gz http://ftp.tsukuba.wide.ad.jp/software/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
tar -zxvf hadoop-2.6.4.tar.gz
rm hadoop-2.6.4.tar.gz

curl -Lso spark-1.5.2-bin-hadoop2.6.tgz http://mirror.cogentco.com/pub/apache/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
tar -zxvf spark-1.5.2-bin-hadoop2.6.tgz
rm spark-1.5.2-bin-hadoop2.6.tgz

curl -Lso apache-hive-1.2.1-bin.tar.gz http://mirror.tcpdiag.net/apache/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz
tar -zxvf apache-hive-1.2.1-bin.tar.gz
rm apache-hive-1.2.1-bin.tar.gz
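
The three downloads above repeat the same fetch, extract, delete pattern. As a minimal sketch, the pattern could be factored into one helper. The function names `archive_name` and `fetch_and_unpack` are illustrative and are not part of the repository; `-f` is added to `curl` here (an assumption, not in the original commands) so that an HTTP error from a mirror fails early instead of saving an error page.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the repository.

# Print the archive file name, i.e. everything after the last "/" of the URL.
archive_name() {
  printf '%s\n' "${1##*/}"
}

# Download one archive into the current directory, unpack it, then delete it.
fetch_and_unpack() {
  local url="$1"
  local archive
  archive="$(archive_name "$url")"
  # -f: fail on HTTP errors, -L: follow redirects, -s: silent, -o: output file
  curl -f -L -s -o "$archive" "$url" || return 1
  tar -zxf "$archive" || return 1
  rm -f "$archive"
}

# Example, run inside hadoop-base/files:
# fetch_and_unpack http://mirror.tcpdiag.net/apache/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz
```

If a mirror is no longer reachable, only the URL passed to the helper needs to change.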

#####b. build images

 cd hadoop-spark-hive-cluster-docker

#####c. check building result

$ docker images


joewoo/hadoop-slave                     1.0                 29dcd2a776f5        About an hour ago   1.373 GB
joewoo/hadoop-master                    1.0                 f9b3bdaa3db6        About an hour ago   1.373 GB
joewoo/hadoop-base                      1.0                 291016ffe302        About an hour ago   1.373 GB
joewoo/hive-mysql                       1.0                 0e5af7734696        2 hours ago         570.8 MB
joewoo/serf-dnsmasq                     1.0                 f055d3ced087        2 hours ago         335.4 MB


  • If you use boot2docker, install bash first:
> S
> Enter starting chars of desired extension, e.g. abi: bash
> tce - Tiny Core Extension browser
>	 1. bash-completion.tcz
>	 2. bash.tcz
> Enter selection ( 1 - 2 ) or (q)uit: 1
> A)bout I)nstall O)nDemand D)epends T)ree F)iles siZ)e L)ist S)earch P)rovides K)eywords or Q)uit: I
> exit
  • In China, you can speed up pulling the centos:6 image by using a Docker Hub mirror such as DaoCloud. Sometimes yum install also fails; if so, just try one more time.

#####d. run container

 cd hadoop-spark-hive-cluster-docker


start master container...
start hive-mysql container...
start slave1 container...
start slave2 container...
  • This starts 4 containers: 1 master, 2 slaves and 1 mysql.
  • After all containers have started, you will be in the /root directory of the master container.

List the files inside the /root directory of the master container:



hdfs  run-wordcount.sh    serf_log  start-hadoop.sh  start-spark.sh start-hive.sh  start-ssh-serf.sh

#####e. test serf and dnsmasq service

  • In fact, you can skip this step and just wait about 1 minute. Serf and dnsmasq need some time to start their services.

List all nodes of the Hadoop cluster:

serf members


master.bdp.com  alive
slave1.bdp.com  alive
slave2.bdp.com  alive
mysql.bdp.com  alive
  • If any node does not show up, wait for a while: the serf agent needs time to recognize all nodes.
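
Instead of re-running `serf members` by hand, the wait can be scripted. The following is a minimal sketch, assuming it runs inside the master container where the `serf` command is available; `count_alive` and `wait_for_members` are hypothetical helper names, and the default of 30 tries with a 2 second delay is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and default timings are assumptions.

# Read `serf members` output on stdin and print how many lines
# contain a field that is exactly "alive".
count_alive() {
  awk '{ for (i = 1; i <= NF; i++) if ($i == "alive") { n++; break } }
       END { print n + 0 }'
}

# Poll `serf members` until at least <expected> nodes are alive.
# Usage: wait_for_members <expected> [tries] [delay-seconds]
wait_for_members() {
  local expected="$1"
  local tries="${2:-30}"
  local delay="${3:-2}"
  local i=0
  local alive
  while [ "$i" -lt "$tries" ]; do
    alive="$(serf members 2>/dev/null | count_alive)"
    if [ "$alive" -ge "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example for the 3-node setup (master, slave1, slave2 and mysql):
# wait_for_members 4 && echo "all nodes are alive"
```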

test ssh

ssh slave2.bdp.com

exit slave2 node



Connection to slave2.bdp.com closed.
  • Please wait for a while if ssh fails; dnsmasq needs time to configure the domain name resolution service.
  • You can start Hadoop after these tests!
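
Because ssh may fail until dnsmasq is ready, the check can be wrapped in a small retry loop. This is a sketch under assumptions: `retry` is a hypothetical helper, not something shipped in the images, and the attempt count and delay in the example are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch only: `retry` is a hypothetical helper, not part of the images.

# Run a command until it succeeds, at most <attempts> times,
# sleeping <delay> seconds between attempts.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1"
  local delay="$2"
  local n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Example: try up to 10 times, 3 seconds apart, to run `true` on slave2.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# retry 10 3 ssh -o BatchMode=yes slave2.bdp.com true
```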

#####f. start hadoop


#####g. run wordcount



input file1.txt:
Hello Hadoop

input file2.txt:
Hello Docker

wordcount output:
Docker    1
Hadoop    1
Hello    2
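
The expected wordcount output can be cross-checked locally without Hadoop. The pipeline below is a plain shell sketch of the same counting logic, not the MapReduce job itself; `local_wordcount` is a hypothetical name, and the tab separator is an assumption about the output layout.

```shell
#!/usr/bin/env bash
# Sketch only: a local stand-in for the word counting logic.

# Read text on stdin and print one "word<TAB>count" line per distinct word,
# sorted by word in byte order.
local_wordcount() {
  tr -s '[:space:]' '\n' |
    awk 'NF { count[$1]++ } END { for (w in count) printf "%s\t%d\n", w, count[w] }' |
    LC_ALL=C sort
}

# Example with the contents of the two input files shown above:
# printf 'Hello Hadoop\nHello Docker\n' | local_wordcount
```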

#####h. start spark


#####i. run spark example

>val file=sc.textFile("hdfs://master.bdp.com:9000/user/root/input/file2.txt")  
>val count=file.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)  

#####j. start hive


#####k. run hive


##4. Steps to build an arbitrary size Hadoop cluster

#####a. Preparation

  • Follow steps a~b of section 3: clone the source code and build the images.

#####b. rebuild hadoop-master

./resize-cluster.sh 5
  • You can use any positive integer as the parameter for resize-cluster.sh: 1, 2, 3, 4, 5, 6...

#####c. start container

./start-container.sh 5
  • Use the same parameter as in step b.
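
To know which nodes to expect from `serf members` after resizing, the slave host names can be listed for a given parameter. This sketch assumes the parameter counts the master plus the slaves (as in section 3, where 3 nodes means 1 master and 2 slaves) and that slaves are named `slaveN.bdp.com` as shown earlier; `slave_hostnames` is a hypothetical helper and does not reproduce what resize-cluster.sh does internally.

```shell
#!/usr/bin/env bash
# Sketch only: assumes <total> = 1 master + (total - 1) slaves.

# Print the expected slave host names for a cluster of <total> nodes.
slave_hostnames() {
  local total="$1"
  local i=1
  case "$total" in
    '' | *[!0-9]*)
      echo "usage: slave_hostnames <total-node-count>" >&2
      return 1
      ;;
  esac
  while [ "$i" -lt "$total" ]; do
    echo "slave${i}.bdp.com"
    i=$((i + 1))
  done
}

# Example: a 5-node cluster would have slave1 through slave4.
# slave_hostnames 5
```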

#####d. run the cluster

  • Follow steps d~k of section 3: test serf and dnsmasq, start Hadoop and run wordcount.
  • Please test the serf and dnsmasq services before starting Hadoop.