Create three virtual machines (Ubuntu 18.04) on Google Cloud
- hadoop-master: 137.116.150.214
- hadoop-worker1: 104.215.184.216
- hadoop-worker2: 13.76.26.55
-
Create 3 virtual machines (1 master, 2 workers) and open every port on all virtual machines.
-
SSH to each virtual machine.
-
Create a non-root user and switch to that user. (In this guide the username is hadoop; if you use another username, make sure you use it instead of hadoop in the later steps.)
sudo adduser hadoop
sudo adduser hadoop sudo
sudo su - hadoop
-
Update Ubuntu to the latest version. (All nodes)
sudo apt-get update && sudo apt-get -y dist-upgrade
-
Install Headless version of Java for Ubuntu. (All nodes)
sudo apt-get -y install openjdk-8-jdk-headless
-
Create a directory for hadoop to be installed. (All nodes)
mkdir hadoop && cd hadoop
-
Download and unzip Hadoop version 3.1.4 archive. (All nodes)
wget https://downloads.apache.org/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz
tar xvzf hadoop-3.1.4.tar.gz
-
Setup JAVA_HOME and other environments. (All nodes)
vi ~/hadoop/hadoop-3.1.4/etc/hadoop/hadoop-env.sh
At the top of the file, add these environment variables:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER="hadoop"
export HDFS_DATANODE_USER="hadoop"
export HDFS_SECONDARYNAMENODE_USER="hadoop"
export YARN_RESOURCEMANAGER_USER="hadoop"
export YARN_NODEMANAGER_USER="hadoop"
Save and exit the file, then activate the environment variables:
source ~/hadoop/hadoop-3.1.4/etc/hadoop/hadoop-env.sh
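(Optional) As a quick sanity check, confirm that JAVA_HOME is now set and that the Hadoop binary runs:
echo $JAVA_HOME
~/hadoop/hadoop-3.1.4/bin/hadoop version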
-
Create a directory for HDFS to store its important files. (All nodes)
sudo mkdir -p /usr/local/hadoop/hdfs/data
-
Set the ownership of the directory. (All nodes)
sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs/data
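(Optional) Verify the ownership; the directory should now belong to the hadoop user:
ls -ld /usr/local/hadoop/hdfs/data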
-
Update the core-site.xml file. (All nodes)
vi ~/hadoop/hadoop-3.1.4/etc/hadoop/core-site.xml
On the master node, change the content to be like this. (The IP in the value tag is 0.0.0.0 so that the NameNode accepts external connections.)
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9000/</value>
  </property>
</configuration>
On the worker (slave) nodes, change the content to be like this. (The IP in the value tag is the IP of the master node.)
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://137.116.150.214:9000/</value>
  </property>
</configuration>
-
Create a public/private key pair on every node. (All nodes)
ssh-keygen
-
Copy public key from master node. (Master node)
cat ~/.ssh/id_rsa.pub
Copy the printed content of this file; you will need it in the next step.
-
Paste master node’s public key into authorized_keys file of every node. (All nodes)
vi ~/.ssh/authorized_keys
Make sure that you paste the master node's public key into the authorized_keys file on every node.
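(Optional alternative) If password authentication is enabled on the nodes (it is often disabled on cloud VMs), you can push the master's key with ssh-copy-id instead of pasting it by hand; the IPs here are the worker IPs from this example:
ssh-copy-id hadoop@104.215.184.216
ssh-copy-id hadoop@13.76.26.55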
-
On hadoop-master, open ssh configuration file. (Master node)
vi ~/.ssh/config
Add content like this. (Host and HostName are the IPs of your VMs.)
Host 137.116.150.214
  HostName 137.116.150.214
  User hadoop
  IdentityFile ~/.ssh/id_rsa

Host 104.215.184.216
  HostName 104.215.184.216
  User hadoop
  IdentityFile ~/.ssh/id_rsa

Host 13.76.26.55
  HostName 13.76.26.55
  User hadoop
  IdentityFile ~/.ssh/id_rsa
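(Optional) Confirm passwordless SSH from the master to itself and to each worker; each command should print the remote hostname without asking for a password:
ssh localhost hostname
ssh 104.215.184.216 hostname
ssh 13.76.26.55 hostname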
-
Config hdfs-site.xml on master node. (Master node)
vi ~/hadoop/hadoop-3.1.4/etc/hadoop/hdfs-site.xml
Change the content to be like this. (The IP in the value is the IP of one of your worker nodes, which will act as the SecondaryNameNode.)
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hdfs/data</value>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>104.215.184.216:50090</value>
  </property>
</configuration>
-
Config mapred-site.xml on master node. (Master node)
vi ~/hadoop/hadoop-3.1.4/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>137.116.150.214:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
</configuration>
-
Config yarn-site.xml on master node. (Master node)
vi ~/hadoop/hadoop-3.1.4/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <!-- Site specific YARN configuration properties -->
</configuration>
-
Config masters file on master node. (Master node)
vi ~/hadoop/hadoop-3.1.4/etc/hadoop/masters
137.116.150.214
-
Config workers file on master node. (Master node)
vi ~/hadoop/hadoop-3.1.4/etc/hadoop/workers
localhost
104.215.184.216
13.76.26.55
-
Config the yarn-site.xml file on worker nodes. (All worker nodes) (The IP in the value is the IP of the master node, where the ResourceManager runs.)
vi ~/hadoop/hadoop-3.1.4/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>137.116.150.214</value>
  </property>
</configuration>
-
Config hosts file. (All nodes)
sudo vi /etc/hosts
127.0.0.1 localhost
137.116.150.214 hadoop-master
104.215.184.216 hadoop-worker-01
13.76.26.55 hadoop-worker-02
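(Optional) Check that the host names resolve to the IPs you just added:
getent hosts hadoop-master hadoop-worker-01 hadoop-worker-02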
-
Set up command aliases. (All nodes)
vi ~/.bashrc
Add this content at the top of the file:
alias hadoop="~/hadoop/hadoop-3.1.4/bin/hadoop"
alias hadoop_start="~/hadoop/hadoop-3.1.4/sbin/start-all.sh"
alias hadoop_stop="~/hadoop/hadoop-3.1.4/sbin/stop-all.sh"
alias hadoop_clear="sudo ~/hadoop/hadoop-3.1.4/bin/hdfs namenode -format"
alias hadoop_logs="cd ~/hadoop/hadoop-3.1.4/logs"
alias hadoop_setting="cd ~/hadoop/hadoop-3.1.4/etc/hadoop"
activate the .bashrc file
source ~/.bashrc
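(Optional) The hadoop alias should now work from any directory:
hadoop version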
-
Start Hadoop (on master node)
You can go to the Hadoop installation directory to start Hadoop, or use the aliases defined above.
cd ~/hadoop/hadoop-3.1.4/
sudo ./bin/hdfs namenode -format
sudo ./sbin/start-all.sh
(Optional: start Hadoop with the alias commands)
(hadoop_clear formats the NameNode; run it only once, or when you want to clear the NameNode)
hadoop_clear
hadoop_start
-
Test services of Hadoop
After you run
jps
On the master node you should see something like this:
10867 NodeManager
11411 Jps
10149 NameNode
10329 DataNode
10681 ResourceManager
On a worker node, you should see something like this (SecondaryNameNode runs only on the node you set as the secondary NameNode):
24086 NodeManager
23928 SecondaryNameNode
23741 DataNode
24461 Jps
You can open your master node's IP on port 9870 to see the cluster information (on the Datanodes page, you should see 3 DataNodes running).
http://137.116.150.214:9870/
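(Optional) You can also check the cluster state from the command line; the report should list three live DataNodes:
~/hadoop/hadoop-3.1.4/bin/hdfs dfsadmin -report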
All of these commands are executed on the master node. (They are similar to standard Linux commands.)
-
Create directories (mkdir)
hadoop fs -mkdir /input
hadoop fs -mkdir /output
-
List all directories and files in HDFS. (You should see /input and /output.) (ls)
hadoop fs -ls /
-
Put a file from local into HDFS. (You need to create the file inside your VM first.)
cd ~
vi input_1
You can put any text inside the file
hello world boy hello boy test
hadoop fs -put input_1 /input
-
See content of file (cat)
hadoop fs -cat /input/input_1
-
Remove a directory (rm -r). (Plain -rm removes only files.)
hadoop fs -rm -r /output
-
Remove directory and all files inside directory (rm -R)
hadoop fs -rm -R /input
All of the following commands are run in the user's home directory (cd ~).
-
Download the hadoop-core jar (used as the compile-time classpath for WordCount)
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/1.2.1/hadoop-core-1.2.1.jar
-
Create WordCount.java
vi WordCount.java
copy content from this link into file
https://www.dropbox.com/s/yp9i7nwmgzr3nkx/WordCount.java?dl=0
-
Create input file and put it into hdfs
vi input_1
hello world boy hello boy test
hadoop fs -mkdir /input
hadoop fs -put input_1 /input
-
Create jar file from WordCount.java
mkdir mapR
javac -classpath hadoop-core-1.2.1.jar -d mapR WordCount.java
jar -cvf WordCount.jar -C mapR/ .
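(Optional) List the contents of the jar to confirm the WordCount classes were packaged:
jar -tf WordCount.jar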
-
Run the jar file. (All files in /input in HDFS are used as the input of the program; the /output directory must not exist yet, or the job will fail.)
hadoop jar WordCount.jar WordCount /input /output
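(Optional) While the job is running, you can watch it in the YARN ResourceManager web UI on port 8088 of the master node, or list running applications from the command line:
~/hadoop/hadoop-3.1.4/bin/yarn application -list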
-
After the MapReduce job finishes, you can see the output in the /output directory.
hadoop fs -ls /output
hadoop fs -cat /output/part-r-00000
The output is a mapping between word and count of that word
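(Optional) If you want a local copy of the result, you can pull it out of HDFS (the local file name here is just an example):
hadoop fs -get /output/part-r-00000 ~/wordcount_result.txt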
If you start Hadoop but some service is missing, you can check the error logs.
This command changes to the log directory (it contains the logs of all services):
hadoop_logs
-
Stop hadoop and format name node on master node
hadoop_stop
hadoop_clear
-
Delete dfs folder on all nodes
cd /tmp/hadoop-hadoop
rm -rf dfs
-
Start hadoop on master node
hadoop_start
Now when you run the jps command, you should see the DataNode running.