os-project-hadoop


CSC217 Operating System Individual Project (Hadoop)

62130500212 Thanaphon Sombunkaeo

Virtual Machines

Create three virtual machines (Ubuntu 18.04) on Google Cloud:

  • hadoop-master: 137.116.150.214
  • hadoop-worker1: 104.215.184.216
  • hadoop-worker2: 13.76.26.55

Steps to set up Hadoop

  1. Create 3 virtual machines (1 master, 2 workers) and open all ports for every virtual machine.

  2. SSH to each virtual machine.

  3. Create a non-root user and switch to that user. (In this case the user is named hadoop; if you use a different username, make sure to use it instead of hadoop in the following steps.)

    sudo adduser hadoop
    sudo adduser hadoop sudo
    sudo su - hadoop
  4. Update Ubuntu to the latest version. (All nodes)

    sudo apt-get update && sudo apt-get -y dist-upgrade
  5. Install the headless version of Java (OpenJDK 8) on Ubuntu. (All nodes)

    sudo apt-get -y install openjdk-8-jdk-headless
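
    As a quick sanity check (assuming the package installs under the usual /usr/lib/jvm path), you can verify the JDK and locate the directory used later as JAVA_HOME:

    # Should report OpenJDK 1.8; the directory below is used as JAVA_HOME in step 8
    java -version
    ls -d /usr/lib/jvm/java-8-openjdk-amd64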
  6. Create a directory for Hadoop to be installed in. (All nodes)

    mkdir hadoop && cd hadoop
  7. Download and extract the Hadoop 3.1.4 archive. (All nodes)

    wget https://downloads.apache.org/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz
    
    tar xvzf hadoop-3.1.4.tar.gz
  8. Set up JAVA_HOME and other environment variables. (All nodes)

    vi ~/hadoop/hadoop-3.1.4/etc/hadoop/hadoop-env.sh

    At the top of the file, add the following environment variables.

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HDFS_NAMENODE_USER="hadoop"
    export HDFS_DATANODE_USER="hadoop"
    export HDFS_SECONDARYNAMENODE_USER="hadoop"
    export YARN_RESOURCEMANAGER_USER="hadoop"
    export YARN_NODEMANAGER_USER="hadoop"

    Save and exit the file, then load the environment variables.

    source ~/hadoop/hadoop-3.1.4/etc/hadoop/hadoop-env.sh
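
    As a quick check, JAVA_HOME should now point to the JDK directory:

    # Expected output: /usr/lib/jvm/java-8-openjdk-amd64
    echo $JAVA_HOME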
  9. Create a directory for HDFS to store its data. (All nodes)

    sudo mkdir -p /usr/local/hadoop/hdfs/data
  10. Set the ownership of the directory. (All nodes)

    sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs/data
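
    You can verify the ownership with:

    # The owner and group should both be hadoop
    ls -ld /usr/local/hadoop/hdfs/data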
  11. Update the core-site.xml file. (All nodes)

    vi ~/hadoop/hadoop-3.1.4/etc/hadoop/core-site.xml

    On the master node, change the content to the following. (The IP in the value tag is 0.0.0.0 so that the NameNode accepts connections from external IPs.)

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://0.0.0.0:9000/</value>
      </property>
    </configuration>

    On the worker (slave) nodes, change the content to the following. (The IP in the value tag is the IP of the master node.)

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://137.116.150.214:9000/</value>
      </property>
    </configuration>
  12. Create a public/private key pair on every node. (All nodes)

    ssh-keygen
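
    Press Enter at every prompt to accept the defaults. If you prefer a non-interactive variant, a command like the following should produce an equivalent key pair (assuming no key already exists at ~/.ssh/id_rsa):

    # Generate an RSA key pair with an empty passphrase at the default location
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa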
  13. Copy the public key from the master node. (Master node)

    cat ~/.ssh/id_rsa.pub

    Copy the full contents of this file; you will paste it on every node in the next step.

  14. Paste the master node's public key into the authorized_keys file of every node. (All nodes)

    vi ~/.ssh/authorized_keys

    Make sure the master node's public key is pasted into the authorized_keys file of every node (the master node included).
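
    As an alternative to pasting by hand, the key can be appended from the master node over SSH. This is only a sketch and assumes password authentication is still enabled for the hadoop user on the workers:

    # On the master node: authorize its own key (for localhost), then copy it to each worker
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    ssh-copy-id hadoop@104.215.184.216
    ssh-copy-id hadoop@13.76.26.55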

  15. On hadoop-master, open the SSH configuration file. (Master node)

    vi ~/.ssh/config

    Add content like the following. (Host and HostName are the IPs of your VMs.)

    Host 137.116.150.214
        HostName 137.116.150.214
        User hadoop
        IdentityFile ~/.ssh/id_rsa
    
    Host 104.215.184.216
        HostName 104.215.184.216
        User hadoop
        IdentityFile ~/.ssh/id_rsa
    
    Host 13.76.26.55
        HostName 13.76.26.55
        User hadoop
        IdentityFile ~/.ssh/id_rsa
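
    To confirm that passwordless SSH works from the master node, each of the following should print the remote hostname without asking for a password:

    # Run these from the master node
    ssh 137.116.150.214 hostname
    ssh 104.215.184.216 hostname
    ssh 13.76.26.55 hostname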
  16. Configure hdfs-site.xml on the master node. (Master node)

    vi ~/hadoop/hadoop-3.1.4/etc/hadoop/hdfs-site.xml

    Change the content to the following. (The IP in dfs.secondary.http.address is the IP of one of your worker nodes; that node will run the SecondaryNameNode.)

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
      </property>
      <property>
        <name>dfs.secondary.http.address</name>
        <value>104.215.184.216:50090</value>
      </property>
    </configuration>
  17. Configure mapred-site.xml on the master node. (Master node)

    vi ~/hadoop/hadoop-3.1.4/etc/hadoop/mapred-site.xml
    <configuration>
      <property>
        <name>mapreduce.jobtracker.address</name>
        <value>137.116.150.214:54311</value>
      </property>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
      </property>
      <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
      </property>
      <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
      </property>
    </configuration>
  18. Configure yarn-site.xml on the master node. (Master node)

    vi ~/hadoop/hadoop-3.1.4/etc/hadoop/yarn-site.xml
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <!-- Site specific YARN configuration properties -->
    </configuration>
  19. Configure the masters file on the master node. (Master node)

    vi ~/hadoop/hadoop-3.1.4/etc/hadoop/masters
    137.116.150.214
    
  20. Configure the workers file on the master node. (Master node)

    vi ~/hadoop/hadoop-3.1.4/etc/hadoop/workers
    localhost
    104.215.184.216
    13.76.26.55
    
  21. Configure yarn-site.xml on the worker nodes. (All slave nodes) (The IP in yarn.resourcemanager.hostname is the IP of the master node.)

    vi ~/hadoop/hadoop-3.1.4/etc/hadoop/yarn-site.xml
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>137.116.150.214</value>
      </property>
    </configuration>
  22. Configure /etc/hosts. (All nodes)

    sudo vi /etc/hosts
    127.0.0.1 localhost
    137.116.150.214 hadoop-master
    104.215.184.216 hadoop-worker-01
    13.76.26.55 hadoop-worker-02
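
    You can check that the names resolve correctly with:

    # Each command should print the IP address you added above
    getent hosts hadoop-master
    getent hosts hadoop-worker-01
    getent hosts hadoop-worker-02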
    
  23. Set up aliases and environment variables. (All nodes)

    vi ~/.bashrc

    Add this content at the top of the file.

    alias hadoop="~/hadoop/hadoop-3.1.4/bin/hadoop"
    alias hadoop_start="~/hadoop/hadoop-3.1.4/sbin/start-all.sh"
    alias hadoop_stop="~/hadoop/hadoop-3.1.4/sbin/stop-all.sh"
    alias hadoop_clear="sudo ~/hadoop/hadoop-3.1.4/bin/hdfs namenode -format"
    alias hadoop_logs="cd ~/hadoop/hadoop-3.1.4/logs"
    alias hadoop_setting="cd ~/hadoop/hadoop-3.1.4/etc/hadoop"

    Reload the .bashrc file.

    source ~/.bashrc
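
    To confirm the aliases are active, the hadoop alias should now work from any directory:

    # Prints the Hadoop version (3.1.4) if the alias resolves correctly
    hadoop version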
  24. Start Hadoop (on master node)

    You can go to the Hadoop installation directory to start Hadoop, or use the aliases defined above.

    cd ~/hadoop/hadoop-3.1.4/
    sudo ./bin/hdfs namenode -format
    sudo ./sbin/start-all.sh

    (Optional: start Hadoop with the alias commands.)

    (hadoop_clear formats the NameNode; run it only once, or when you want to reset the NameNode.)

    hadoop_clear
    hadoop_start
  25. Test the Hadoop services

    Check the running Java processes with

    jps

    On the master node, you should see something like this:

    10867 NodeManager
    11411 Jps
    10149 NameNode
    10329 DataNode
    10681 ResourceManager
    

    On the worker nodes, you should see something like this. (SecondaryNameNode runs only on the node that you set as the secondary NameNode.)

    24086 NodeManager
    23928 SecondaryNameNode
    23741 DataNode
    24461 Jps
    

    You can browse to your master node's IP on port 9870 to see the cluster information. (On the Datanodes page, you should see 3 DataNodes running.)

    http://137.116.150.214:9870/
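
    If the page does not load in a browser, a quick reachability check (assuming curl is installed) is:

    # An HTTP 200 response means the NameNode web UI is up
    curl -sI http://137.116.150.214:9870/ | head -n 1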
    

Examples of HDFS (Hadoop Distributed File System) commands

All of these commands are executed on the master node. (They are similar to standard Linux commands.)

  1. Create directories (mkdir)

    hadoop fs -mkdir /input
    hadoop fs -mkdir /output
    
  2. List all directories and files in HDFS. (You should see /input and /output.) (ls)

    hadoop fs -ls /
  3. Put a file from the local filesystem into HDFS. (You need to create the file inside your VM first.)

    cd ~
    vi input_1

    You can put any text inside the file, for example:

    hello world boy hello boy test
    
    hadoop fs -put input_1 /input
  4. See the content of a file (cat)

    hadoop fs -cat /input/input_1
  5. Remove an empty directory (rmdir)

    hadoop fs -rmdir /output
  6. Remove a directory and all files inside it (rm -R)

    hadoop fs -rm -R /input

Run a simple MapReduce job (Word Count program)

All of these commands are run in the user's home directory (cd ~).

  1. Download the hadoop-core JAR (used to compile WordCount.java)

    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/1.2.1/hadoop-core-1.2.1.jar
  2. Create WordCount.java

    vi WordCount.java

    Copy the content from this link into the file:

    https://www.dropbox.com/s/yp9i7nwmgzr3nkx/WordCount.java?dl=0

  3. Create an input file and put it into HDFS

    vi input_1
    hello world boy hello boy test
    
    hadoop fs -mkdir /input
    hadoop fs -put input_1 /input
  4. Create a JAR file from WordCount.java

    mkdir mapR
    javac -classpath hadoop-core-1.2.1.jar -d mapR WordCount.java
    jar -cvf WordCount.jar -C mapR/ .
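
    You can list the contents of the JAR to confirm the compiled classes were packaged (the exact class names depend on the WordCount source you copied):

    # Should list WordCount.class and any inner mapper/reducer classes
    jar -tf WordCount.jar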
  5. Run the JAR file. (All files in /input in HDFS will be used as the program's input.)

    hadoop jar WordCount.jar WordCount /input /output
  6. After the MapReduce job finishes, you can see the output in the /output directory

    hadoop fs -ls /output
    hadoop fs -cat /output/part-r-00000

    The output is a mapping between each word and the number of times it appears.
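
    For the sample input used above (hello world boy hello boy test), the output should look roughly like this (word and count separated by a tab):

    boy     2
    hello   2
    test    1
    world   1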

Troubleshooting

If you start Hadoop but some service is missing, you can check the error logs.

This command changes to the log directory. (This directory contains the logs of all services.)

hadoop_logs
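
Log files are named after the daemon, typically in the form hadoop-<user>-<daemon>-<hostname>.log, so for example the DataNode log on a node can be inspected with:

# Show the last 50 lines of this node's DataNode log (run inside the logs directory)
tail -n 50 hadoop-hadoop-datanode-$(hostname).log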

DataNode is not starting

  1. Stop Hadoop and format the NameNode on the master node

    hadoop_stop
    hadoop_clear
  2. Delete the dfs folder on all nodes

    cd /tmp/hadoop-hadoop
    rm -rf dfs
  3. Start Hadoop on the master node

    hadoop_start

Now when you run the jps command, you will see the DataNode running.