📕 Guide to installing Hadoop and Spark on an Oracle virtual machine.

Hadoop Install Guide

This repo describes how to build a Hadoop ecosystem on Oracle virtual machines.

Contents

1. Setting VM

We will create 4 VMs.

VM Spec.

master-node
    - Processor : 2
    - System memory : 4096 MB
    - Video memory : 16 MB

worker-node1~3
    - Processor : 2
    - System memory : 4096 MB
    - Video memory : 16 MB

Two processors are assigned so that we can use Kubernetes later. If you do not need that, the default setting is fine.

Now let's assign a static IP to each virtual machine.

I used a static IP like this:

master-node
    - IP : 192.168.1.10
    - Port : 10
worker-node1 
    - IP : 192.168.1.11
    - Port : 11
worker-node2 
    - IP : 192.168.1.12
    - Port : 12
worker-node3
    - IP : 192.168.1.13
    - Port : 13

This is the network config file path.

$ sudo vi /etc/netplan/00-installer-config.yaml

Open the configuration file and enter the IP address for each node. The following is an example for the master node.

# This is the network config written by 'subiquity'
network:
    ethernets:
        enp0s3:
            addresses:
                - 192.168.1.10/24
            nameservers: 
                addresses: [8.8.8.8, 8.8.4.4]
            routes:
                - to: default
                  via: 192.168.1.1
    version: 2
$ sudo netplan apply

Complete the setup for the remaining virtual machines and apply your changes.

You can check the changed IP using the ifconfig command.
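Since the same file has to be written on every VM with only the address changed, the config can be generated rather than typed by hand. The following is a minimal sketch, not part of the original setup: `render_netplan` is a hypothetical helper name, and the interface, gateway, and DNS values are simply copied from the master-node example, so adjust them if your VM differs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a netplan config for one node.
# Interface, gateway, and DNS values are assumptions taken from the
# master-node example in this guide.
render_netplan() {
    local ip="$1"                 # e.g. 192.168.1.11 for worker-node1
    local iface="${2:-enp0s3}"    # network interface name
    cat <<EOF
network:
    ethernets:
        ${iface}:
            addresses:
                - ${ip}/24
            nameservers:
                addresses: [8.8.8.8, 8.8.4.4]
            routes:
                - to: default
                  via: 192.168.1.1
    version: 2
EOF
}

# Preview the config for worker-node1 before writing it anywhere.
render_netplan 192.168.1.11
```

Once the preview looks right, the output could be piped to `sudo tee /etc/netplan/00-installer-config.yaml` and applied with `sudo netplan apply`.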

Next, set up the hosts file on all nodes.

$ sudo vi /etc/hosts
127.0.0.1 localhost
192.168.1.10 master-node
192.168.1.11 worker-node1
192.168.1.12 worker-node2
192.168.1.13 worker-node3
...
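Because the hosts file has to be edited on every node, it can help to add the entries with a script that is safe to run more than once. This is a sketch under assumptions: `add_cluster_hosts` is a hypothetical name, and the addresses are the static IPs assigned to the VMs earlier in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the cluster entries to a hosts file only
# if they are not already present.
add_cluster_hosts() {
    local hosts_file="${1:-/etc/hosts}"
    local entry
    while read -r entry; do
        # -x matches the whole line, -F treats the entry as a literal string
        grep -qxF "$entry" "$hosts_file" || echo "$entry" >> "$hosts_file"
    done <<EOF
192.168.1.10 master-node
192.168.1.11 worker-node1
192.168.1.12 worker-node2
192.168.1.13 worker-node3
EOF
}

# Try it on a scratch file first to see what would be added.
scratch="$(mktemp)"
add_cluster_hosts "$scratch"
cat "$scratch"
rm -f "$scratch"
```

Writing to the real /etc/hosts requires root, so on a node the function would have to run in a root shell.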

Next, install OpenJDK 8 and pip for Python 3.

$ sudo apt-get update
$ sudo apt-get install openjdk-8-jdk
$ sudo apt install python3-pip

Set the hostname on each worker node (worker-node1, worker-node2, worker-node3).

$ sudo vi /etc/hostname
worker-node1
$ sudo hostname -F /etc/hostname # all nodes
$ sudo reboot

Uncomment the PubkeyAuthentication line in the SSH daemon config.

$ sudo vi /etc/ssh/sshd_config
...
PubkeyAuthentication yes
...

The Hadoop master node needs passwordless SSH access to all worker nodes. Generate a key pair on every node.

$ mkdir -p ~/.ssh
$ chmod 700 ~/.ssh
$ ssh-keygen -t rsa -P ""
# Press Enter to accept the default key path

Copy the master node's ssh key to the worker nodes.

# master node only
$ ssh-copy-id -i ~/.ssh/id_rsa.pub master-node
$ ssh-copy-id -i ~/.ssh/id_rsa.pub worker-node1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub worker-node2
$ ssh-copy-id -i ~/.ssh/id_rsa.pub worker-node3

# Check if ssh is connected properly
$ ssh worker-node1
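Logging in to one worker by hand confirms the key for that node only. A small loop can check every node at once. This is a sketch, not part of the original guide: `check_passwordless_ssh` is a hypothetical name, and it relies on the standard OpenSSH options `BatchMode` and `ConnectTimeout`.

```shell
#!/usr/bin/env bash
# Hypothetical check: confirm key-based login works for every node.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_passwordless_ssh() {
    local node failed=0
    for node in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true; then
            echo "OK   $node"
        else
            echo "FAIL $node"
            failed=1
        fi
    done
    # Non-zero exit status if any node could not be reached without a password
    return "$failed"
}
```

On the master node it would be called as `check_passwordless_ssh master-node worker-node1 worker-node2 worker-node3`. Any line starting with FAIL means ssh-copy-id should be repeated for that host.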

2. Install Hadoop

Install Hadoop on all nodes.

$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
$ tar -xvf hadoop-3.3.6.tar.gz
$ mv hadoop-3.3.6 hadoop

List the worker and master hostnames in the workers and masters files.

$ vi hadoop/etc/hadoop/workers
# Workers File
worker-node1
worker-node2
worker-node3
# :wq!

$ vi hadoop/etc/hadoop/masters
# Masters File
master-node
# :wq!

Create the data directories.

Caution: Create the namenode directory only on the master node.

# master node
$ mkdir data
$ cd data
$ mkdir namenode datanode tmp userlogs

# worker nodes
$ mkdir data
$ cd data
$ mkdir datanode tmp userlogs
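The two directory layouts differ only in the namenode directory, so they can be captured in one function that takes the node role. This is a sketch: `make_data_dirs` and its arguments are hypothetical names, and the default base path assumes the data directory lives in the home directory as in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the data directories for a node role.
# Only the master gets a namenode directory.
make_data_dirs() {
    local role="$1"                  # "master" or "worker"
    local base="${2:-$HOME/data}"    # assumed location of the data directory
    # mkdir -p creates missing parents and does not fail if the
    # directories already exist, so the function can be re-run safely.
    mkdir -p "$base/datanode" "$base/tmp" "$base/userlogs"
    if [ "$role" = "master" ]; then
        mkdir -p "$base/namenode"
    fi
}
```

On the master node this would be `make_data_dirs master`, and on each worker `make_data_dirs worker`.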

Specify the path at the end of the ~/.bashrc file.

$ vi ~/.bashrc

# .bashrc
...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/master-node/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
# :wq!

$ source ~/.bashrc

Set JAVA_HOME in hadoop-env.sh.

$ vi hadoop/etc/hadoop/hadoop-env.sh

# hadoop-env.sh
...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# :wq!

Configure core-site.xml, yarn-site.xml, and hdfs-site.xml.

  • core-site.xml
<!-- same on all nodes -->
...
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master-node:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/master-node/data/tmp</value>
    </property>
</configuration>
  • yarn-site.xml
<!-- master node -->
...
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master-node</value>
    </property>
</configuration>
<!-- worker nodes -->
...
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
  • hdfs-site.xml
<!-- master node -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/master-node/data/namenode</value>
    </property>
</configuration>
<!-- worker nodes -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/master-node/data/datanode</value>
    </property>
</configuration>
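With several XML files spread over four machines, it is easy for one node to end up with a different value. The sketch below reads a single property from a site file so the values can be compared across nodes. `get_hadoop_property` is a hypothetical helper, and it assumes each `<name>` and `<value>` element sits on one line, as in the examples above. It is not a general XML parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one property from a Hadoop
# *-site.xml file. Assumes <name> and <value> each fit on a single line.
get_hadoop_property() {
    local file="$1" name="$2"
    awk -v want="$name" '
        /<name>/ {
            line = $0
            sub(/.*<name>/, "", line); sub(/<\/name>.*/, "", line)
            found = (line == want)      # remember whether this is our property
        }
        found && /<value>/ {
            line = $0
            sub(/.*<value>/, "", line); sub(/<\/value>.*/, "", line)
            print line
            exit
        }
    ' "$file"
}
```

For example, `get_hadoop_property hadoop/etc/hadoop/core-site.xml fs.defaultFS` should print the same address on every node. Once Hadoop is on the PATH, `hdfs getconf -confKey fs.defaultFS` reports the value Hadoop itself resolves.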

3. Test HDFS

Finally the last step!!!

Format the NameNode and start the cluster.

# master node
$ hdfs namenode -format
$ start-all.sh

# Check if this works!
$ jps

# If it works correctly, you will see output like the following:
1380 SecondaryNameNode
2391 Jps
1480 ResourceManager
1840 NameNode

# On the worker nodes, jps should show:
4801 DataNode
4248 Jps
3059 NodeManager
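Reading the jps output by eye works for four nodes, but the check can also be scripted. The following sketch reads jps output on standard input and reports any expected daemon that is missing. `check_daemons` is a hypothetical name, and the daemon lists passed to it are the ones shown in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical check: read `jps` output on stdin and report any expected
# daemon that is not running.
check_daemons() {
    local listing daemon missing=0
    listing="$(cat)"
    for daemon in "$@"; do
        # jps prints "<pid> <ClassName>"; anchoring both ends keeps
        # "NameNode" from matching "SecondaryNameNode".
        if ! printf '%s\n' "$listing" | grep -qE "^[0-9]+ ${daemon}\$"; then
            echo "missing: $daemon"
            missing=1
        fi
    done
    return "$missing"
}
```

On the master node this could be run as `jps | check_daemons NameNode SecondaryNameNode ResourceManager`, and on a worker as `jps | check_daemons DataNode NodeManager`. No output and a zero exit status mean every listed daemon was found.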

Finally, we can use Hadoop!! 🔥🔥🔥🔥🔥

$ hdfs dfs -mkdir /test
$ hdfs dfs -ls /
# Check that the /test directory appears in the output

The next step is to install Spark!

4. Install SPARK

Download Spark.

This time, download it on the master node only.

$ wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
$ tar -xvf spark-3.5.0-bin-hadoop3.tgz
$ mv spark-3.5.0-bin-hadoop3 spark

Edit the ~/.bashrc file.

...
export SPARK_HOME=/home/master/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply the changes.

$ source ~/.bashrc

Configure the Spark cluster.

$ cd $SPARK_HOME/conf
$ cp spark-env.sh.template spark-env.sh
$ vi spark-env.sh

Edit the spark/conf/spark-env.sh file.

...
export SPARK_MASTER_HOST=master-node
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_INSTANCES=1

export JAVA_HOME=${JAVA_HOME}
export HADOOP_HOME=${HADOOP_HOME}
export YARN_CONF_DIR=${YARN_CONF_DIR}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"${HADOOP_HOME}/etc/hadoop"}

Worker settings

$ vi ~/spark/conf/slaves

Create a new file named slaves with the following contents:

worker-node1
worker-node2
worker-node3

Now copy the spark directory to each worker node.

$ scp -r spark master@worker-node1:/home/master/
$ scp -r spark master@worker-node2:/home/master/
$ scp -r spark master@worker-node3:/home/master/
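The three scp commands can also be written as a loop over the worker list, so that adding a worker later only means editing one file. This is a sketch under assumptions: `copy_spark_to_workers` is a hypothetical name, the worker list is a file with one hostname per line, and the same user name exists on every node, so no `user@` prefix is given.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a directory to every host listed in a file.
# Assumes one hostname per line and the same user name on every node.
copy_spark_to_workers() {
    local workers_file="$1" src="$2" dest="$3" host
    while read -r host; do
        # Skip blank lines and comment lines
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the copy cannot consume the rest of the host list
        scp -r "$src" "$host:$dest" < /dev/null
    done < "$workers_file"
}
```

With the layout used in this guide, the call would be `copy_spark_to_workers ~/spark/conf/slaves ~/spark /home/master/`.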

Now let’s start the spark cluster.

$ ~/spark/sbin/start-all.sh
$ ~/spark/sbin/start-history-server.sh

To stop the Spark cluster:

$ ~/spark/sbin/stop-all.sh
$ ~/spark/sbin/stop-history-server.sh

5. Error