Step By Step guide for Hadoop installation on Ubuntu 16.04.3 with MapReduce example using Streaming

  1. Download VirtualBox from: https://www.virtualbox.org/wiki/Downloads

  2. Download Ubuntu 16.04.3 (desktop version, amd64) from https://www.ubuntu.com/download/desktop or directly from http://mirror.pnl.gov/releases/xenial/ubuntu-16.04.3-desktop-amd64.iso

  3. Create a VM with the Ubuntu 16.04.3 image.

  4. After installing Ubuntu, log in to the VM and follow the instructions given in https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html . The steps below walk through that installation in detail.

  5. First we will update the system's local package index and then install Java (the default JDK). Run the commands below in the terminal.

    sudo apt-get update

    sudo apt-get install default-jdk -y

  6. Now we will install the ssh and rsync packages by running the following commands.

    sudo apt-get install ssh -y

    sudo apt-get install rsync -y

  7. Now download Hadoop 2.7.4 from http://www.apache.org/dyn/closer.cgi/hadoop/common/

  8. Change directory to ~/Downloads, or wherever you downloaded the Hadoop tar file, and extract the archive. In my case it is in Downloads, and all further instructions assume that the Hadoop tar file is in ~/Downloads.

  9. Change directory to the extracted folder.

  10. Update the JAVA_HOME variable in the etc/hadoop/hadoop-env.sh file (for example with gedit) so that it reads as shown below.

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

  11. Now you should be able to run Hadoop; check this by running the command below.

bin/hadoop

  12. Now we will update some configuration files for pseudo-distributed operation. First, edit the etc/hadoop/core-site.xml file as below.

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

  13. Similarly, update the etc/hadoop/hdfs-site.xml file as below.

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>

  14. Now we will set up passwordless ssh for Hadoop. First check whether passwordless ssh authentication is already set up by running the command below; on a new Ubuntu installation it most likely is not. If the command asks for a password, follow the next step; otherwise skip it.

ssh localhost

  15. Run the commands below to generate a key pair and authorize it for logins to localhost:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 0600 ~/.ssh/authorized_keys

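  16. Now we will format the HDFS file system and then start the NameNode and DataNode. In the single-node guide linked in step 4, these are bin/hdfs namenode -format followed by sbin/start-dfs.sh.

  17. Now we can access the web interface for the NameNode at http://localhost:50070/

  18. Now let's create some directories in the HDFS file system (bin/hdfs dfs -mkdir).

  19. Let's download one HTML page, http://hadoop.apache.org, and upload it into the HDFS file system (bin/hdfs dfs -put).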

wget http://hadoop.apache.org -O hadoop_home_page.html

Please note that the HDFS file system is not the same as the local root file system, so files must be copied between the two explicitly.

Grep example:

  1. For this example we use the hadoop-mapreduce-examples-2.7.4.jar file, which ships with Hadoop. We count the total number of occurrences of 'https' in the given files. First we run the Hadoop job, then we copy the results from HDFS to the local file system. The result shows 2 occurrences of https in the given file, which we can validate against the same page fetched with the wget command.
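As a local cross-check, the following minimal Python sketch counts regex matches across lines of text, which is roughly what the grep example computes. The function name count_matches and the sample lines are hypothetical and are not part of Hadoop; a real check would read the lines from hadoop_home_page.html instead.

```python
import re
from collections import Counter


def count_matches(lines, pattern):
    """Count every regex match across all lines, keyed by the matched text."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        for match in regex.finditer(line):
            counts[match.group(0)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines standing in for the downloaded HTML page.
    sample = [
        '<a href="https://hadoop.apache.org">https link</a>',
        '<a href="http://example.org">plain http link</a>',
        '<a href="https://www.apache.org">ASF</a>',
    ]
    # Print count first, then the matched string, tab-separated.
    for matched, count in count_matches(sample, r"https").items():
        print(f"{count}\t{matched}")
```

Because the pattern is a regular expression, every occurrence is counted, including several on the same line, so the result can differ from a per-line count.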

Wordcount example:

  1. For the wordcount example we again use the hadoop-mapreduce-examples-2.7.4.jar file. The wordcount example returns the count of each word in the given documents.
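Once the job output has been copied to the local file system, it can be inspected with a few lines of Python. The sketch below assumes that each output line has the form word, tab, count; the function names parse_counts and top_words and the sample lines are hypothetical.

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into (word, count) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, _, count = line.rpartition("\t")
        pairs.append((word, int(count)))
    return pairs


def top_words(lines, n=10):
    """Return the n most frequent words, highest count first, ties by word."""
    ranked = sorted(parse_counts(lines), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]


if __name__ == "__main__":
    # Hypothetical output lines; a real run would read them from the copied result file.
    sample = ["apache\t3\n", "hadoop\t5\n", "the\t5\n", "streaming\t1\n"]
    for word, count in top_words(sample, n=3):
        print(f"{word}\t{count}")
```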

Wordcount using Hadoop streaming (Python)

  1. Here are the mapper and reducer programs for wordcount.

  2. We run the program as below and then copy the results to the local file system.
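For reference, a streaming wordcount mapper and reducer can be sketched as below. This is a minimal single-file sketch and not necessarily the original programs: the function names map_lines and reduce_lines and the "map"/"reduce" command-line argument are assumptions made for illustration. Streaming programs read records from standard input and write tab-separated key/value records to standard output.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive records that share a word.

    Relies on the input being sorted by key, which Hadoop does between
    the map and reduce phases.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical convention: the only argument selects the stage to run.
    if len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
        stage = map_lines if sys.argv[1] == "map" else reduce_lines
        for record in stage(sys.stdin):
            print(record)
    else:
        print("usage: pass 'map' or 'reduce' and pipe records on stdin", file=sys.stderr)
```

Before submitting to Hadoop, the two stages can be tried locally by piping a text file through the map stage, then sort, then the reduce stage; sort stands in for Hadoop's shuffle.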