oscon-bigtop

Resources for the demo presented during the Apache Bigtop talk at OSCON 2013

Introduction

This repository contains data and code for use in a beginner's introduction to Apache Bigtop. The instructions are adapted from the Bigtop wiki page for 0.6.0.

File list and description

  • README.md: This file, with instructions for your use
  • demo-setup.sh: A script I used to provision a single node for use in my demo. It installs some essential tools and software, grabs a well-known publicly available dataset, and loads it into a relational DB. Note that this script was run on an Ubuntu Lucid 64-bit machine; it should work on other Ubuntu/Debian variants as well, and could easily be ported to RPM-based systems. I haven't had a chance to do that, but if you do, please send me a pull request, thanks! Also note that Ubuntu doesn't come with JDK 6 by default, so the script installs Oracle JDK 6 as well. By the time this script is done, your machine is ready for installing Bigtop as per the instructions below.
  • median_income_by_zipcode_census_2000.zip: A household income dataset from the 2000 United States Census, used in the demo (a quick way to peek at it is sketched below).
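
If you want a quick look at the dataset before it goes into Hive, unzip it and peek at the CSV. A minimal sketch, assuming you have the zip in your current directory and unzip installed (the setup script appears to unpack it under workspace/dataset/, which is the path the Hive LOAD step below uses):

# Extract and inspect the raw census CSV (the file name is the one referenced later in the Hive LOAD step)
unzip median_income_by_zipcode_census_2000.zip
head -5 DEC_00_SF3_P077_with_ann.csv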

Demo VM setup

Let's prepare a VM for your setup. You can use any OS for your VM; the setup script demo-setup.sh has only been run and tested on Lucid. Also, I use Vagrant for all my VM needs. Vagrant builds on top of VirtualBox, but you are welcome to use any VM hypervisor software of your choice. All you need is a vanilla Linux install at the end of the day :-)
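
If you use Vagrant too, bringing up a Lucid box looks roughly like this (a sketch; "lucid64" is just a placeholder for whatever Lucid 64-bit box you have):

# Assumes a Lucid 64-bit box has already been added (e.g. via "vagrant box add") under the name lucid64
vagrant init lucid64
vagrant up
vagrant ssh

Log in to the VM and run the following commands: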


wget https://raw.github.com/markgrover/oscon-bigtop/master/demo-setup.sh
chmod 755 ./demo-setup.sh
./demo-setup.sh

This setup may take a while, please be patient!

Instructions

Adapted from the Bigtop wiki page

  • Add Bigtop key so you can use the Bigtop artifacts with apt-get

wget -O- http://archive.apache.org/dist/bigtop/bigtop-0.6.0/repos/GPG-KEY-bigtop | sudo apt-key add -

  • Add the Bigtop list so apt-get knows where to find the Bigtop artifacts

sudo wget -O /etc/apt/sources.list.d/bigtop-0.6.0.list http://archive.apache.org/dist/bigtop/bigtop-0.6.0/repos/`lsb_release --codename --short`/bigtop.list

  • Update apt-get so it sees our newly added Bigtop repository

sudo apt-get update

  • Install our pseudo-distributed Hadoop package from Apache Bigtop

sudo apt-get install hadoop-conf-pseudo
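
Just as a sanity check, you can list what the pseudo-distributed package pulled in (the exact package set depends on the Bigtop release):

dpkg -l | grep hadoop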

  • Initialize the namenode (this needs to be done only once). Don't run it again later; it will format (i.e. wipe out) the data in your cluster (i.e. the data on HDFS).

sudo service hadoop-hdfs-namenode init

  • Start the namenode and datanode

sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-datanode start

  • Initialize HDFS. This creates a bunch of directories on HDFS that are required for YARN to run

sudo /usr/lib/hadoop/libexec/init-hdfs.sh
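
If you're curious what init-hdfs.sh created, list the HDFS root; you should see a set of standard directories such as /tmp and /user (the exact set depends on the Bigtop version):

sudo -u hdfs hadoop fs -ls /
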
  • Restart the YARN daemons. YARN needs the directories created in the previous step to work properly, so now that they exist, let's restart the daemons.

sudo service hadoop-yarn-resourcemanager restart
sudo service hadoop-yarn-nodemanager restart

  • Time to run our first MapReduce job. It runs using MapReduce v2 (MRv2) on top of YARN.

sudo -u root hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples*.jar pi 10 1000

  • If you want to install more artifacts, all you do is run a simple apt-get command

sudo apt-get install hive

  • There is a bug in Bigtop that requires correcting the permissions on one of the HDFS directories before running Hive. Let's do that first.

sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
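
To double-check that the change took effect, list /tmp on HDFS and make sure its mode shows the sticky bit (drwxrwxrwt):

sudo -u hdfs hadoop fs -ls / | grep tmp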

  • Let's create a table in Hive. On bash, type:

hive -e "CREATE TABLE zipcode_incomes(id STRING, zip STRING, description1 STRING, description2 STRING, income INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"

  • Load the data into Hive. On bash, type:

cd ~
hive -e "LOAD DATA LOCAL INPATH 'workspace/dataset/DEC_00_SF3_P077_with_ann.csv' OVERWRITE INTO TABLE zipcode_incomes"

  • Run Hive queries!
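
For example (the column names come from the CREATE TABLE statement above; adjust the queries to taste):

hive -e "SELECT COUNT(*) FROM zipcode_incomes"
hive -e "SELECT zip, income FROM zipcode_incomes ORDER BY income DESC LIMIT 10"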

To clean up (only if necessary)


sudo rm -rf /var/lib/hadoop-hdfs/cache/*
sudo service hadoop-hdfs-datanode stop
sudo service hadoop-hdfs-namenode stop
sudo service hadoop-hdfs-namenode init