r-hadoop
(Jongwook Woo)
========

R-Hadoop example
Requirements: whirr-0.8.1
To launch Whirr on AWS, see:
a. http://dal-cloudcomputing.blogspot.com/2013/04/run-cloudera-hadoop-cluster-using-whirr.html
b. http://dal-cloudcomputing.blogspot.com/2011/06/how-to-set-up-hadoop-and-hbase-together.html

How to run the R-Hadoop example:
1. Download this r-hadoop repository and Jeffrey Breen's "R Hadoop Whirr" tutorial using git, then set $R_HADOOP_HOME to the r-hadoop directory.
$ git clone https://github.com/hipic/r-hadoop.git
$ git clone https://github.com/jeffreybreen/tutorial-201209-TDWI-big-data.git
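
Step 1 asks for $R_HADOOP_HOME to be set but does not show how. The following is a minimal sketch, assuming both repositories were cloned into the current directory and that Whirr 0.8.1 is unpacked under your home directory. The helper name check_home and the default Whirr path are hypothetical, so adjust them to your setup.

```shell
# Assumption: both repositories were cloned into the current directory.
export R_HADOOP_HOME="$PWD/r-hadoop"
# Assumption: Whirr 0.8.1 is unpacked under $HOME; adjust this hypothetical path.
export WHIRR_HOME="${WHIRR_HOME:-$HOME/whirr-0.8.1}"

# check_home NAME DIR: report whether DIR exists (hypothetical helper).
check_home() {
  if [ -d "$2" ]; then
    echo "$1 ok: $2"
  else
    echo "$1 missing: $2"
    return 1
  fi
}

# Warn about missing directories without aborting the shell session.
check_home R_HADOOP_HOME "$R_HADOOP_HOME" || true
check_home WHIRR_HOME "$WHIRR_HOME" || true
```

Both variables are used by the later steps, so it may help to put the two export lines in your shell profile.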

2. Copy the Hadoop properties file to the Whirr recipes directory:
$ cp $R_HADOOP_HOME/config/hadoop-ec2.properties $WHIRR_HOME/recipes/

3. Launch Cloudera Hadoop (CDH) using Whirr with the file "hadoop-ec2.properties", assuming the private key file is "id_rsa_whirr":
$ $WHIRR_HOME/bin/whirr launch-cluster --config $WHIRR_HOME/recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr

4. From $R_HADOOP_HOME/config, run the install script through Whirr to install R and its related packages on all EC2 nodes:
$ $WHIRR_HOME/bin/whirr run-script --script install-r-CDH4.sh --config $WHIRR_HOME/recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr

If successful, you will see the following output, and rmr2 will be installed at /usr/lib64/R/library/:
** building package indices
** testing if installed package can be loaded

* DONE (rmr2)
Making packages.html  ... done


5. On the EC2 main node, set up a test directory for running R with Hadoop:
$ mkdir rtest
$ cd rtest/
$ git clone https://github.com/hipic/r-hadoop.git

6. Download the airline schedule data, keeping only the first 1000 lines of each file:
~/rtest$ curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 | bzcat | head -1000 > air-2008.csv

~/rtest$ curl http://stat-computing.org/dataexpo/2009/2004.csv.bz2 | bzcat | head -1000 > 2004-1000.csv
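
Before uploading the samples, it can be worth confirming that each download succeeded. This is a small sketch of such a check. The function name sample_info is hypothetical, and the check only counts lines and header fields. If a file begins with a header row, the 1000 lines hold 999 data rows.

```shell
# sample_info FILE: print the line count and the number of
# comma-separated fields on the first line (hypothetical helper).
sample_info() {
  lines=$(wc -l < "$1" | tr -d ' ')
  cols=$(head -1 "$1" | awk -F',' '{ print NF }')
  echo "$1: $lines lines, $cols columns"
}

# Check the two sample files created in step 6, if they exist.
for f in air-2008.csv 2004-1000.csv; do
  if [ -f "$f" ]; then
    sample_info "$f"
  else
    echo "$f not found"
  fi
done
```

A line count well below 1000 usually means the download or the decompression was cut short.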

7. Make HDFS directories:
~/rtest$ hadoop fs -mkdir airline
~/rtest$ hadoop fs -ls /user
Found 2 items
drwxr-xr-x   - hdfs     supergroup          0 2013-04-20 20:32 /user/hive
drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook
~/rtest$ hadoop fs -ls /user/jongwook
Found 1 items
drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook/airline
~/rtest$ hadoop fs -mkdir airline/data
~/rtest$ hadoop fs -ls airline
Found 1 items
drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 airline/data
~/rtest$ hadoop fs -mkdir airline/out
~/rtest$ hadoop fs -put  air-2008.csv airline/data/
~/rtest$ hadoop fs -mkdir airline/data04
~/rtest$ hadoop fs -put  2004-1000.csv airline/data04/
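
The directory and upload commands in step 7 can be collected into one function, sketched below. The variable HADOOP_CMD and the function name setup_airline_dirs are assumptions made for illustration. Setting HADOOP_CMD to "echo hadoop" prints the commands instead of running them, which allows a dry run on a machine without Hadoop.

```shell
# HADOOP_CMD is a hypothetical override; use "echo hadoop" for a dry run.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# Create the HDFS directories from step 7 and upload both samples.
# Stops at the first command that fails.
setup_airline_dirs() {
  for d in airline airline/data airline/out airline/data04; do
    $HADOOP_CMD fs -mkdir "$d" || return 1
  done
  $HADOOP_CMD fs -put air-2008.csv airline/data/ || return 1
  $HADOOP_CMD fs -put 2004-1000.csv airline/data04/
}

# Example dry run (prints six commands, runs nothing on HDFS):
#   HADOOP_CMD="echo hadoop"; setup_airline_dirs
```

The function is only defined here, not called, so that nothing touches HDFS until you invoke it from ~/rtest.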

8. Go to the example R code that uses rmr:
~/rtest$ cd r-hadoop/rmr

9. Run the R example code; the result is stored in the "dept-delay-month-orig/" directory:
$ ./dept-delay.R

You should then see output similar to the following:

13/04/24 01:24:10 INFO streaming.StreamJob: Output: dept-delay-month-orig
$key
 [1]    1    1    1    1    2    2    2    2 2004 2004 2004 2004

$val
 [1]    1 2497    1   NA    2    3    2   90 2004  500 2004   NA


10. Check the content of the HDFS output file:
$ hadoop fs -cat dept-delay-month-orig/part-00001 | tail

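Then you may see the following. The output is a binary SequenceFile of typed bytes, so only the header is readable; the binary bytes that follow it are omitted here:

SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable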
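
Because the part file is binary, a quick check of its first bytes can confirm what kind of file it is before you try to read it as text. The sketch below tests for the "SEQ" marker that appears at the start of the output above. The function name is_sequence_file is hypothetical, and the HDFS query only runs when a hadoop client is on the PATH.

```shell
# is_sequence_file: read stdin and succeed if it starts with the bytes "SEQ".
is_sequence_file() {
  [ "$(head -c 3)" = "SEQ" ]
}

# Only query HDFS when a hadoop client is actually on the PATH.
if command -v hadoop >/dev/null 2>&1; then
  if hadoop fs -cat dept-delay-month-orig/part-00001 | is_sequence_file; then
    echo "part-00001 is a binary SequenceFile"
  else
    echo "part-00001 does not start with a SequenceFile header"
  fi
fi
```

To see the actual keys and values, it is probably easier to read the directory back into R, for example with rmr2's from.dfs(), than to inspect the raw bytes.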

11. You can also run the word count example right afterwards:
$ ./wordcount.R

References
1. http://jeffreybreen.wordpress.com/2012/03/10/big-data-step-by-step-slides/
2. http://www.rstudio.com/ide/download/server
3. https://github.com/jseidman
4. http://www.slideshare.net/jseidman/distributed-data-analysis-with-r-strangeloop-2011
5. http://dal-cloudcomputing.blogspot.com/2013/04/run-cloudera-hadoop-cluster-using-whirr.html
6. http://dal-cloudcomputing.blogspot.com/2011/06/how-to-set-up-hadoop-and-hbase-together.html