Assignment A3 for CS6240

Fall 2017

Shreysa Sharma

### Directory Structure

src - Contains KNeighborhood.java, the implementation of the MapReduce job.

README.md - Description of the implementation and how to run

Makefile - Makefile to assist in building, running and managing the project.

report.Rmd - Report file in R-Markdown format.

report.html - HTML rendering of the above report.

input/books - Contains the input files that were provided for assignment A0 (taken from the assignment task page).

input/big-corpus - An empty folder where the reviewer can place the big-corpus files.

observations - Contains CSV files with the timings of the MapReduce runs and the log files of those runs.

### Instructions for building and running the program

  1. Clone the repository on your system.
  2. Run make gunzip.
  3. If the S3 bucket is not already present, issue: make make-bucket
  4. Open the Makefile and provide the details below (an illustrative, filled-in example follows this block):

```
HADOOP_PATH=
MY_CLASSPATH=
jar.name=MainA2.jar
jar.path=target/$(jar.name)
local.input=input/books
local.logs=
local.output=
job.name=KNeighborhood
aws.emr.release=
aws.region=
aws.bucket.name=
aws.subnet.id=
aws.input=
aws.output=
aws.log.dir=
aws.num.nodes=
aws.instance.type=
```
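The sketch below shows what a filled-in configuration might look like. Every value is a hypothetical placeholder: the Hadoop paths, EMR release, region, subnet ID, node count, and instance type are assumptions, not the settings used for the runs reported in observations/; only the jar, input, job-name, and a3-emr bucket values come from this README.

```make
# Illustrative values only -- substitute the details of your own environment
# and AWS account. Nothing here reflects the actual cluster configuration.
HADOOP_PATH=/usr/local/hadoop/bin/hadoop
MY_CLASSPATH=/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/mapreduce/*
jar.name=MainA2.jar
jar.path=target/$(jar.name)
local.input=input/books
local.logs=logs
local.output=output
job.name=KNeighborhood
aws.emr.release=emr-5.8.0
aws.region=us-east-1
aws.bucket.name=a3-emr
aws.subnet.id=subnet-0123abcd
aws.input=input_big
aws.output=output
aws.log.dir=logs
aws.num.nodes=5
aws.instance.type=m4.large
```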

  5. To upload data to the S3 input directory: make upload-input-aws
  6. On your local Hadoop setup, issue: make clean, then make build
  7. To upload the application to S3: make upload-app-aws
  8. To create and launch the EMR cluster: make cloud
  9. To download the output data from S3: make download-output-aws
  10. To download the logs: make download-logs
  11. If the EMR launch does not happen, ssh to your EMR cluster, run make clean and make build, and upload the data to the cluster master using scp or a manual upload.
  12. If the data is in S3, copy it to the Hadoop file system using: hadoop distcp s3://a3-emr/input_big/* input_big (here a3-emr is my bucket name and input_big is the folder where I have the big corpus files).
  13. If the data was copied using scp, put it on Hadoop HDFS.
  14. Then run the job with: make run INPUT=<input path> OUTPUT=<output path> NEIGHBORS=<KVALUE> (see the example sequence after this list).
  15. Collect and analyze the results.
  16. The analysis of the time taken by the MapReduce job that I executed on the big dataset provided for assignment A2 is in observations/aws_emr_data.csv.
  17. make clean removes all the *.class files and the target folder, which is useful for a clean build.
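For reference, a typical end-to-end invocation might look like the sketch below. All of the make targets are the ones listed in the steps above; the input/output paths and the K value of 10 are hypothetical examples, not the parameters used for the reported runs.

```sh
# Build locally and push everything to S3 (targets from the steps above)
make clean
make build
make upload-input-aws        # upload input data to the S3 input directory
make upload-app-aws          # upload the application jar to S3
make cloud                   # create and launch the EMR cluster

# After the job finishes, fetch results and logs from S3
make download-output-aws
make download-logs

# Alternative: running on the cluster master (hypothetical paths and K value)
hadoop distcp s3://a3-emr/input_big/* input_big
make run INPUT=input_big OUTPUT=output_big NEIGHBORS=10
```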

### System requirements

  1. make
  2. pandoc
  3. RMarkdown
  4. ggplot