Assignment A3 for CS6240

Fall 2017

Shreysa Sharma

### Directory Structure

src - Contains KNeighborhood.java, the implementation of the MapReduce job.

README.md - Description of the implementation and how to run

Makefile - Makefile to assist in building, running and managing the project.

report.Rmd - Report file in R-Markdown format.

report.html - HTML rendering of the above report.

input/books - Contains the input files that were provided for assignment A0 (taken from the assignment task page).

input/big-corpus - An empty folder where the reviewer can place the big-corpus files.

observations - Contains CSV files with the timings of the MapReduce runs and the log files of those runs.

### Instructions for building and running the program

  1. Clone the repository on your system.
  2. Run make gunzip.
  3. If the S3 bucket is not already present, issue: make make-bucket
  4. Open the Makefile and provide the details below (an illustrative, filled-in example follows this block):

```
HADOOP_PATH=
MY_CLASSPATH=
jar.name=MainA2.jar
jar.path=target/$(jar.name)
local.input=input/books
local.logs=
local.output=
job.name=KNeighborhood
aws.emr.release=
aws.region=
aws.bucket.name=
aws.subnet.id=
aws.input=
aws.output=
aws.log.dir=
aws.num.nodes=
aws.instance.type=
```
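The sketch below shows what a filled-in configuration might look like. Every value is a hypothetical placeholder: the Hadoop paths, EMR release, region, subnet ID, node count, and instance type are assumptions, not the settings used for the runs reported in observations/; only the jar, input, job-name, and a3-emr bucket values come from this README.

```make
# Illustrative values only -- substitute the details of your own environment
# and AWS account. Nothing here reflects the actual cluster configuration.
HADOOP_PATH=/usr/local/hadoop/bin/hadoop
MY_CLASSPATH=/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/mapreduce/*
jar.name=MainA2.jar
jar.path=target/$(jar.name)
local.input=input/books
local.logs=logs
local.output=output
job.name=KNeighborhood
aws.emr.release=emr-5.8.0
aws.region=us-east-1
aws.bucket.name=a3-emr
aws.subnet.id=subnet-0123abcd
aws.input=input_big
aws.output=output
aws.log.dir=logs
aws.num.nodes=5
aws.instance.type=m4.large
```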

  5. To upload data to the S3 input directory: make upload-input-aws
  6. On your local Hadoop setup, issue: make clean, then make build
  7. To upload the application to S3: make upload-app-aws
  8. To create and launch the EMR cluster: make cloud
  9. To download the output data from S3: make download-output-aws
  10. To download the logs: make download-logs
  11. If the EMR launch does not happen, ssh to your EMR cluster, run make clean and make build, and upload the data to the cluster master using scp or a manual upload.
  12. If the data is in S3, copy it to the Hadoop file system using: hadoop distcp s3://a3-emr/input_big/* input_big (here a3-emr is my bucket name and input_big is the folder where I have the big corpus files).
  13. If the data was copied using scp, put it on Hadoop HDFS.
  14. Then run the job with: make run INPUT=<input path> OUTPUT=<output path> NEIGHBORS=<KVALUE> (see the example sequence after this list).
  15. Collect and analyze the results.
  16. The analysis of the time taken by the MapReduce job that I executed on the big dataset provided for assignment A2 is in observations/aws_emr_data.csv.
  17. make clean removes all the *.class files and the target folder, which is useful for a clean build.
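For reference, a typical end-to-end invocation might look like the sketch below. All of the make targets are the ones listed in the steps above; the input/output paths and the K value of 10 are hypothetical examples, not the parameters used for the reported runs.

```sh
# Build locally and push everything to S3 (targets from the steps above)
make clean
make build
make upload-input-aws        # upload input data to the S3 input directory
make upload-app-aws          # upload the application jar to S3
make cloud                   # create and launch the EMR cluster

# After the job finishes, fetch results and logs from S3
make download-output-aws
make download-logs

# Alternative: running on the cluster master (hypothetical paths and K value)
hadoop distcp s3://a3-emr/input_big/* input_big
make run INPUT=input_big OUTPUT=output_big NEIGHBORS=10
```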

### System requirements

  1. make
  2. pandoc
  3. RMarkdown
  4. ggplot