### Directory Structure
src
- Contains Kneighborhood.java, which holds the implementation.
README.md
- Description of the implementation and how to run it.
Makefile
- Makefile to assist in building, running, and managing the project.
report.Rmd
- Report file in R Markdown format.
report.html
- HTML rendering of the above report.
input/books
- Contains the input files that were provided for assignment A0 (taken from the assignment task page).
input/big-corpus
- An empty folder where the reviewer can place their big-corpus files.
observations
- Contains CSV files with the timings of MapReduce runs, plus log files of those runs.
### Instructions for building and running the program
- Clone the repository on your system.
- Run `make gunzip`.
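The `gunzip` target is not shown in this README; as a rough sketch (assuming the provided book files ship as `.gz` archives under `local.input`), it might look like:

```makefile
# Hypothetical sketch of the gunzip target; the actual Makefile may differ.
gunzip:
	find $(local.input) -name '*.gz' -exec gunzip {} +
```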
- If the S3 bucket is not already present, then issue:
  make make-bucket
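The `make-bucket` target itself is not reproduced here; a plausible sketch (assuming the AWS CLI is installed and `aws.bucket.name` / `aws.region` are set as described below) could be:

```makefile
# Hypothetical sketch of the make-bucket target; the real Makefile may differ.
make-bucket:
	aws s3 mb s3://$(aws.bucket.name) --region $(aws.region)
```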
- Open the Makefile and provide the below details:

      HADOOP_PATH=
      MY_CLASSPATH=
      jar.name=MainA2.jar
      jar.path=target/$(jar.name)
      local.input=input/books
      local.logs=
      local.output=
      job.name=KNeighborhood
      aws.emr.release=
      aws.region=
      aws.bucket.name=
      aws.subnet.id=
      aws.input=
      aws.output=
      aws.log.dir=
      aws.num.nodes=
      aws.instance.type=
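As an illustration only, a filled-in configuration might look like the following. Every value here is a placeholder or an assumption (the bucket name `a3-emr` is the one mentioned later in this README); substitute your own paths, region, subnet, and instance details:

```makefile
# Example values only -- replace with your own environment details.
HADOOP_PATH=/usr/local/hadoop
MY_CLASSPATH=$(HADOOP_PATH)/share/hadoop/common/*
jar.name=MainA2.jar
jar.path=target/$(jar.name)
local.input=input/books
local.logs=logs
local.output=output
job.name=KNeighborhood
aws.emr.release=emr-5.36.0
aws.region=us-east-1
aws.bucket.name=a3-emr
aws.subnet.id=subnet-0123456789abcdef0
aws.input=input_big
aws.output=output
aws.log.dir=log
aws.num.nodes=5
aws.instance.type=m4.large
```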
- To upload data to the S3 input dir:
  make upload-input-aws
- On your local Hadoop setup, issue the commands:
make clean
make build
- To upload the application to S3:
  make upload-app-aws
- To create and launch the EMR cluster:
make cloud
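The `cloud` target presumably wraps `aws emr create-cluster`. A hedged sketch of what such a target could look like, using the variables configured above (the real Makefile, roles, and step arguments may differ):

```makefile
# Hypothetical sketch of an EMR launch target; all wiring here is assumed.
cloud:
	aws emr create-cluster \
		--name "$(job.name)" \
		--release-label $(aws.emr.release) \
		--applications Name=Hadoop \
		--instance-type $(aws.instance.type) \
		--instance-count $(aws.num.nodes) \
		--log-uri s3://$(aws.bucket.name)/$(aws.log.dir) \
		--ec2-attributes SubnetId=$(aws.subnet.id) \
		--use-default-roles \
		--auto-terminate \
		--steps Type=CUSTOM_JAR,Jar=s3://$(aws.bucket.name)/$(jar.name),Args=[s3://$(aws.bucket.name)/$(aws.input),s3://$(aws.bucket.name)/$(aws.output)]
```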
- To download the output data from S3:
  make download-output-aws
- To download the logs:
make download-logs
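The four S3 transfer targets above are not listed in this README; hedged sketches of what they might contain, assuming the AWS CLI and the variables configured earlier (the actual Makefile may differ):

```makefile
# Hypothetical sketches of the S3 transfer targets.
upload-input-aws:
	aws s3 sync $(local.input) s3://$(aws.bucket.name)/$(aws.input)

upload-app-aws:
	aws s3 cp $(jar.path) s3://$(aws.bucket.name)/$(jar.name)

download-output-aws:
	aws s3 sync s3://$(aws.bucket.name)/$(aws.output) $(local.output)

download-logs:
	aws s3 sync s3://$(aws.bucket.name)/$(aws.log.dir) $(local.logs)
```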
- If the EMR launch does not happen, then SSH to your EMR cluster and run:
  make clean
  make build
- Upload the data to the cluster master using scp or a manual upload.
- If the data is already in S3, copy it into the Hadoop filesystem using:
  hadoop distcp s3://a3-emr/input_big/* input_big
  (here a3-emr is my bucket name and input_big is the folder where I have the big corpus files)
- If the data was copied using scp, put it on Hadoop HDFS instead.
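For the scp route, the transfer and HDFS upload might look like the following. The key file, master hostname, and paths are placeholders, not values from this project:

```shell
# Placeholders throughout: substitute your key file, master DNS name, and paths.
scp -i my-key.pem -r input_big/ hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/

# Then, on the cluster master, put the files onto HDFS:
hadoop fs -mkdir -p input_big
hadoop fs -put input_big/* input_big/
```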
- Then run the job with:
make run INPUT=<input path> OUTPUT=<output path> NEIGHBORS=<KVALUE>
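Under the hood, the `run` target presumably expands to a `hadoop jar` invocation; a sketch under that assumption (the main-class name and argument order here are guesses, not taken from the actual Makefile or driver):

```shell
# Hypothetical expansion of `make run`; class name and argument order are assumed.
hadoop jar target/MainA2.jar KNeighborhood <input path> <output path> <KVALUE>
```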
- Collect the results and analyze them.
- The analysis of the time taken by the MapReduce job that I executed on the big dataset provided for assignment A2 is placed in observations/aws_emr_data.csv.
- make clean removes all the *.class files and the target folder, which is useful for a clean build.
### Tools used
- make
- pandoc
- R Markdown
- ggplot