Install Hadoop and set it up in pseudo-distributed mode, as described here. Some sample configuration files are included in the conf folder.
Download the IPL commentary datasets in YAML format from the Cricinfo website. The data needs to be pre-processed and condensed before running Map-Reduce.
In the App.java file, uncomment the second main function, which is responsible for parsing the YAML files and converting them to a form suitable for Map-Reduce, and comment out the Map-Reduce main function. Then run mvn package from the root folder of the project.
This will generate a jar-with-dependencies.jar in the target folder. This is an executable jar, which will parse the YAML files and generate the CSV.
Run the program using java -jar target/average-1.0-SNAPSHOT-jar-with-dependencies.jar &lt;path to ipl dataset folder&gt; &lt;output filename&gt;. This generates the CSV file required by Map-Reduce.
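To illustrate the arithmetic the Map-Reduce job ultimately performs, here is a plain-Java sketch of a batting-average computation over CSV rows. The column layout used here (year, batsman, runs, dismissed flag) is a hypothetical assumption for illustration only, not the project's actual schema:

```java
import java.util.List;

public class AverageSketch {
    // Hypothetical CSV layout: year,batsman,runs,dismissed(0 or 1).
    // Batting average = total runs scored / number of dismissals.
    static double battingAverage(List<String> rows, String year, String batsman) {
        long runs = 0, outs = 0;
        for (String row : rows) {
            String[] f = row.split(",");
            if (f[0].equals(year) && f[1].equals(batsman)) {
                runs += Long.parseLong(f[2]);
                outs += Long.parseLong(f[3]);
            }
        }
        // If the batsman was never dismissed, report total runs as the average.
        return outs == 0 ? runs : (double) runs / outs;
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
            "2016,V Kohli,92,1",
            "2016,V Kohli,113,0",
            "2016,AB de Villiers,79,1");
        // 92 + 113 = 205 runs over 1 dismissal
        System.out.println(battingAverage(rows, "2016", "V Kohli")); // 205.0
    }
}
```

In the actual job, the mapper would emit (batsman, runs/dismissal) pairs filtered by year, and the reducer would perform the summation and division shown here.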
Start Hadoop by running start-dfs.sh and then start-yarn.sh (in a standard install, both scripts live in $HADOOP_HOME/sbin).
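Assuming a standard Hadoop install with $HADOOP_HOME set, the startup sequence looks like this (the HDFS daemons should come up before YARN):

```shell
# Start HDFS (NameNode/DataNode) first, then YARN (ResourceManager/NodeManager)
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
# Verify the daemons are running
jps
```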
Make a directory in HDFS for input files: hdfs dfs -mkdir -p /user/input/ipl
Copy the CSV file into the input folder: hdfs dfs -put &lt;path to CSV file&gt; /user/input/ipl
Create another folder to hold the output data: hdfs dfs -mkdir -p /user/ipl
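The HDFS setup steps above can be run together as:

```shell
# Create the input directory and upload the generated CSV
hdfs dfs -mkdir -p /user/input/ipl
hdfs dfs -put <path to CSV file> /user/input/ipl
# Create the directory that will hold the job output
hdfs dfs -mkdir -p /user/ipl
# Confirm the CSV landed in HDFS
hdfs dfs -ls /user/input/ipl
```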
Uncomment the Map-Reduce code in the App.java class, and comment out the parser code. Once this is done, build the code using mvn install.
Run the Map-Reduce program through: hadoop jar target/average-1.0-SNAPSHOT.jar com.pes.App /user/input/ipl /user/ipl/output 2016 "V Kohli"
Change the year and the batsman name appropriately if needed.
Once the Map-Reduce job completes, retrieve the output from HDFS by copying the output folder to a local folder: hdfs dfs -get /user/ipl/output output
The output folder contains the results of the Map-Reduce.
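By default, Hadoop writes reducer results to files named part-r-NNNNN inside the output directory, so the results can be inspected like this:

```shell
# Copy the job output out of HDFS and print the reducer results
hdfs dfs -get /user/ipl/output output
cat output/part-r-*
```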