We first create a folder named scripts in HDFS,
hdfs dfs -mkdir /scripts
We use the following command to run the built-in wordcount example on the input file of 500 random words generated with the random word generator (already uploaded to /scripts as wordcountinput.txt),
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /scripts/wordcountinput.txt /scripts/wordcountop
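For reference, an input file of this kind can be produced with a short script along the lines of the sketch below; the actual random word generator used may differ, and the vocabulary here is purely illustrative.

#!/usr/bin/env python3
# Sketch: write 500 randomly chosen words to wordcountinput.txt.
# The vocabulary and output filename are illustrative assumptions.
import random

vocab = ["data", "hadoop", "cluster", "stream", "map", "reduce", "node", "word"]
with open("wordcountinput.txt", "w") as f:
    f.write(" ".join(random.choice(vocab) for _ in range(500)))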
We copy the job output from HDFS to the local /scripts directory using,
hdfs dfs -get /scripts/wordcountop /scripts
To view the output we use,
hdfs dfs -cat /scripts/wordcountop/part-r-00000
To execute the n-gram program, we first put the input file, mapper, and reducer into HDFS using,
hdfs dfs -put /scripts/ngraminput.txt /scripts
hdfs dfs -put /scripts/ngrammapper.py /scripts
hdfs dfs -put /scripts/ngramreducer.py /scripts
This places the files in HDFS.
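The mapper and reducer scripts are not reproduced in full here; a minimal sketch of what they might look like follows, assuming the mapper takes the n-gram size as its first command-line argument (the run command below passes 2) and the reducer is a standard streaming count reducer. The actual scripts may differ.

ngrammapper.py (sketch):
#!/usr/bin/env python3
# Emit each n-gram with a count of 1; n comes from the first argument.
import sys

n = int(sys.argv[1]) if len(sys.argv) > 1 else 2

for line in sys.stdin:
    words = line.strip().split()
    for i in range(len(words) - n + 1):
        # Streaming key/value pairs are tab-separated
        print("%s\t1" % " ".join(words[i:i + n]))

ngramreducer.py (sketch):
#!/usr/bin/env python3
# Sum counts per key; Hadoop sorts mapper output, so equal keys are contiguous.
import sys

current_key, count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    key, _, value = line.rstrip("\n").rpartition("\t")
    if key == current_key:
        count += int(value)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))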
Now we run using the following command,
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar -mapper "python3 /scripts/ngrammapper.py 2" -reducer "python3 /scripts/ngramreducer.py" -input /scripts/ngraminput.txt -output /outputngram
To view the output we use the following command,
hdfs dfs -cat /outputngram/part-00000
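Since streaming scripts simply read from stdin and write to stdout, the same pipeline can also be sanity-checked locally without Hadoop, with sort standing in for the shuffle phase,
cat /scripts/ngraminput.txt | python3 /scripts/ngrammapper.py 2 | sort | python3 /scripts/ngramreducer.py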
We copy the job output from HDFS to the local filesystem using the following command,
hdfs dfs -get /outputngram /scripts
We reuse the /scripts folder created in Part 2 and put the mapper and reducer files into HDFS using the following command,
hdfs dfs -put /scripts/* /scripts
We also put the access_log into HDFS,
hdfs dfs -put /access_log /scripts
To run the code we use the following command,
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar -mapper "python3 /scripts/mapper1.py" -reducer "python3 /scripts/reducer1.py" -input /scripts/access_log -output /outputone
We use the above command to run the first mapper/reducer pair; the remaining nine pairs are run the same way, changing only the script names and the output directory. To view the output we use,
hdfs dfs -cat /outputone/part-00000
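As an illustration only, a first mapper might count requests per client IP, assuming a standard Apache access_log whose first field is the IP address, with reducer1.py following the same count-reducer pattern sketched earlier. The actual ten analyses may differ.

mapper1.py (sketch):
#!/usr/bin/env python3
# Emit the client IP (first field of an access_log line) with a count of 1.
import sys

for line in sys.stdin:
    fields = line.split()
    if fields:
        print("%s\t1" % fields[0])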
Then we copy each job's output directory from HDFS to the local filesystem, for example,
hdfs dfs -get /outputone /scripts