/Average-Length-of-Comments-Using-Hadoop-MapReduce

This program finds the average number of words per comment in a large data set, using Hadoop MapReduce to process the data in parallel.

Primary language: Java

Finding the average number of words per comment in a data set

πŸ“ Mapper Function

In the mapper function, we first tokenize the input and then find the first occurrence of ‘Text="’, which marks the beginning of a comment. We then count the words in the comment until the closing ‘"’ is found, which marks its end.
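The counting step can be sketched in plain Java, outside Hadoop. This is a minimal illustration, not the repository's code: the class name `CommentWordCounter`, the helper `countWords`, and the XML-style sample row are all assumptions, and the sketch locates the marker with `indexOf` before tokenizing rather than scanning tokens one by one.

```java
import java.util.StringTokenizer;

class CommentWordCounter {
    // Hypothetical helper (not from the repository): returns the number of
    // whitespace-separated words inside the Text="..." attribute of one
    // input record, or -1 if the record has no such attribute.
    static int countWords(String record) {
        final String marker = "Text=\"";
        int start = record.indexOf(marker);
        if (start < 0) {
            return -1; // no comment in this record
        }
        start += marker.length();
        int end = record.indexOf('"', start);
        if (end < 0) {
            end = record.length(); // unterminated comment: count to end of record
        }
        StringTokenizer tokens = new StringTokenizer(record.substring(start, end));
        return tokens.countTokens();
    }

    public static void main(String[] args) {
        // Assumed sample row; the real data set's layout may differ.
        String row = "<row Id=\"1\" Text=\"Hadoop makes this easy\" Score=\"0\" />";
        System.out.println(countWords(row)); // prints 4
    }
}
```

Inside a real Hadoop mapper, the resulting count would typically be written to the context as an `IntWritable` under the shared key instead of being printed.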

Reducer Function

The length of each comment is sent to the reducer under a single shared key, ‘key’. The reducer sums the values and counts them; the count is the total number of comments. Dividing the sum by the count gives the average, which is returned to the main program and displayed.
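The reducer's arithmetic can be sketched in plain Java as follows. This is an illustration under assumptions, not the repository's reducer: the class name `AverageLengthSketch`, the method `average`, and the sample lengths are hypothetical, and the sketch uses floating-point division, whereas the original may use integer division.

```java
import java.util.Arrays;
import java.util.List;

class AverageLengthSketch {
    // Hypothetical helper (not from the repository): mirrors the reducer's
    // logic by summing all comment lengths received for the single key and
    // dividing by how many values arrived.
    static double average(Iterable<Integer> lengths) {
        long sum = 0;
        long count = 0;
        for (int length : lengths) {
            sum += length;
            count++;
        }
        // Guard against an empty input so we never divide by zero.
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public static void main(String[] args) {
        // Illustrative values only; these are not the repository's test data.
        List<Integer> lengths = Arrays.asList(30, 33, 36);
        System.out.println(average(lengths)); // prints 33.0
    }
}
```

Because every mapper output shares one key, a single reduce call sees all the lengths, which is what makes a global average possible in one pass.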

Files included:

The source code is in the .java files; a complete .jar file is also included.

Screenshots of output

Find below screenshots of the test run:

  • INPUT - The test input shown below contains 11 rows, so that the average length reported by Hadoop MapReduce can be checked by hand.

task2-testoutput

  • OUTPUT - The total number of words across all comments is divided by the total number of comments, giving an average of 33.

task2-testinput