Implementation of the TF-IDF algorithm based on the MapReduce model
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
The Cloudera virtual machine makes it possible to launch a single-node cluster for Hadoop experiments. It uses 4 GB of RAM.
https://www.cloudera.com/downloads.html
A Bash script is used for job chaining. Since the chaining does not require complex actions, a script is a good fit: it launches the jobs one by one and copies files from the local file system to HDFS and back. Every Java class implementing a Job, Mapper, or Reducer is packed into a JAR and stored on the local FS in /home/cloudera/TFIDF/. Moreover, after each job finishes, the script deletes that job's input.
-
The user launches the TF-IDF.sh bash script with two arguments: the first defines the location of the input folder on the local disk, and the second defines the local output folder to which the final results are copied after the algorithm finishes.
-
The script counts the number of files in the input folder; this count is used by the third job to calculate the IDF.
-
The input files are copied to the Hadoop Distributed File System, into /TF-IDF/input/.
-
The script launches Job_1.
The output of Job_1 is stored in /TF-IDF/j1-output in HDFS.
Mapper:
Input: <lineNumber, lineOfText>
Output: <word#docname, 1>
Reducer:
Input: <word#docname, [1, 1, ...]>
Output: <word#docname, n>, where n = sum(1 + 1 + ...) is the term count
After Job_1 finishes, the /TF-IDF/input folder in HDFS is cleaned.
-
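The counting step above can be sketched in plain Java. This is only a simulation of Job_1's map/reduce flow, not the actual Hadoop Mapper/Reducer classes; the class name Job1Sim and the tokenization rule are illustrative assumptions:

```java
import java.util.*;

// Simulates Job_1: map each line to <word#docname, 1> pairs and
// reduce by summing the ones, yielding the raw term count n.
public class Job1Sim {
    public static Map<String, Integer> termCounts(String docname, List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Mapper: emit <word#docname, 1> for every token in the line
            // (splitting on non-word characters is an assumed tokenization).
            for (String word : line.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                // Reducer: sum the ones for each word#docname key.
                counts.merge(word + "#" + docname, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(termCounts("doc1.txt", List.of("hello world", "hello again")));
    }
}
```

Here the map and reduce phases are collapsed into one pass; in the real job they run as separate Hadoop tasks with a shuffle in between.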
The script launches Job_2.
The output of Job_2 is stored in /TF-IDF/j2-output in HDFS.
Mapper:
Input: <word#docname, n> <- Input from /TF-IDF/j1-output folder in HDFS
Output: <docname, word=n>
Reducer:
Input: <docname, [word1=n1, word2=n2, ...]>
Output: <word#docname, n/N>, where N is the total number of words in docname
After Job_2 finishes, the /TF-IDF/j1-output folder in HDFS is cleaned.
-
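The Job_2 transformation can be simulated in plain Java as well. This is a sketch only; Job2Sim is a hypothetical name, and the real job reads Job_1's text output from HDFS rather than an in-memory map:

```java
import java.util.*;

// Simulates Job_2: regroup <word#docname, n> by docname, compute the
// per-document word total N, and emit the term frequency n/N.
public class Job2Sim {
    public static Map<String, Double> termFrequencies(Map<String, Integer> counts) {
        // Reducer side: total number of words N in each document.
        Map<String, Integer> totals = new HashMap<>();
        counts.forEach((key, n) ->
                totals.merge(key.substring(key.indexOf('#') + 1), n, Integer::sum));
        // Emit <word#docname, n/N>.
        Map<String, Double> tf = new HashMap<>();
        counts.forEach((key, n) ->
                tf.put(key, (double) n / totals.get(key.substring(key.indexOf('#') + 1))));
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies(Map.of("hello#d1", 2, "world#d1", 1, "hello#d2", 1)));
    }
}
```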
The script launches Job_3.
The output of Job_3 is stored in /TF-IDF/output in HDFS. This is the final output.
Mapper:
Input: <NumLine, word#docname n/N> <- Input from /TF-IDF/j2-output folder in HDFS
Output: <word, docname=n/N>
Reducer:
Input: <word, [docname1=n1/N1, docname2=n2/N2, ...]>
Output: <word#docname, TF-IDF = tf_idf TF = tf IDF = idf>, where the IDF is computed using the total number of documents counted by the script
-
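The final step can likewise be sketched in plain Java. The text does not spell out the IDF formula, so this sketch assumes the standard idf = ln(D/d), where D is the total number of documents (the file count taken by the script) and d is the number of documents containing the word; Job3Sim is an illustrative name:

```java
import java.util.*;

// Simulates Job_3: regroup <word#docname, n/N> by word, count in how many
// documents d each word occurs, and emit TF-IDF = (n/N) * ln(D/d).
public class Job3Sim {
    public static Map<String, Double> tfIdf(Map<String, Double> tf, int totalDocs) {
        // Reducer side: document frequency d of each word.
        Map<String, Integer> docFreq = new HashMap<>();
        tf.keySet().forEach(key ->
                docFreq.merge(key.substring(0, key.indexOf('#')), 1, Integer::sum));
        // Emit <word#docname, tf * idf>, assuming idf = ln(D/d).
        Map<String, Double> out = new HashMap<>();
        tf.forEach((key, tfVal) -> {
            String word = key.substring(0, key.indexOf('#'));
            double idf = Math.log((double) totalDocs / docFreq.get(word));
            out.put(key, tfVal * idf);
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> tf = Map.of("hello#d1", 0.5, "hello#d2", 0.5, "rare#d1", 0.5);
        System.out.println(tfIdf(tf, 2));
    }
}
```

Note that under this formula a word appearing in every document gets idf = ln(1) = 0, so its TF-IDF is zero regardless of how frequent it is.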
Finally, the /TF-IDF/output folder is copied to the local disk folder specified by the script's second argument.