TF-IDF-Hadoop

Implementation of the TF-IDF algorithm on the MapReduce model: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Primary language: Java

Environment for launching HDFS and Hadoop code

The Cloudera virtual machine provides a single-node cluster for Hadoop experiments. It uses 4 GB of RAM.
https://www.cloudera.com/downloads.html

TF-IDF.sh description

A Bash script is used for job chaining. Since the chaining does not require complex logic, a shell script is a good fit for linking these jobs. The script launches the jobs one by one and copies files from the local file system to HDFS and back. Every Java class containing the Job, Mapper, and Reducer implementations is packed into a JAR stored on the local file system under /home/cloudera/TFIDF/. In addition, after each job finishes, the script deletes that job's input.

Description of the process

  1. The user launches the TF-IDF.sh script with two arguments. The first defines the location of the input folder on the local disk; the second defines the local output folder where the final results are copied once the algorithm finishes.

  2. The script counts the number of files in the input folder. This count is used by the third job to calculate IDF.

  3. The input files are copied to the Hadoop Distributed File System under /TF-IDF/input/.

  4. Script launches Job_1
    Output of Job_1 is stored in /TF-IDF/j1-output in HDFS.
    Mapper:
    Input: <lineNumber, lineOfText>
    Output: <word#docname, 1>

    Reducer:
    Input: <word#docname, [1, 1, ...]>
    Output: <word#docname, n>, where n is the number of occurrences of word in docname
    After Job_1 finishes, the /TF-IDF/input folder in HDFS is deleted.

  5. Script launches Job_2
    Output of Job_2 is stored in /TF-IDF/j2-output in HDFS.
    Mapper:
    Input: <word#docname, n> <- input from the /TF-IDF/j1-output folder in HDFS
    Output: <docname, word=n>

    Reducer:
    Input: <docname, [word1=n1, word2=n2, ...]>
    Output: <word#docname, n/N>, where N is the total number of words in docname
    After Job_2 finishes, the /TF-IDF/j1-output folder in HDFS is deleted.

  6. Script launches Job_3
    Output of Job_3 is stored in /TF-IDF/output in HDFS. This is the final output.
    Mapper:
    Input: <lineNumber, word#docname n/N> <- input from the /TF-IDF/j2-output folder in HDFS
    Output: <word, docname=n/N>

    Reducer:
    Input: <word, [docname1=n1/N1, docname2=n2/N2, ...]>
    Output: <word#docname, TF-IDF = tf_idf TF = tf IDF = idf>

  7. The /TF-IDF/output folder is copied to the local disk, into the folder specified by the second argument of the script.
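The counting logic of Job_1 (step 4) can be sketched in plain Java without the Hadoop runtime. This is an illustrative simulation, not the repository's actual Mapper/Reducer classes; the class and method names are hypothetical:

```java
import java.util.*;

// Illustrative sketch of Job_1: count occurrences of each word per document.
// In the real job the mapper emits ("word#docname", 1) for every word in a
// line, and the reducer sums the 1s; here both phases collapse into a map.
public class Job1Sketch {
    public static Map<String, Integer> countWords(String docname, List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                // merge() plays the role of the reducer's sum over emitted 1s
                counts.merge(word + "#" + docname, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = countWords("doc1", List.of("to be or not to be"));
        System.out.println(c.get("to#doc1")); // 2
        System.out.println(c.get("or#doc1")); // 1
    }
}
```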
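Job_2 (step 5) turns raw counts n into term frequencies n/N. A minimal sketch of that reducer-side computation, again with hypothetical names and no Hadoop dependency:

```java
import java.util.*;

// Illustrative sketch of Job_2's reducer: the reducer receives all
// word counts of one document, so it can compute N (total words) and
// emit the term frequency n/N for each word#docname key.
public class Job2Sketch {
    public static Map<String, Double> termFrequency(Map<String, Integer> counts) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> tf = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tf.put(e.getKey(), (double) e.getValue() / total); // n/N
        }
        return tf;
    }

    public static void main(String[] args) {
        // Counts as produced by Job_1 for a single document (N = 5).
        Map<String, Double> tf =
            termFrequency(Map.of("to#doc1", 2, "be#doc1", 2, "or#doc1", 1));
        System.out.println(tf.get("or#doc1")); // 0.2
    }
}
```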
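Job_3 (step 6) combines each word's per-document TF values with IDF, using the document count the script gathered in step 2. The sketch below assumes the classic natural-log IDF variant, idf = ln(D/d); the repository may use a different base or smoothing, and the class name is hypothetical:

```java
import java.util.*;

// Illustrative sketch of Job_3's reducer: for one word, tfByDoc maps
// docname -> n/N, and totalDocs is the file count computed by the script
// before Job_1 ran. Assumes IDF = ln(totalDocs / docsContainingWord).
public class Job3Sketch {
    public static Map<String, Double> tfIdf(Map<String, Double> tfByDoc, int totalDocs) {
        // tfByDoc.size() = number of documents containing this word
        double idf = Math.log((double) totalDocs / tfByDoc.size());
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> e : tfByDoc.entrySet()) {
            scores.put(e.getKey(), e.getValue() * idf); // tf * idf
        }
        return scores;
    }

    public static void main(String[] args) {
        // The word appears in 1 of 2 documents, with TF 0.4 in doc1.
        Map<String, Double> scores = tfIdf(Map.of("doc1", 0.4), 2);
        System.out.println(scores.get("doc1")); // 0.4 * ln(2) ≈ 0.277
    }
}
```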