This project is a set of Java applications and Bash scripts for running Hadoop MapReduce word count jobs on large text datasets. It includes two Java applications, KC.java and RC.java, together with Bash scripts for compiling the code, running the jobs, and processing their output. KC.java counts every word in the input, while RC.java removes common stop words before counting (compare the two sample outputs at the end of this README).
.
├── KC.java # Java source file for KC application
├── RC.java # Java source file for RC application
├── nostop-output-terminal.txt # Terminal output of the KC application
├── stop-output-terminal.txt # Terminal output of the RC application
├── kc.jar # Compiled JAR for the KC application
├── README.md
├── nostop.sh # Script for running the KC application
├── input # Input files for the applications
│ ├── b1.txt
│ ├── b2.txt
│ ├── b3.txt
│ ├── b4.txt
│ └── b5.txt
├── rc.jar # Compiled JAR for the RC application
├── stop.sh # Script for running the RC application
├── top_25_KC.sh # Script to display top 25 words from KC output
└── top_25_RC.sh # Script to display top 25 words from RC output
Compile and package the Java application (shown here for RC.java):
$ hadoop com.sun.tools.javac.Main RC.java
$ jar cf rc.jar RC*.class
$ hdfs dfs -copyToLocal /user/parallels/Wordcount/2finalstop /media/psf/SSD/dic-fi/
Or run the provided script:
$ ./stop.sh
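For orientation, here is a minimal sketch of what stop.sh might look like. The output path and the copy-to-local destination come from the commands above; the HDFS input directory (/user/parallels/Wordcount/input) and the main class name RC are assumptions, not taken from the script itself:

```bash
#!/bin/bash
# Compile the job and package it into rc.jar
hadoop com.sun.tools.javac.Main RC.java
jar cf rc.jar RC*.class

# Remove any previous output directory so the job can be rerun
hdfs dfs -rm -r -f /user/parallels/Wordcount/2finalstop

# Run the word count job (main class RC): <input dir> <output dir>
hadoop jar rc.jar RC /user/parallels/Wordcount/input /user/parallels/Wordcount/2finalstop

# Copy the results back to the local filesystem
hdfs dfs -copyToLocal /user/parallels/Wordcount/2finalstop /media/psf/SSD/dic-fi/
```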
The process for KC.java is the same, using kc.jar and nostop.sh instead:
$ ./nostop.sh
To analyze the output of the MapReduce jobs, use the following scripts to display the top 25 most frequent words (a sketch of what such a script might contain follows the sample outputs below):
For the RC.java output:
$ ./top_25_RC.sh
elizabeth 691
man 527
time 485
project 442
gutenberg 435
good 431
alice 385
darcy 380
hester 361
life 359
day 347
love 338
lady 319
long 317
thought 315
eyes 310
bennet 307
letter 305
great 303
romeo 303
work 303
dear 300
mother 281
father 280
heart 278
For the KC.java output:
$ ./top_25_KC.sh
the 17376
and 11352
of 11234
to 10022
a 6880
i 6396
in 5524
that 4377
was 4231
it 4030
her 4002
with 3428
my 3211
she 3143
you 3097
as 3079
not 3032
his 2983
he 2833
had 2830
be 2641
but 2545
for 2517
is 2130
on 1965
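For reference, a top-25 script such as top_25_RC.sh can be as simple as sorting the job's merged output by count. A minimal sketch, assuming the job output was copied to the local path used in the commands above (the exact local directory is an assumption):

```bash
#!/bin/bash
# Merge the reducer output files (part-r-*), sort by the count column
# in descending numeric order, and keep the 25 most frequent words.
cat /media/psf/SSD/dic-fi/2finalstop/part-r-* | sort -k2,2nr | head -n 25
```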