bigdata-course-project

This repository contains solutions to the three tasks in my Big Data course project.

Notes:

  • Punctuation in the texts is not given any special treatment by default.
  • You can use .replace() or re.sub() in the mapper.py files to implement such transformations, as in the sketch below.
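
For example, a minimal sketch of a Streaming mapper that strips punctuation with re.sub before emitting word/count pairs (illustrative only; the actual mapper.py files may tokenize differently):

import re
import sys

for line in sys.stdin:
    # Replace punctuation with spaces, lower-case, and split into words.
    cleaned = re.sub(r'[^\w\s]', ' ', line).lower()
    for word in cleaned.split():
        print('%s\t%d' % (word, 1))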

Task 01

There are three files in task01.

Make sure the text corpus has been uploaded to HDFS, then put its HDFS path into the -input argument in run01.sh.

Use the time command to record the running time:

time sh run01.sh

On my MapReduce setup, time reported:

real 8m50.352s
user 0m6.884s
sys 0m0.544s

Use the hdfs dfs -cat OUTPUT_PATH/part-00000 command to check the output:

...
a	330283
aaron	113
...
abductions	2
abductor	3
abductors	2
...
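
For reference, counts like the ones above come from summing the 1s emitted by the mapper for each word. A minimal reducer sketch, relying on Hadoop Streaming handing the keys to the reducer in sorted order (illustrative; the reducer in task01 may differ in detail):

import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        # Keys arrive sorted, so a new word means the previous one is finished.
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print('%s\t%d' % (current_word, current_count))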

Task 02

There are three files in task02.

Make sure the text corpus has been uploaded to HDFS, then put its HDFS path into the -input argument in run02.sh.

A few points to note:

  • First, the mappers rely on os.environ from the os module (import os) to find out which input file they are processing.
  • Besides, the "mapreduce_map_input_file" and "map_input_file" keys in os.environ carry the same information. To show this, I used one of them in task02 and the other in task03. See the sketch after this list.
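
A minimal sketch of how a Streaming mapper can read the input file name (Hadoop Streaming exposes job configuration properties as environment variables with dots replaced by underscores; the fallback order here is illustrative):

import os

# mapreduce.map.input.file (newer name) and map.input.file (older name)
# become environment variables with dots replaced by underscores.
input_path = os.environ.get('mapreduce_map_input_file',
                            os.environ.get('map_input_file', ''))
filename = input_path.split('/')[-1]  # e.g. Anthony-Trollope___Lady-Anna.txt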

After setting your file paths in run02.sh, you can run:

./run02.sh

Check the output with the hdfs dfs -cat OUTPUT_PATH/part-00000 command:

...
trouvent    {'Andrew-Lang___Old-Friends---Essays-in-Epistolary-Parody.txt': [2664, 2652]}
...
remarks?    {'Anthony-Trollope___He-Knew-He-Was-Right.txt': [34777]}
remarks;    {'Abraham-Lincoln___The-Writings-of-Abraham-Lincoln-Volume-5--1858-1862.txt': [6824], 'Beatrix-Potter___The-Tale-of-Johnny-Town-Mouse.txt': [61], 'Ambrose-Bierce___The-Devilss-Dictionary.txt': [571], 'Andrew-Lang___John-Knox-and-the-Reformation.txt': [7700], 'Anthony-Trollope___Miss-Mackenzie.txt': [8592]}
...
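
The per-word dictionaries above can be produced by a reducer that groups the mapper's output by word. A minimal sketch, assuming the mapper emits lines of the form word\tfilename\tposition (the exact field layout used in task02 may differ):

import sys

current_word = None
postings = {}

def flush(word, postings):
    # Emit the word together with its {filename: [positions]} dictionary.
    if word is not None:
        print('%s\t%s' % (word, postings))

for line in sys.stdin:
    word, filename, position = line.rstrip('\n').split('\t', 2)
    if word != current_word:
        flush(current_word, postings)
        current_word, postings = word, {}
    postings.setdefault(filename, []).append(int(position))

flush(current_word, postings)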

Task 03

See the task03 directory. There are nine files for this three-step task; the file names indicate which step each belongs to.

A few points to note:

  • The original text corpus cannot be processed correctly if you keep using commas to separate fields: some filenames contain commas themselves, so splitting the composite key with .split(',', 1) can break it in the wrong place.
  • There are two ways around this:
    • Remove the commas from the filenames first.
    • Use a separator that never appears in the filenames, such as /. (Better choice; see the sketch after this list.)
  • Besides, the output of run0301.sh is the input of run0302.sh, and the output of run0302.sh is the input of run0303.sh.
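
For illustration, a sketch of building and splitting a composite key with / as the separator (the word/filename key layout is an assumption, not necessarily the exact intermediate format used in task03):

word = 'unlearned'
filename = 'Alexander-Pope___The-Works-of-Alexander-Pope,-Volume-1.txt'

# The filename itself contains a comma, so ',' is unsafe as a field separator.
key = '%s/%s' % (word, filename)

# '/' never appears in the filenames, so one split recovers both fields.
word_back, filename_back = key.split('/', 1)
assert (word_back, filename_back) == (word, filename)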

Then, after setting your file paths in these .sh scripts, run them in order:

./run0301.sh
./run0302.sh
./run0303.sh

Check the final output with the hdfs dfs -cat OUTPUT_PATH/part-00000 command:

...
lowering,Anthony-Trollope___The-Prime-Minister.txt	7.28318916935e-07
lowering,Andrew-Lang___The-Grey-Fairy-Book.txt	2.08061697229e-06
lowering,Anthony-Trollope___Life-of-Cicero-Volume-One.txt	1.68287839521e-06
unlearned,Andrew-Lang___Essays-in-Little.txt	8.18022675589e-06
unlearned,Andrew-Lang___Letters-to-Dead-Authors.txt	1.36579266053e-05
unlearned,Alexander-Pope___The-Works-of-Alexander-Pope,-Volume-1.txt	2.48569619932e-06
unlearned,Anthony-Trollope___Lady-Anna.txt	3.42462756224e-06
...