There are solutions for three tasks in my Big Data course porject.
Mention:
- The punctuations in texts are not considered.
- You could use
.replace()
orre.sub()
in thesemapper.py
files to realize some transformations.
There are three files in task01
.
Please make sure your file path in HDFS, and put your file path of text corpus into the -input
in run01.sh
.
Use time
command to record the running time:
time sh run01.sh
By my MapReduce System, the output showed:
real 8m50.352s
user 0m6.884s
sys 0m0.544s
Use hdfs dfs -cat OUTPUT_PATH/part-00000
command to check the outputs:
...
a 330283
aaron 113
...
abductions 2
abductor 3
abductors 2
...
There are three files in task02
.
Please make sure your file path in HDFS, and put your file path of text corpus into the -input
in run02.sh
.
What I have to mention is that:
- Firstly, the function
environ
in packageos
.
import os
- Besides,
"mapreduce_map_input_file"
and"map_input_file"
inos.environ
are similar. To prove this, I used these two in different files intask02
andtask03
separately.
After setting your file path in run02.sh
, you could command:
./run02.sh
Use hdfs dfs -cat OUTPUT_PATH/part-00000
command:
...
trouvent {'Andrew-Lang___Old-Friends---Essays-in-Epistolary-Parody.txt': [2664, 2652]}
...
remarks? {'Anthony-Trollope___He-Knew-He-Was-Right.txt': [34777]}
remarks; {'Abraham-Lincoln___The-Writings-of-Abraham-Lincoln-Volume-5--1858-1862.txt': [6824], 'Beatrix-Potter___The-Tale-of-Johnny-Town-Mouse.txt': [61], 'Ambrose-Bierce___The-Devilss-Dictionary.txt': [571], 'Andrew-Lang___John-Knox-and-the-Reformation.txt': [7700], 'Anthony-Trollope___Miss-Mackenzie.txt': [8592]}
...
See file task03
. There are nine files in this three-step task. You could realize it by the file names.
What I have to mention is that:
- The original text corpus cannot be used if you still use commas to split variables. Because of the commas in these filenames, the
.split(',', 1)
function will be confused. - So, there are two methods:
- Remove these commas in filenames firstly.
- Use another separator which is not in filenames, like
/
. (Better Choice)
- Besides, the output from
run0301.sh
is the input ofrun0302.sh
, and output ofrun0302.sh
is the input ofrun0303.sh
.
Then after setting your file path in these .sh
, you could separately command:
./run0301.sh
./run0302.sh
./run0303.sh
Use hdfs dfs -cat OUTPUT_PATH/part-00000
command:
...
lowering,Anthony-Trollope___The-Prime-Minister.txt 7.28318916935e-07
lowering,Andrew-Lang___The-Grey-Fairy-Book.txt 2.08061697229e-06
lowering,Anthony-Trollope___Life-of-Cicero-Volume-One.txt 1.68287839521e-06
unlearned,Andrew-Lang___Essays-in-Little.txt 8.18022675589e-06
unlearned,Andrew-Lang___Letters-to-Dead-Authors.txt 1.36579266053e-05
unlearned,Alexander-Pope___The-Works-of-Alexander-Pope,-Volume-1.txt 2.48569619932e-06
unlearned,Anthony-Trollope___Lady-Anna.txt 3.42462756224e-06
...