We explore the social network aspect of the Enron Email dataset. The goal is to see how a network of people behaved in a company that was caught committing fraud, and how knowledge flowed through the network. The project is also a great way to learn about handling large unstructured data using the MapReduce concept and open source tools such as a Hadoop virtual machine and R.
We use Hadoop and the shell interchangeably to transform the semi-structured email data into something we can work with. From this data we can visualize the social network and group it by metrics such as edge betweenness and centrality index.
The most challenging part of this project was coming up with rules to extract the metrics we wanted from the emails using MapReduce. We did this with regular expressions, Python, and the Unix shell/Hadoop. It also turned out that the single-worker-node virtual machine was much slower than a plain shell pipeline when executing the MapReduce jobs, so we provide examples in the code section for both shell and Hadoop. The two MapReduce jobs we ran in the shell completed in around 25 minutes each.
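To make the extraction rules concrete, here is a minimal, hypothetical streaming mapper in the spirit of mapper1.py (a sketch, not the project's actual code): it scans each email for the From: and To: headers with regular expressions and emits one sender-recipient pair per line.
#!/usr/bin/env python
# Hypothetical sketch of a streaming mapper (not the actual mapper1.py).
# Assumes raw email text arrives on stdin and emits "sender<TAB>recipient".
import re
import sys

FROM_RE = re.compile(r'^From:\s*(\S+@\S+)', re.IGNORECASE)
TO_RE = re.compile(r'^To:\s*(.+)', re.IGNORECASE)

sender = None
for line in sys.stdin:
    m = FROM_RE.match(line)
    if m:
        sender = m.group(1).lower().rstrip(',')
        continue
    m = TO_RE.match(line)
    if m and sender:
        # To: can list several comma-separated recipients
        for addr in re.findall(r'\S+@\S+', m.group(1)):
            print('%s\t%s' % (sender, addr.lower().rstrip(',')))
        sender = None  # reset until the next From: header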
Links to the R analysis with code and without code.
We assume the enron-emails dataset folder is in the data folder. You can download the Enron Email Dataset from this link. Commands are executed from the current (enron-network-using-mapreduce-and-R/) directory.
For Hadoop we use the CDH 5.5 virtual machine by Cloudera.
First, you will have to run the emails-rename.sh shell script to give the email files unique names instead of repeating numbers. Refer to the README.md in shell-scripts/ for more info.
mkdir data/enron-emails-sent data/enron-emails-inbox
sh shell-scripts/emails-rename.sh data/enron-emails sent data/enron-emails-sent
sh shell-scripts/emails-rename.sh data/enron-emails inbox data/enron-emails-inbox
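For reference, here is a rough Python sketch of the renaming idea behind emails-rename.sh (the real logic lives in the shell script; this is only an illustration taking the same three arguments). It assumes the standard maildir layout data/enron-emails/<user>/<folder>/<n>. and prefixes each numbered file with its user name so names stay unique.
#!/usr/bin/env python
# Hypothetical sketch of the renaming done by emails-rename.sh.
# Usage: rename.py <src_root> <folder> <dst_dir>
import os
import shutil
import sys

src_root, folder, dst_dir = sys.argv[1], sys.argv[2], sys.argv[3]
for user in os.listdir(src_root):
    box = os.path.join(src_root, user, folder)
    if not os.path.isdir(box):
        continue
    for name in os.listdir(box):
        # 1. -> <user>_1. so file names no longer collide across users
        shutil.copy(os.path.join(box, name),
                    os.path.join(dst_dir, '%s_%s' % (user, name)))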
Now we can upload the uniquely named files to HDFS.
hadoop fs -mkdir enron-sent enron-inbox
hadoop fs -put data/enron-emails-sent/* enron-sent
hadoop fs -put data/enron-emails-inbox/* enron-inbox
Refer to the README.md in mapreducers/ for more info about the mappers and reducers. They can be executed in the shell or using the Hadoop streaming API. Below are examples for the first MapReduce job; to execute the other one, change the digit in the names of the mapper and reducer from 1 to 2.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-sent \
-output conns-sent -file mapper1.py -file reducer1.py \
-mapper mapper1.py -reducer reducer1.py
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-inbox \
-output conns-inbox -file mapper1.py -file reducer1.py \
-mapper mapper1.py -reducer reducer1.py
# download results
hadoop fs -cat 'conns-sent/part-*' > conns-sent.txt
hadoop fs -cat 'conns-inbox/part-*' > conns-inbox.txt
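A quick sanity check on the downloaded results; this snippet assumes each output line is a tab-separated sender-recipient pair, which may differ from the actual reducer output format.
# Hypothetical sanity check; assumes "sender<TAB>recipient" lines.
from collections import Counter

senders = Counter()
with open('conns-sent.txt') as f:
    for line in f:
        parts = line.rstrip('\n').split('\t')
        if len(parts) >= 2:
            senders[parts[0]] += 1

# ten most prolific senders by number of outgoing edges
for addr, n in senders.most_common(10):
    print('%s\t%d' % (addr, n))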
Edit the reducer1.py file: comment out lines 37 and 43, and uncomment lines 38 and 44. For the commands below we only changed the output names to n-conns-sent and n-conns-inbox.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-sent \
-output n-conns-sent -file mapper1.py -file reducer1.py \
-mapper mapper1.py -reducer reducer1.py
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-inbox \
-output n-conns-inbox -file mapper1.py -file reducer1.py \
-mapper mapper1.py -reducer reducer1.py
# download results
hadoop fs -cat 'n-conns-sent/part-*' > n-conns-sent.txt
hadoop fs -cat 'n-conns-inbox/part-*' > n-conns-inbox.txt
The same MapReduce jobs can also be run entirely in the shell. First job:
find ../data/enron-emails-inbox/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/conns-inbox.txt
find ../data/enron-emails-sent/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/conns-sent.txt
For the n-conns variant, make the same edit to reducer1.py as for the Hadoop usage:
find ../data/enron-emails-inbox/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/n-conns-inbox.txt
find ../data/enron-emails-sent/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/n-conns-sent.txt
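The actual network analysis is done in R (see the links above). Purely as an illustration of the metrics mentioned earlier, here is a rough Python/networkx equivalent, again under the assumption that the output files hold tab-separated sender-recipient pairs.
# Rough illustration with networkx instead of the project's R code.
# Assumes conns-sent.txt holds "sender<TAB>recipient" lines.
import networkx as nx

G = nx.DiGraph()
with open('conns-sent.txt') as f:
    for line in f:
        parts = line.rstrip('\n').split('\t')
        if len(parts) >= 2:
            G.add_edge(parts[0], parts[1])

# a centrality index for each person (degree centrality here)
centrality = nx.degree_centrality(G)

# edge betweenness, the metric used for grouping the network
betweenness = nx.edge_betweenness_centrality(G)

# print the ten most central people
top = sorted(centrality, key=centrality.get, reverse=True)[:10]
for node in top:
    print('%s\t%.4f' % (node, centrality[node]))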