Network analysis of Enron Email Dataset using mapreduce and R

Network analysis of Enron emails

Enron Network over Time Based on Sent Emails | Date Range: 1998-11->2001-04

What is it about?

We explore the social network aspect of the Enron Email dataset. The goal is to see how a network of people behaved in a company that was caught committing fraud, and how knowledge flowed through that network. Additionally, this project is a great way to learn about dealing with large unstructured data using the mapreduce concept and open source tools like a hadoop virtual machine and R.

We use hadoop, with plain shell as an alternative, to transform the semi-structured email data into something we can work with. Using this data we can visualize the social network grouped by metrics such as edge betweenness and centrality index.

The most challenging part of this project was coming up with rules to extract the metrics we wanted from the emails using mapreduce. We did this with regular expressions, python and unix shell/hadoop. It also turned out that hadoop with a single worker node is much slower than plain shell when executing the mapreduce jobs, so we provide examples in the code section for both shell and hadoop. We ran two mapreduce jobs in shell and they completed in around 25 minutes each.
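As a rough illustration of the rule-based extraction described above, here is a minimal sketch of a mapper-style function that pulls sender/recipient pairs out of raw email headers with regular expressions. It is not the code in mapreducers/. The function name extract_pairs, the three patterns, the sample addresses and the tab-separated output format are all assumptions made for this example, and real Enron headers are messier than these patterns allow for.

```python
import re

# Assumed header patterns (hypothetical, not taken from mapper1.py).
FROM_RE = re.compile(r"^From:\s*(\S+@\S+)")
TO_RE = re.compile(r"^To:\s*(.*)")
ADDRESS_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")


def extract_pairs(lines):
    """Yield (sender, recipient) pairs from a stream of email lines."""
    sender = None
    in_to = False  # True while we are inside a (possibly folded) To: header
    for line in lines:
        from_match = FROM_RE.match(line)
        to_match = TO_RE.match(line)
        if from_match:
            sender = from_match.group(1).lower()
            in_to = False
            continue
        if to_match:
            in_to = True
            text = to_match.group(1)
        elif in_to and line[:1] in (" ", "\t"):
            text = line  # continuation line of a folded To: header
        else:
            in_to = False
            continue
        if sender:
            for recipient in ADDRESS_RE.findall(text):
                yield sender, recipient.lower()


# Made-up sample input; a streaming mapper would iterate over sys.stdin instead.
sample = [
    "From: alice@enron.com\n",
    "To: bob@enron.com,\n",
    "\tcarol@enron.com\n",
    "Subject: Re: forecast\n",
]
for sender, recipient in extract_pairs(sample):
    print("%s\t%s\t1" % (sender, recipient))
```

Emitting one line per pair with a count of 1 keeps the mapper stateless across emails, so the counting can be left to the sort and reduce steps.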

R Network Analysis

Links to R analysis with code and without code.

Code

Upload Enron data to hadoop

We assume the enron-emails dataset folder is inside the data folder. You can download the Enron Email Dataset from this link. Commands are executed from the current (enron-network-using-mapreduce-and-R/) directory.

For hadoop we are using the CDH 5.5 virtual machine by Cloudera.

First, run the emails-rename.sh shell script to give the email files unique names instead of repeating numbers. Refer to the README.md in shell-scripts/ for more info.

mkdir -p data/enron-emails-sent data/enron-emails-inbox
sh shell-scripts/emails-rename.sh data/enron-emails sent data/enron-emails-sent
sh shell-scripts/emails-rename.sh data/enron-emails inbox data/enron-emails-inbox
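
To make the renaming step above more concrete, here is a minimal Python sketch of the same idea. It is not a port of emails-rename.sh. The function name copy_with_unique_names, the assumed layout of source_dir/user/folder/file, and the user-filename naming scheme are assumptions made for illustration; the names the real script produces may differ.

```python
import os
import shutil


def copy_with_unique_names(source_dir, folder, dest_dir):
    """Copy every file under <source_dir>/<user>/<folder>/ into dest_dir,
    prefixing each file name with the user directory name (assumed scheme)."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for user in sorted(os.listdir(source_dir)):
        folder_path = os.path.join(source_dir, user, folder)
        if not os.path.isdir(folder_path):
            continue  # this user has no such mail folder
        for name in sorted(os.listdir(folder_path)):
            src = os.path.join(folder_path, name)
            if os.path.isfile(src):
                # The same file name repeats across users; the prefix makes it unique.
                dst = os.path.join(dest_dir, "%s-%s" % (user, name))
                shutil.copyfile(src, dst)
                copied.append(dst)
    return copied
```

Under these assumptions, copy_with_unique_names("data/enron-emails", "sent", "data/enron-emails-sent") would mirror the first rename command above.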

Now, we can upload uniquely named files to hdfs.

hadoop fs -mkdir enron-sent enron-inbox
hadoop fs -put data/enron-emails-sent/* enron-sent
hadoop fs -put data/enron-emails-inbox/* enron-inbox

Mapreduce

Refer to README.md in mapreducers/ to find more info about the mappers and reducers. They can be executed in shell or with the hadoop streaming API. Below are examples for the first mapreduce job. To execute the other one, change the digit in the mapper and reducer names from 1 to 2.
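
The same scripts work in both settings because a streaming reducer only ever sees lines sorted by key on standard input, whether the sorting is done by hadoop's shuffle or by sort in a shell pipe. The sketch below shows that pattern under assumed conventions: tab-separated lines ending in an integer count, and a hypothetical function name sum_counts. It is not the logic of reducer1.py or reducer2.py.

```python
from itertools import groupby


def sum_counts(sorted_lines):
    """Yield (key, total) for tab-separated 'key<TAB>count' lines sorted by key.

    Assumes every non-empty line is well formed and ends in an integer count.
    """
    parsed = (
        line.rstrip("\n").rsplit("\t", 1)
        for line in sorted_lines
        if line.strip()
    )
    # groupby only merges adjacent equal keys, which is why the input must be sorted.
    for key, group in groupby(parsed, key=lambda parts: parts[0]):
        yield key, sum(int(count) for _, count in group)


# Made-up mapper output; sorted() stands in for `sort` or the hadoop shuffle.
mapper_output = [
    "alice@enron.com\tbob@enron.com\t1\n",
    "alice@enron.com\tcarol@enron.com\t1\n",
    "alice@enron.com\tbob@enron.com\t1\n",
]
for key, total in sum_counts(sorted(mapper_output)):
    print("%s\t%d" % (key, total))
```

Because groupby holds only one group in memory at a time, a reducer written this way can stream through input of any size.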

Hadoop

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-sent \
-output conns-sent -file mapper1.py -file reducer1.py \
-mapper mapper1.py -reducer reducer1.py

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-inbox \
-output conns-inbox -file mapper1.py -file reducer1.py \
-mapper mapper1.py -reducer reducer1.py

# download results (each job output is a directory of part-* files)
hadoop fs -cat conns-sent/part-* > conns-sent.txt
hadoop fs -cat conns-inbox/part-* > conns-inbox.txt

Edit the reducer1.py file. Comment out lines 37 and 43. Uncomment lines 38 and 44.

The commands below write to new output names, n-conns-sent and n-conns-inbox.

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-sent \
-output n-conns-sent -file mapper2.py -file reducer2.py \
-mapper mapper2.py -reducer reducer2.py

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-inbox \
-output n-conns-inbox -file mapper2.py -file reducer2.py \
-mapper mapper2.py -reducer reducer2.py

# download results (each job output is a directory of part-* files)
hadoop fs -cat n-conns-sent/part-* > n-conns-sent.txt
hadoop fs -cat n-conns-inbox/part-* > n-conns-inbox.txt

Shell

find ../data/enron-emails-inbox/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/conns-inbox.txt

find ../data/enron-emails-sent/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/conns-sent.txt

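As in the hadoop usage above, edit reducer1.py first, then rerun with the output filenames changed to n-conns-inbox.txt and n-conns-sent.txt.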

find ../data/enron-emails-inbox/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/n-conns-inbox.txt

find ../data/enron-emails-sent/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/n-conns-sent.txt