/CSC536_FinalProject_MapReduceCloud

An distributed MapReduce application using Akka Cluster, Docker and Zookeeper

Primary LanguageScala

CSC536_FinalProject_MapReduceCloud

An distributed MapReduce application using Akka Cluster, Docker and Zookeeper

a. Prerequisites
This project has been compiled and tested on:
MacOS High Sierra 10.13.4
Java 1.8.0_121
Scala 2.12.4
Docker 18.03.1-ce
sbt 1.1.4
This project also depends on the following packages (no need to install beforehand, sbt will takes care of them):
akka-actor 2.5.12
akka-remote 2.5.12
akka-cluster 2.5.12
constructr 0.19.0
constructr-coordination-zookeeper 0.4.0
sbt-native-packager 1.3.4
zookeeper 3.4.10

b. Compile
Open a Terminal window on your computer.
The very first step is to stop and remove any Docker containers that are currently running by using:
docker stop $(docker ps -a -q)
And
docker rm $(docker ps -a -q)
After removal of all existing Docker containers, navigate to the project folder “/CSC536_FinalProject” and compile all source codes by using:
sbt docker:publishLocal
This will compile all source codes and prepare them for being used in Docker containers.

c. Execution
Exit sbt first and simply using the following command for executing the system on Docker:
docker-compose up
The configuration file “docker-compose.yml” has defined with one Zookeeper node, one master node, two reducer nodes and two mapper nodes. You can add more mappers or reducers by modifying that file.
You will see many messages exchanging before the cluster convergence. After 30 seconds (an arbitrary time interval set to wait for the cluster to converge), the master node will send out two Book messages to mappers. And after another 30 seconds (another arbitrary time interval set to wait for mappers/reducers to finish their jobs), the master node will broadcast out Flush message and you will see final results displayed on screen.

d. System Design
The overall implementation contains the following components:
• Zookeeper node: responsible for bootstrapping and service discovery, serves as a seed node. Zookeeper is similar to etcd but it is easier to configure. In my implementation, I used the ConstructR-zookeeper package to achieve its functionality.
• MasterActor node: responsible for sending Book jobs and coordinating with other cluster nodes. There will be only one master node.
• MapActor node (mapper): responsible for receiving Book job messages, mapping each word with its title, and sending Word messages to reducers.
• ReduceActor node (reducer): responsible for receiving Word job messages, putting them into reduce maps and outputting the final results.
• A group of messages such as Book, Word, Done and Flush.
• A Main object serves as the entry point of the entire application.
During execution, each component (node) runs in a separated Docker container but they are forming as a single Akka cluster. The Zookeeper node is responsible for bootstrapping and service discovery for every other cluster nodes. The master node will coordinate for all the reducers and mappers. It will wait for 30 seconds after starting up to make the cluster to converge. And then it will send Book job messages to mappers. After sending all Book jobs, it then waits for another 20 seconds before broadcasting the Flush message and wait for the mappers/reducers to reply.