The objective of this project is setting up an Hadoop cluster locally
on one computer using Docker containers and a MapReduce job to count the number of
missing value of the Country column.
Originally I developed the MapReduce application in Python but it appeared that Python was not installed on containers.
Due to this, I developed the application in Java.
The Hadoop cluster is set up with Docker containers. The following repository was used for the docker-compose configuration: https://github.com/big-data-europe/docker-hadoop
To run all the involved containers, after cloning the repository, run the following command:
docker-compose up -d
The Hadoop UI is accessible at http://localhost:9870
The application should be compilied to generate a jar. That is done with Maven on IntelliJ.
The jar is generated in the target directory.
Enter in the Docker container of namenode:
docker exec -it namenode bash
Once in the container, create a tp-map-reduce directory:
mkdir tp-map-reduce
Create a tp-map-reduce in HDFS:
hdfs dfs -mkdir /tp-map-reduce
Exit the container and send the local files to container:
exit
docker cp target/map-reduce-tp-1.0-SNAPSHOT.jar namenode:/tp-map-reduce/
docker cp InputData.csv namenode:/tp-map-reduce/
Enter the container and send the csv file to HDFS
hdfs dfs -mkdir /tp-map-reduce
hdfs dfs -put /tp-map-reduce/InputData.csv /tp-map-reduce/InputData.csv
hadoop jar tp-map-reduce/map-reduce-tp-1.0-SNAPSHOT.jar Runner /tp-map-reduce/InputData.csv /tp-map-reduce/out
The application calculate the total number of missing values for the Country column of the csv file. Once the calculation is done, the result is written in HDFS in /tp-map-reduce/out
It appears there is no missing values for the Country column.
This also can be verified using Python:
Given there was no missing values for the Country column and I wanted to see the result of an aggregation using MapReduce I made another MapReduce application that calculate the total number of characters in the Country column.
In order to achieve that I developed another Mapper (WordCountMapper), the reducer remains the same.