SparkleCluster is a template that uses Hadoop's HDFS and YARN to run containerized Spark or MapReduce jobs, providing a scalable environment for processing large datasets.
- Distributed processing using Hadoop's YARN and HDFS.
- Containerized environment for easy setup and deployment.
- Spark or MapReduce for efficient data processing.
- Job History Server for monitoring and debugging completed jobs.
These instructions will get a copy of the project up and running on your local machine for development and testing purposes. You will need the following installed:
- Docker
- Docker Compose
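Before continuing, you can confirm both prerequisites are available; the commands below only report versions and make no changes:

```bash
# Verify that Docker and the Compose plugin are installed.
docker --version
docker compose version
```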
To set up and run the cluster:

- Clone the SparkleCluster repository to your local machine.
- Navigate to the project directory.
- Build and start the Docker containers:

  ```bash
  docker compose up -d
  ```

- Access the `namenode` Docker container:

  ```bash
  docker exec -it namenode bash
  ```

- Navigate to the `/opt/hadoop` directory:

  ```bash
  cd /opt/hadoop
  ```

- Run the `run_map_reduce.sh` or `run_spark.sh` script to start a MapReduce or Spark job (a sketch of what such a submission can look like follows this list):

  ```bash
  ./run_map_reduce.sh
  ```
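The contents of the run scripts are not reproduced here, so the following is only a sketch of how such jobs are commonly submitted by hand from inside the `namenode` container, using the example jars that ship with standard Hadoop and Spark distributions. The HDFS paths, jar locations, and the assumption that Spark is installed on the image are illustrative, not taken from this repository:

```bash
# Stage some input data in HDFS (paths are illustrative).
# Run from /opt/hadoop, where etc/hadoop/*.xml config files live.
hdfs dfs -mkdir -p /user/root/input
hdfs dfs -put ./etc/hadoop/*.xml /user/root/input

# Submit a MapReduce job using the example jar bundled with Hadoop.
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/root/input /user/root/output

# Inspect the reducer output.
hdfs dfs -cat /user/root/output/part-r-00000 | head

# Alternatively, submit a Spark job to YARN (assumes Spark is on the image).
spark-submit --master yarn --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    "$SPARK_HOME"/examples/jars/spark-examples_*.jar 100
```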
You can monitor the progress of your MapReduce jobs and view completed jobs using the Hadoop Job History Server. By default, the Job History Server's web interface is accessible at `http://<hostname>:19888`, where `<hostname>` is the name or IP address of the machine running the Job History Server.
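If the compose file publishes port 19888 to the host (an assumption about this template), you can also query the Job History Server's standard REST API directly instead of using the web UI:

```bash
# List completed MapReduce jobs known to the Job History Server.
curl -s http://localhost:19888/ws/v1/history/mapreduce/jobs
```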
This project is licensed under the MIT License - see the `LICENSE.md` file for details.