This README file provides detailed instructions on implementing an inverted index using MapReduce in Java and running it on a Docker Hadoop cluster. Additionally, it explains how to monitor the application using the Resource Manager.
The inverted index is a data structure commonly used in information retrieval systems to quickly find documents that contain specific words. It maps each term to a list of documents in which the term appears. In this project, we'll implement an inverted index using MapReduce, a programming model for processing large datasets in parallel.
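To make the term-to-documents mapping concrete, here is a minimal single-machine sketch of the data structure in plain Java, without Hadoop. The class name `InvertedIndexSketch`, the file names, and the tokenization rule (lowercase, split on non-alphanumeric characters) are assumptions chosen for illustration, not part of this project's code.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of the project sources.
class InvertedIndexSketch {

    // Builds a map from each term to the sorted set of document IDs containing it.
    // Input: document ID -> full text of that document.
    static Map<String, Set<String>> build(Map<String, String> documents) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            // Assumed tokenization: lowercase, then split on anything that is not a-z or 0-9.
            String[] terms = doc.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue; // split() can produce a leading empty token
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1.txt", "Hadoop runs MapReduce jobs");
        docs.put("doc2.txt", "MapReduce builds the index");
        // Each term is printed with the documents that contain it.
        System.out.println(build(docs));
    }
}
```

In this example the term "mapreduce" maps to both documents, while "hadoop" maps only to doc1.txt. The MapReduce version produces the same kind of mapping, but spreads the work across the cluster.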
Before getting started, make sure you have the following prerequisites installed on your machine:
- Docker: To run the Hadoop cluster in containers.
- Java Development Kit (JDK): To compile and run the Java code.
- Apache Maven: To manage dependencies and build the project.
Docker: Install Docker on your machine. You can download Docker from the official website: https://www.docker.com/get-started
Clone the Hadoop Docker repository:
git clone https://github.com/big-data-europe/docker-hadoop.git
Navigate to the cloned repository:
cd docker-hadoop
Start the cluster:
docker-compose up -d
This command will download the necessary Docker images and start the Hadoop cluster.
Verify that the cluster is running by accessing the Hadoop Resource Manager UI:
http://localhost:8088
You should see the Resource Manager UI displaying cluster information.
- Create a new Java project in your preferred IDE.
- Add the Hadoop dependencies to your project. If you are using Maven, include the following dependencies in your pom.xml:
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
</dependencies>
For a complete example, see the pom.xml included in this repository.
- Implement the MapReduce job to generate the inverted index. You can refer to Hadoop's MapReduce documentation and examples to understand the implementation details.
- Build the Java project to generate a JAR file containing your MapReduce job.
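To illustrate what the map and reduce steps of the inverted-index job do, the following sketch simulates the three phases (map, shuffle, reduce) in plain Java on a single machine. It is not Hadoop code: in a real job the map and reduce logic would typically live in classes extending Hadoop's Mapper and Reducer, and the framework would perform the shuffle. The class name `InvertedIndexPhases`, the tokenization rule, and the comma-separated output format are assumptions made for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical single-machine simulation of the MapReduce phases.
class InvertedIndexPhases {

    // Map phase: emit one (term, docId) pair for every token in a line of input.
    static List<Map.Entry<String, String>> map(String docId, String line) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String term : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                pairs.add(new SimpleEntry<>(term, docId));
            }
        }
        return pairs;
    }

    // Shuffle: group all emitted values by key, with keys in sorted order.
    // On a Hadoop cluster the framework does this between the map and reduce phases.
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: deduplicate and sort the document IDs for one term,
    // then format the result as "term<TAB>doc1,doc2,...".
    static String reduce(String term, List<String> docIds) {
        return term + "\t" + String.join(",", new TreeSet<String>(docIds));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.addAll(map("a.txt", "Hello hello world"));
        pairs.addAll(map("b.txt", "world"));
        for (Map.Entry<String, List<String>> group : shuffle(pairs).entrySet()) {
            System.out.println(reduce(group.getKey(), group.getValue()));
        }
    }
}
```

The reduce step deduplicates document IDs because the map step emits a pair for every occurrence of a term, so a word that appears twice in one file would otherwise be listed twice.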
Copy the input data that you want to process into the NameNode container:
docker cp <local-file> namenode:/input/
Replace <local-file> with the path to your local input file. Note that docker cp places the file on the container's local filesystem, not in the Hadoop Distributed File System (HDFS). Before running the job, upload the file to HDFS from inside the container (for example with hdfs dfs -put).
Start a shell session in the Hadoop NameNode Docker container:
docker-compose exec namenode bash
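Submit the job from this shell with hadoop jar (the JAR must first be copied into the container), then open the Resource Manager UI to monitor it:
http://localhost:8088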
You can view the running job, its status, and resource usage on the Resource Manager UI.
Once the job completes, you can retrieve the output from HDFS using the following command:
hdfs dfs -get /output <local-output-directory>
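Replace <local-output-directory> with the directory where the output should be stored. If you run the command from the NameNode shell, this path is inside the container; use docker cp to copy the files from there to your local machine.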
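The output is written to files named part-r-00000, part-r-00001, and so on, one per reducer. With a single reducer, the output file is part-r-00000.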
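Once the output has been retrieved, each line can be read back into a term and its document list. The sketch below assumes the job writes text output with Hadoop's default tab separator between key and value, and that the reducer emits the document list as comma-separated names. The second point depends on how your reducer formats its values, so treat it as an assumption. The class name `IndexLineParser` is hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading one line of the job's text output.
class IndexLineParser {

    // Parses "term<TAB>doc1,doc2,..." into (term, [doc1, doc2, ...]).
    static Map.Entry<String, List<String>> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected a tab-separated line: " + line);
        }
        String term = line.substring(0, tab);
        List<String> docIds = new ArrayList<>();
        // Assumed value format: comma-separated document names.
        for (String docId : line.substring(tab + 1).split(",")) {
            String trimmed = docId.trim();
            if (!trimmed.isEmpty()) {
                docIds.add(trimmed);
            }
        }
        return new SimpleEntry<>(term, docIds);
    }

    public static void main(String[] args) {
        Map.Entry<String, List<String>> entry = parse("hadoop\tdoc1.txt,doc2.txt");
        System.out.println(entry.getKey() + " appears in " + entry.getValue());
    }
}
```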