Run a Hadoop program locally with IntelliJ and Maven

You can use this method to write and test your Hadoop program locally, without configuring a Hadoop environment on your own machine or using a cluster. This tutorial is based on Hadoop: Intellij结合Maven本地运行和调试MapReduce程序 (无需搭载Hadoop和HDFS环境) (in English: "Hadoop: running and debugging MapReduce programs locally with IntelliJ and Maven, without setting up a Hadoop and HDFS environment"), How-to: Create an IntelliJ IDEA Project for Apache Hadoop, and Developing Hadoop Mapreduce Application within Intellij IDEA on Windows 10.

Prerequisites

Oracle JDK 8
- JDK 8 is also called JDK 1.8. Do not confuse it with JDK 18.
- Oracle JDK is strongly recommended; OpenJDK may cause unexpected errors.
- Other versions of JDK, such as 9 or 11, may not work.
Apache Maven
Apache Hadoop
IntelliJ IDEA

Software Installation

Install Oracle JDK 8

Current latest version: 8u333 (1.8.0_333).

Linux (JDK 8)

Download from https://www.oracle.com/java/technologies/downloads/#java8-linux (free account required). The x64 Compressed Archive (filename jdk-8u333-linux-x64.tar.gz) is recommended since it does not require installation.
- Download the x86 Compressed Archive (filename jdk-8u333-linux-i586.tar.gz) if your system is 32-bit.
Untar the downloaded archive to any location you like, as long as the path does not contain spaces. For example, ~/jdk1.8.0_333.

macOS (JDK 8)

Download from https://www.oracle.com/java/technologies/downloads/#java8-mac (free account required). Only the x64 DMG Installer (filename jdk-8u333-macosx-x64.dmg) is provided.
Mount the dmg file and install JDK 8 to the default location /Library/Java/JavaVirtualMachines/jdk1.8.0_333.jdk.

Windows (JDK 8)

If your Windows account name (you can find it under C:\Users) contains a space, create another account without a space; otherwise Hadoop will not work properly.
Download from https://www.oracle.com/java/technologies/downloads/#java8-windows (free account required). The x64 Installer (filename jdk-8u333-windows-x64.exe) is recommended.
- Download the x86 Installer (filename jdk-8u333-windows-i586.exe) if your system is 32-bit.
Install it, but DO NOT install it to the default location (under C:\Program Files or C:\Program Files (x86)). Instead, install it to a location with no spaces in the path, such as C:\jdk1.8.0_333.
1. During the installation, change the Install to path of Development Tools to C:\jdk1.8.0_333 (remove Program Files\Java\ from the default path C:\Program Files\Java\jdk1.8.0_333).
2. Click the drive icon of Source Code and Public JRE, and disable them.
3. The custom setup page should look like this.

Install Apache Maven

Linux and macOS (Maven)

Download Binary tar.gz archive (filename apache-maven-3.8.5-bin.tar.gz) from https://maven.apache.org/download.cgi.
Decompress the archive to ~/apache-maven-3.8.5

Windows (Maven)

Download Binary zip archive (filename apache-maven-3.8.5-bin.zip) for Windows from https://maven.apache.org/download.cgi.
Decompress the archive to C:\apache-maven-3.8.5

Install Apache Hadoop

Linux and macOS (Hadoop)

Download the binary of the latest version 3.3.3 from https://hadoop.apache.org/releases.html
Untar the downloaded .tar.gz file to ~/hadoop-3.3.3.

Windows (Hadoop)

Hadoop on Windows must be patched, otherwise it will not work at all. The latest patch available is for Hadoop 3.2.2, so you should use an older version of Hadoop.

Download the binary of version 3.2.2 from https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz.
Use 7-Zip to untar the downloaded .tar.gz file.
1. Use 7-Zip to open the downloaded hadoop-3.2.2.tar.gz.
2. Double click hadoop-3.2.2.tar. It takes some time to decompress.
3. Select the folder hadoop-3.2.2, then click Extract. Do not directly drag the folder to untar.
4. In the dialog, change Copy to path to C:\, then OK.
5. It takes some time to untar the files. If you see errors like Cannot create symbolic link: ..., just click Close to ignore them. Windows does not support Unix-style symbolic links, and the affected files are not used on Windows, so it is safe to ignore these errors.
Download the patch.
1. Open https://download-directory.github.io/, paste https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.2/bin into the text box, and press Enter. This downloads a single folder of a GitHub repository. The downloaded file should be cdarlint winutils master hadoop-3.2.2-bin.zip.
2. Open the downloaded zip file, extract all 15 files to C:\hadoop-3.2.2\bin, and overwrite the existing files (there should be 8 files to overwrite).
- If https://download-directory.github.io/ is down, use another tool to download a single GitHub folder, or clone https://github.com/cdarlint/winutils and copy the files inside winutils\hadoop-3.2.2\bin\.

Install IntelliJ Community Edition

Download and install the latest version from https://www.jetbrains.com/idea/download/.

If you use Ubuntu, you can install it from the Software Center; make sure you choose the Community Edition. Alternatively, use the following command:

```
sudo snap install intellij-idea-community --classic
```

Alternative Solutions for Windows

Hadoop may still not work properly on Windows. In that case, either install Linux in a virtual machine such as VirtualBox, or install WSL, then download and configure all the software using the instructions for Linux.

Environment Configuration

Linux and macOS (Environment)

Run the following command
```
touch ~/.profile
```

Add the following lines to ~/.profile.

```
# Use this line below for Linux
export JAVA_HOME="/home/$LOGNAME/jdk1.8.0_333"
# Use this line below for macOS
export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_333.jdk/Contents/Home"
# Alternatively for macOS, use this line instead to let the system locate the installed JDK
export JAVA_HOME=$(/usr/libexec/java_home)

# Copy the following lines for Linux
export MAVEN_HOME="/home/$LOGNAME/apache-maven-3.8.5"
export HADOOP_HOME="/home/$LOGNAME/hadoop-3.3.3"
# Copy the following lines for macOS
export MAVEN_HOME="/Users/$LOGNAME/apache-maven-3.8.5"
export HADOOP_HOME="/Users/$LOGNAME/hadoop-3.3.3"

export PATH=$HADOOP_HOME/bin:$MAVEN_HOME/bin:$JAVA_HOME/bin:$PATH
```

Restart the terminal, or run
```
source ~/.profile
```
Test *_HOME, run
```
echo $JAVA_HOME
echo $MAVEN_HOME
echo $HADOOP_HOME
```
They should not print empty lines, but the paths you just configured.
Test Java compiler, run
```
javac -version
```
It should print
```
javac 1.8.0_333
```

Test Maven, run

```
mvn -version
```

It should print (path and OS information may differ)

```
Apache Maven 3.8.5 (3599d3414f046de2324203b78ddcf9b5e4388aa0)
Maven home: /Users/Merlin/apache-maven-3.8.5
Java version: 1.8.0_333, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk1.8.0_333.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "12.3", arch: "x86_64", family: "mac"
```

Test Hadoop, run

```
hadoop version
```

It should print

```
Hadoop 3.3.3
Source code repository https://github.com/apache/hadoop.git -r d37586cbda38c338d9fe481addda5a05fb516f71
Compiled by stevel on 2022-05-09T16:36Z
Compiled with protoc 3.7.1
From source with checksum eb96dd4a797b6989ae0cdb9db6efc6
This command was run using /Users/Merlin/hadoop-3.3.3/share/hadoop/common/hadoop-common-3.3.3.jar
```

On macOS, if you see (base) at the beginning of every command line, you likely have Conda (Anaconda or Miniconda) installed with auto-activation enabled. It overrides your environment settings, so you would have to run source ~/.profile every time. You can disable auto-activation with this command:

```
conda config --set auto_activate_base false
```

Windows (Environment)

Search for View advanced system settings in the task bar (case insensitive) and open it.
Click Environment Variables....
Click the first New button to create an environment variable for the current user (you can also do this and the following steps under System variables, but make sure all your changes are either for the current user or for the system).
Repeat step 3 for the following 3 variables:

| Variable name | Variable value |
| --- | --- |
| `JAVA_HOME` | `C:\jdk1.8.0_333` |
| `MAVEN_HOME` | `C:\apache-maven-3.8.5` |
| `HADOOP_HOME` | `C:\hadoop-3.2.2` |

You should see the 3 variables added.
Find Path under System variables, then double click it or click Edit. If you find an entry similar to C:\Program Files\Common Files\Oracle\Java\javapath, delete it from Path.
Double click the variable Path (if you added the 3 *_HOME variables for the user, edit Path for the user; otherwise, edit Path for the system).
Add the following 3 lines (you can click New or just double click on an empty line).
```
%JAVA_HOME%\bin
%MAVEN_HOME%\bin
%HADOOP_HOME%\bin
```
Click OK multiple times until the System Properties dialog (step 2) is closed. Then restart Command Prompt or Windows Terminal.
Test *_HOME, run
- Command Prompt
```
echo %JAVA_HOME%
echo %MAVEN_HOME%
echo %HADOOP_HOME%
```
- PowerShell (Windows Terminal)
```
echo $Env:JAVA_HOME
echo $Env:MAVEN_HOME
echo $Env:HADOOP_HOME
```
They should not print empty lines, but the paths you just configured.
Test Java compiler, run
```
javac -version
```
It should print
```
javac 1.8.0_333
```

Test Maven, run

```
mvn -version
```

It should print (path and OS information may differ)

```
Apache Maven 3.8.5 (3599d3414f046de2324203b78ddcf9b5e4388aa0)
Maven home: C:\apache-maven-3.8.5
Java version: 1.8.0_333, vendor: Oracle Corporation, runtime: C:\jdk1.8.0_333\jre
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
```

Test Hadoop, run

```
hadoop version
```

It should print

```
Hadoop 3.2.2
Source code repository Unknown -r 7a3bc90b05f257c8ace2f76d74264906f0f7a932
Compiled by hexiaoqiao on 2021-01-03T09:26Z
Compiled with protoc 2.5.0
From source with checksum 5a8f564f46624254b27f6a33126ff4
This command was run using /C:/hadoop-3.2.2/share/hadoop/common/hadoop-common-3.2.2.jar
```
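
The checks in this section can also be done in one go. The following is a minimal sketch, not part of Hadoop or Maven: the class name EnvCheckSketch and the method check are hypothetical. It assumes that each *_HOME variable should be set, should not contain a space (as required earlier in this tutorial), and should point to a folder that has a bin subfolder.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this tutorial; it is not part of Hadoop or Maven.
class EnvCheckSketch {

    // Returns the problems found for one *_HOME variable (empty list if none).
    static List<String> check(String name, String value) {
        List<String> problems = new ArrayList<>();
        if (value == null || value.trim().isEmpty()) {
            problems.add(name + " is not set");
            return problems;
        }
        if (value.contains(" ")) {
            problems.add(name + " contains a space: " + value);
        }
        // Assumption: JDK, Maven, and Hadoop all keep their launchers in <home>/bin.
        if (!new File(value, "bin").isDirectory()) {
            problems.add(name + " has no bin folder: " + value);
        }
        return problems;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"JAVA_HOME", "MAVEN_HOME", "HADOOP_HOME"}) {
            String value = System.getenv(name);
            List<String> problems = check(name, value);
            if (problems.isEmpty()) {
                System.out.println(name + " looks fine: " + value);
            } else {
                for (String problem : problems) {
                    System.out.println("Problem: " + problem);
                }
            }
        }
    }
}
```

Run it from a newly opened terminal so that it sees the variables you just configured.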

Create WordCount Project in IntelliJ

Open IntelliJ, click New Project.
Expand Advanced Settings, set GroupId to edu.ucr.cs.merlin (you can change the GroupId to any string you like), and set ArtifactId to wordcount. The Name should change automatically; you can give it a different name if you want. Make sure JDK shows 1.8 or 8, and that Build system is Maven. Click Create.
It takes some time to load and download the necessary dependencies (there will be a progress bar at the bottom).
File pom.xml should open automatically.

Add the following line to the <properties> block.

Linux and macOS
```
<hadoop.version>3.3.3</hadoop.version>
```
Windows
```
<hadoop.version>3.2.2</hadoop.version>
```

Add the following blocks to the XML root <project>.

```
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
<repositories>
    <repository>
        <id>apache</id>
        <url>http://maven.apache.org</url>
    </repository>
</repositories>
```

pom.xml should look like this.

```
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>edu.ucr.cs.merlin</groupId>
    <artifactId>wordcount</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <hadoop.version>3.3.3</hadoop.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
    <repositories>
        <repository>
            <id>apache</id>
            <url>http://maven.apache.org</url>
        </repository>
    </repositories>
</project>
```

Click the floating m icon to reload Maven dependencies.
In the left Project Browser, select src → main → java. Right click, select New, then Package.
Set the package name to edu.ucr.cs.merlin (you may use another name).
In the left Project Browser, select the package you just created. Right click, select New, then Java Class.
Set the class name to WordCount.

In the left Project Browser, open the WordCount class and paste the following code (keep the first package line).

```
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                        Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is from the original Hadoop MapReduce Tutorial.
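
To make the logic of the job easier to follow, here is a minimal plain-Java sketch of what TokenizerMapper and IntSumReducer compute together. It is an illustration only: the class WordCountSketch and the method countWords are hypothetical names, the code does not use the Hadoop API, and everything runs in a single process. Do not paste it into the WordCount class.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration only; it does not use the Hadoop API.
class WordCountSketch {

    static Map<String, Integer> countWords(List<String> lines) {
        // TreeMap keeps the words sorted, similar to how a reducer sees its keys in order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: split the line on whitespace, as TokenizerMapper does.
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken();
                // "Reduce" step: add the emitted 1 to the running sum for this word.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello maven");
        for (Map.Entry<String, Integer> entry : countWords(lines).entrySet()) {
            // Print one word and its count per line, separated by a tab.
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the two sample lines, the sketch counts hello twice and hadoop and maven once each.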

Right click WordCount either in the Project Browser or on the open tab (circled in the above image), select Run 'WordCount.main()'.

IntelliJ should compile your code and run the class. This time it will fail, and you should see the following messages.

```
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
    at edu.ucr.cs.merlin.WordCount.main(WordCount.java:59)

Process finished with exit code 1
```

This step only tests whether your code compiles and creates a run configuration for the class. It fails because no program arguments are set yet, so args[0] does not exist.
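
If you prefer a readable message over the ArrayIndexOutOfBoundsException, the arguments can be checked before they are used. The following is a minimal stand-alone sketch under that assumption: ArgsCheckSketch and requireInputAndOutput are hypothetical names and the usage text is made up for illustration. It is not part of the original WordCount code.

```java
// Hypothetical stand-alone sketch; a similar check could sit at the top of WordCount.main.
class ArgsCheckSketch {

    // Returns {input, output}, or throws if fewer than two arguments were given.
    static String[] requireInputAndOutput(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("Usage: WordCount <input path> <output path>");
        }
        return new String[] {args[0], args[1]};
    }

    public static void main(String[] args) {
        try {
            String[] paths = requireInputAndOutput(args);
            System.out.println("input=" + paths[0] + ", output=" + paths[1]);
        } catch (IllegalArgumentException e) {
            // Print the usage message instead of a stack trace.
            System.err.println(e.getMessage());
            System.exit(2);
        }
    }
}
```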

In the left Project Browser, create a folder input under your project's root folder, put any text file(s) into input for testing.
On the top right, find WordCount next to a green hammer icon, click the down triangle icon, and select Edit Configurations.
Check that the first box under Build and run shows java 8 or java 1.8. Enter input output in the third box (second line), then click OK.
Run 'WordCount.main()' again.
- If you want to rerun your program, delete the output folder first; otherwise you will see a FileAlreadyExistsException.
The program will create an output folder with some files in it. The actual output files should have names part-r-#####.
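
Deleting the output folder by hand before every run gets tedious. As one possible shortcut, the following minimal sketch removes a local folder recursively using only the standard java.nio.file API. The class name CleanOutputSketch and the method deleteRecursively are hypothetical, and the default folder name output is an assumption that matches the program argument used above.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper; it works on the local file system only, not on HDFS.
class CleanOutputSketch {

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return; // nothing to clean up
        }
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file); // delete each file first
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(d); // then delete the folder once it is empty
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the folder is named "output" unless another path is given.
        Path output = Paths.get(args.length > 0 ? args[0] : "output");
        deleteRecursively(output);
        System.out.println("Ready for a fresh run: " + output.toAbsolutePath());
    }
}
```

Run it from the project's root folder before rerunning WordCount. Be careful with the path you pass in, since the folder is deleted without confirmation.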

Run WordCount from Command Line

To build a JAR package, cd to your project folder, then run
```
mvn package
```

Run

Linux and macOS

```
hadoop jar ./target/wordcount-1.0-SNAPSHOT.jar edu.ucr.cs.merlin.WordCount input output
```

Windows

```
hadoop jar .\target\wordcount-1.0-SNAPSHOT.jar edu.ucr.cs.merlin.WordCount input output
```

Example output