This guide walks you through the process of creating a Spring Hadoop YARN application.
What you’ll build
You’ll build a simple Hadoop YARN application with Spring Hadoop and Spring Boot.
What you’ll need
-
The usual prerequisites for these guides: an editor, a JDK, and build tools.
-
A local single-node instance based on Hadoop 2.2.0 or later. The Apache Hadoop site has some instructions.
Hadoop YARN Intro
If you have been following the Hadoop community over the past year or two, you’ve probably seen a lot of discussion around YARN and the next version of Hadoop’s MapReduce, called MapReduce v2. YARN (Yet Another Resource Negotiator) is a component of the MapReduce project created to overcome some performance issues in Hadoop’s original design. The fundamental idea of MapReduce v2 is to split the two main functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global Resource Manager (RM) and a per-application Application Master (AM). A generic diagram of the YARN component dependencies can be found on the Hadoop page describing the YARN architecture.
MapReduce Version 2 is a rewrite of the original MapReduce code that runs as an application on top of YARN. It is also possible to write other types of applications that have nothing to do with MapReduce and run them on YARN. However, the YARN APIs are low-level infrastructure APIs rather than high-level developer APIs, so writing a custom YARN-based application is difficult.
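To make the split between a global Resource Manager and a per-application Application Master more concrete, here is a toy model in plain Java. It is only a sketch: ToyResourceManager and ToyApplicationMaster are hypothetical names invented for illustration and are not part of the Hadoop YARN API. One object owns the cluster-wide capacity, while each application has its own object that negotiates for containers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, NOT the Hadoop YARN API.
class ToyResourceManager {
    private int freeContainers;
    private int nextId = 1;

    ToyResourceManager(int totalContainers) {
        this.freeContainers = totalContainers;
    }

    // Cluster-wide concern: hand out a container while capacity remains.
    // Returns a container id, or -1 when the cluster is full.
    synchronized int allocate() {
        if (freeContainers == 0) {
            return -1;
        }
        freeContainers--;
        return nextId++;
    }

    // Gives one container back to the cluster.
    synchronized void release() {
        freeContainers++;
    }

    synchronized int free() {
        return freeContainers;
    }
}

class ToyApplicationMaster {
    private final ToyResourceManager rm;
    private final List<Integer> containers = new ArrayList<>();

    ToyApplicationMaster(ToyResourceManager rm) {
        this.rm = rm;
    }

    // Per-application concern: ask the RM for the containers this
    // application needs. Returns how many were actually granted.
    int requestContainers(int wanted) {
        int granted = 0;
        for (int i = 0; i < wanted; i++) {
            int id = rm.allocate();
            if (id < 0) {
                break;
            }
            containers.add(id);
            granted++;
        }
        return granted;
    }

    // Releases every container held by this application.
    void finish() {
        for (int i = 0; i < containers.size(); i++) {
            rm.release();
        }
        containers.clear();
    }
}
```

With a three-container toy cluster, a first application asking for two containers gets both, and a second application asking for two gets only the one that is left until the first application finishes.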
Spring YARN Intro
The development process for a YARN application, from the moment when a developer starts his or her work to the point when someone actually executes the application on a Hadoop cluster, is a bit more complicated than just creating a few lines of "Hello world!" code.
Let’s see what needs to be considered:
-
What is the project structure for the application code?
-
How is the project built and packaged?
-
How is the packaged application configured?
-
How is the final application executed on YARN?
We believe that Spring YARN and Spring Boot together create a very clear story for how the above topics can be handled.
At a high level, Spring YARN provides three components, YarnClient, YarnAppmaster and YarnContainer, which together can be called a Spring YARN application. We provide default implementations for all components while still giving the end user the option to customize as much as he or she wants.
In a pure Hadoop environment it has always been a cumbersome process to get your own code packaged, deployed and executed on a Hadoop cluster. Should you just put your compiled package on Hadoop’s classpath, or rely on Hadoop’s tools to copy your artifacts into Hadoop during job submission? What if your own code depends on a library that isn’t already present on Hadoop’s default classpath? Even worse, what if the dependencies in your code collide with libraries already on Hadoop’s default classpath?
With Spring Boot you can work around all these issues. You either create an executable jar (sometimes called an uber or fat jar) which bundles all dependencies, or a zip package which can be automatically extracted before the code is about to be executed. In the latter case, it’s possible to re-use entries already available on Hadoop’s default classpath.
In this guide we are going to show how these three components, YarnClient, YarnAppmaster and YarnContainer, are packaged into executable jars using Spring Boot. Internally, Spring Boot relies heavily on application auto-configuration, and Spring YARN adds its own auto-configuration magic. The application developer can then concentrate on his or her own code and application configuration instead of spending a lot of time trying to understand how all the components should integrate with each other.
Set up the project
build.gradle
link:initial/build.gradle[]
settings.gradle
link:initial/settings.gradle[]
In the above Gradle build file we simply create three different jars, each containing the classes for its specific role. These jars are then repackaged by Spring Boot’s Gradle plugin to create executable jars.
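One way to check that a repackaged jar is executable with java -jar is to read the Main-Class attribute from its manifest. The following is a minimal sketch using only the JDK; JarInspector is a hypothetical helper name and is not part of Spring Boot or Spring YARN.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper for illustration; not part of Spring Boot or Spring YARN.
class JarInspector {
    // Returns the Main-Class manifest attribute of the jar, or null when the
    // jar has no manifest or no Main-Class entry.
    static String mainClass(Path jarPath) {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a jar repackaged by Spring Boot, this attribute typically names a Spring Boot launcher class rather than your own application class, so a non-null value only tells you that the jar declares an entry point.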
Create a Yarn Container
Here you create the ContainerApplication and HelloPojo classes.
gs-yarn-basic-container/src/main/java/hello/container/ContainerApplication.java
link:complete/gs-yarn-basic-container/src/main/java/hello/container/ContainerApplication.java[]
In the above ContainerApplication, notice how we added the @Configuration annotation at the class level and the @Bean annotation on the helloPojo() method. We have jumped a little ahead of what you most likely expected us to do. We previously mentioned the YarnContainer component, which is an interface to whatever you execute in your containers. You could define a custom class that implements this interface and wrap all the logic inside that implementation.
However, Spring YARN defaults to a DefaultYarnContainer if none is defined, and this default implementation expects to find a specific bean type in the Spring application context that holds the real user-facing logic the container is supposed to run.
gs-yarn-basic-container/src/main/java/hello/container/HelloPojo.java
link:complete/gs-yarn-basic-container/src/main/java/hello/container/HelloPojo.java[]
The HelloPojo class is a simple POJO in the sense that it doesn’t extend any Spring YARN base classes. What we did in this class:
-
We added a class-level @YarnComponent annotation.
-
We added a method-level @OnContainerStart annotation.
-
We @Autowired Hadoop’s Configuration class.
@YarnComponent is a stereotype annotation that provides a Spring @Component annotation. It automatically marks a class as a candidate for providing YarnContainer functionality.
Within this class we can use the @OnContainerStart annotation to mark a public method with a void return type and no arguments so that it acts as an entry point for application code that needs to be executed on Hadoop.
To demonstrate that this class has some real functionality, we simply use Spring Hadoop’s FsShell to list entries from the root of the HDFS file system. For this we need Hadoop’s Configuration, which is prepared for you so that you can rely on autowiring to access it.
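The entry-point convention described in this section (a public method with a void return type and no arguments, marked by an annotation on an otherwise plain class) can be illustrated with plain reflection. This is a sketch of the general technique only, not of how Spring YARN is implemented: @EntryPoint, GreetingPojo and EntryPointRunner are hypothetical stand-ins, and the real @OnContainerStart annotation is not used here.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an entry-point annotation; not Spring YARN's own.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface EntryPoint {
}

// A plain POJO: no base class, just one annotated method.
class GreetingPojo {
    final List<String> messages = new ArrayList<>();

    @EntryPoint
    public void start() {
        messages.add("Hello from GreetingPojo");
    }

    // Not annotated, so the runner below ignores it.
    public void helper() {
        messages.add("helper called");
    }
}

class EntryPointRunner {
    // Invokes every public, no-argument, void method annotated with
    // @EntryPoint and returns how many methods were called.
    static int run(Object target) {
        int invoked = 0;
        for (Method m : target.getClass().getDeclaredMethods()) {
            boolean eligible = m.isAnnotationPresent(EntryPoint.class)
                    && Modifier.isPublic(m.getModifiers())
                    && m.getParameterCount() == 0
                    && m.getReturnType() == void.class;
            if (eligible) {
                try {
                    m.invoke(target);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Entry point failed: " + m.getName(), e);
                }
                invoked++;
            }
        }
        return invoked;
    }
}
```

Calling EntryPointRunner.run(new GreetingPojo()) invokes start() and skips helper(). The POJO stays free of framework base classes, and the framework side decides when the annotated method runs.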
Create a Yarn Appmaster
Here you create an AppmasterApplication class.
gs-yarn-basic-appmaster/src/main/java/hello/appmaster/AppmasterApplication.java
link:complete/gs-yarn-basic-appmaster/src/main/java/hello/appmaster/AppmasterApplication.java[]
The application class for YarnAppmaster looks even simpler than ClientApplication, which you create in the next section. Here, too, the main() method uses Spring Boot’s SpringApplication.run() method to launch an application.
One might ask: if this kind of dummy class only fires up the application, could we not use a generic class instead? The simple answer is yes; we even have a generic SpringYarnBootApplication class just for this purpose. You’d define it as the main class of the executable jar, and you’d do so in the Gradle build.
In real life, however, you will most likely need to add more custom functionality to your application component, and you’d do that by adding more beans. To do that you need to define a Spring @Configuration or @ComponentScan. AppmasterApplication would then act as your main starting point for defining more custom functionality. Effectively, this is exactly what we did with the YarnContainer in the section above.
Create a Yarn Client
Here you create a ClientApplication class.
gs-yarn-basic-client/src/main/java/hello/client/ClientApplication.java
link:complete/gs-yarn-basic-client/src/main/java/hello/client/ClientApplication.java[]
-
@EnableAutoConfiguration tells Spring Boot to start adding beans based on classpath settings, other beans, and various property settings.
-
Specific auto-configuration for Spring YARN components takes place in the same way as in core Spring Boot.
The main() method uses Spring Boot’s SpringApplication.run() method to launch an application. From there we simply request a bean of type YarnClient and execute its submitApplication() method. What happens next depends on the application configuration, which we go through later in this guide. Did you notice that there wasn’t a single line of XML?
Create an Application Configuration
Create a new YAML configuration file for each sub-project.
gs-yarn-basic-container/src/main/resources/application.yml
gs-yarn-basic-appmaster/src/main/resources/application.yml
gs-yarn-basic-client/src/main/resources/application.yml
link:complete/gs-yarn-basic-client/src/main/resources/application.yml[]
Note: Pay attention to the YAML file format, which expects correct indentation and no tab characters.
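Because a stray tab is hard to spot by eye, a quick check can help before packaging. The following is a minimal sketch in plain Java that reports lines whose indentation contains a tab. YamlIndentCheck is a hypothetical helper name, and the check does not validate YAML in any other way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it only looks for tabs in indentation.
class YamlIndentCheck {
    // Returns the 1-based numbers of lines whose leading whitespace
    // contains a tab character.
    static List<Integer> linesWithTabIndent(String yaml) {
        List<Integer> bad = new ArrayList<>();
        String[] lines = yaml.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            for (int j = 0; j < line.length(); j++) {
                char c = line.charAt(j);
                if (c == '\t') {
                    bad.add(i + 1);
                    break;
                }
                if (c != ' ') {
                    break; // indentation ended without a tab
                }
            }
        }
        return bad;
    }
}
```

An empty result means no line is indented with a tab. The indentation depth itself still has to be checked by a YAML parser or by Spring Boot at startup.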
The final part of your application is its runtime configuration, which glues all the components together so that they can be executed as a Spring YARN application. This configuration acts as a source for Spring Boot’s @ConfigurationProperties and contains the relevant configuration properties that cannot be auto-discovered or that an end user needs to be able to override.
This way you can define your own defaults for your environment. Because these @ConfigurationProperties are resolved at runtime by Spring Boot, you can easily override them by using command-line options, environment variables, or additional configuration property files.
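The override behaviour described above can be pictured as layers of key/value pairs in which later layers win. The sketch below is a deliberately simplified model in plain Java, assuming only three layers (file defaults, then environment, then command line). LayeredProperties is a hypothetical class, and Spring Boot’s real resolution has more sources and its own name-mapping rules.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical model of layered configuration; not Spring Boot code.
class LayeredProperties {
    // Parses arguments of the form --key=value into a map.
    // Arguments in any other form are ignored.
    static Map<String, String> parseArgs(String... args) {
        Map<String, String> parsed = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                parsed.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return parsed;
    }

    // Later layers win: file defaults < environment < command line.
    static Map<String, String> resolve(Map<String, String> fileDefaults,
                                       Map<String, String> environment,
                                       Map<String, String> commandLine) {
        Map<String, String> effective = new LinkedHashMap<>(fileDefaults);
        effective.putAll(environment);
        effective.putAll(commandLine);
        return effective;
    }
}
```

For example, if the file layer sets an illustrative key such as spring.yarn.appName and the command line passes --spring.yarn.appName=custom, the resolved value is custom, while keys that are not overridden keep their file defaults.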
Run the Application
Now that you’ve successfully compiled and packaged your application, it’s time to do the fun part and execute it on Hadoop YARN.
To accomplish this, simply run your executable client jar from the project’s root directory.
$ java -jar gs-yarn-basic-dist/target/gs-yarn-basic-dist/gs-yarn-basic-client-0.1.0.jar
Using the Resource Manager UI you can see the status of the application.
To find Hadoop’s application logs, do a simple find within the Hadoop cluster’s configured userlogs directory.
$ find hadoop/logs/userlogs/ | grep std
hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000001/Appmaster.stdout
hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000001/Appmaster.stderr
hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000002/Container.stdout
hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000002/Container.stderr
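If you would rather locate these files from Java than from the shell, a rough equivalent of the find command can be sketched with the JDK’s file-walking API. LogFinder is a hypothetical helper. Unlike the shell pipeline above, it matches only regular files and looks at the file name rather than the whole path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; uses only the JDK.
class LogFinder {
    // Walks the directory tree under root and returns the regular files whose
    // name contains "std" (for example Container.stdout), sorted by path.
    static List<Path> findStdLogs(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().contains("std"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Pointing findStdLogs at the userlogs directory should return the same kind of stdout and stderr paths as the listing above, provided the process can read that directory.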
Grep the logging output from the HelloPojo class.
$ grep HelloPojo hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000002/Container.stdout
[2014-03-23 12:42:05.763] boot - 17064 INFO [main] --- HelloPojo: Hello from HelloPojo
[2014-03-23 12:42:05.763] boot - 17064 INFO [main] --- HelloPojo: About to list from hdfs root content
[2014-03-23 12:42:06.745] boot - 17064 INFO [main] --- HelloPojo: FileStatus{path=hdfs://localhost:8020/; isDirectory=true; modification_time=1395397562421; access_time=0; owner=root; group=supergroup; permission=rwxr-xr-x; isSymlink=false}
[2014-03-23 12:42:06.746] boot - 17064 INFO [main] --- HelloPojo: FileStatus{path=hdfs://localhost:8020/app; isDirectory=true; modification_time=1395501405412; access_time=0; owner=hadoop; group=supergroup; permission=rwxr-xr-x; isSymlink=false}
Summary
Congratulations! You’ve just developed a Spring YARN application!