A giter8 template for getting set up with Apache Spark in CDH.
There are two different ways to use this template.
With sbt launcher version 0.13.13 or above, you can run `sbt new squito/cdh-spark.g8` and follow the interactive prompts.
Alternatively:

- Install giter8
- Run `g8 squito/cdh-spark` and follow the prompts
To create a CDH5 / Spark 1.x application, use a separate branch of this repository: `sbt new squito/cdh-spark.g8 --branch cdh5.x_spark1.x`.
To build and run the generated project with sbt:

- Open an sbt session in the project root: `sbt`, then select the core project: `project core`
- Compile the code: `compile`
- Run the app: `runMain <your-package>.SparkWordCount local[*] <some input file>`. (If you don't specify an input file, it will just use the `pom.xml` sitting there. It'll work, but it isn't very interesting.)
Or, to build and run with Maven instead: after a Maven build (at least `mvn package`), execute:

```
mvn exec:java -Dexec.classpathScope="compile" -pl core -Dexec.mainClass="<your-package>.SparkWordCount" -Dexec.args="local[*] <some input file>"
```
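For context, `SparkWordCount` is the example main class the template generates in your chosen package. Its exact contents may differ, but a word-count app along these lines is what the commands above run. A minimal sketch (the `com.mycompany` package name is just a placeholder):

```scala
package com.mycompany // placeholder; the template puts the class in your own package

import org.apache.spark.{SparkConf, SparkContext}

// Minimal word-count sketch: args(0) is the Spark master (e.g. local[*]),
// args(1) is an optional input file, falling back to pom.xml as noted above.
object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val master = args(0)
    val input = if (args.length > 1) args(1) else "pom.xml"
    val sc = new SparkContext(new SparkConf().setAppName("SparkWordCount").setMaster(master))
    try {
      val counts = sc.textFile(input)
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.take(20).foreach { case (word, n) => println(s"$word: $n") }
    } finally {
      sc.stop()
    }
  }
}
```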
To set up the project in IntelliJ:

- Open IntelliJ
- From the menu, choose "File / Import Project"
- Choose the directory you have just created
- Choose "Import Project From External Module / Maven"
- Click through the remaining dialogs
Then, for continuous compilation while you edit:

- Open up an sbt session: `sbt`
- Inside sbt, run `~compile`. Leave the sbt session open. After the first full compile, you'll see something like `1. Waiting for source changes... (press enter to interrupt)`.
- Change code (with IntelliJ, vim, emacs, whatever). Save your code, and watch sbt recompile.
(first I need to write an example unit test)
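In the meantime, here is a rough, hypothetical sketch of what such a test could look like, assuming ScalaTest is available as a test dependency (that is an assumption, not something the template guarantees) and using a local-mode `SparkContext`:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite // assumes ScalaTest 3.0.x or earlier on the test classpath

// Hypothetical example test: runs a tiny word count against a local-mode SparkContext.
class WordCountSuite extends FunSuite {
  test("counts words in a small in-memory dataset") {
    val sc = new SparkContext(
      new SparkConf().setAppName("WordCountSuite").setMaster("local[2]"))
    try {
      val counts = sc.parallelize(Seq("a b", "a"))
        .flatMap(_.split("\\s+"))
        .map(w => (w, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("a") === 2)
      assert(counts("b") === 1)
    } finally {
      sc.stop()
    }
  }
}
```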
To run on a cluster, you need to create a jar which contains all of your code and dependencies. However, you also want to make sure that your jar does not contain jars which are already available on the cluster. This keeps the jar small, so it is quicker to package and ship across the cluster (and also helps avoid confusing errors if multiple versions of a library end up on the classpath).
Instructions for building such a jar vary slightly depending on the build tool. Note that the project here has been carefully configured to make packaging work this way -- e.g., not every sbt project will be able to build a jar like this out of the box.
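The standard way to get this behavior (and presumably roughly what this project's build does, though the details here are an assumption) is to declare the artifacts the cluster already ships, such as Spark itself, in the "provided" scope, so they are excluded from the assembled jar. A minimal sbt sketch with placeholder versions:

```scala
// build.sbt (sketch, not the template's actual build): cluster-provided artifacts are
// declared "provided" so sbt-assembly leaves them out of the fat jar.
// Versions are placeholders; use the ones matching your CDH release.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided"
)
```

Maven achieves the same thing with `<scope>provided</scope>` on the corresponding dependencies.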
After packaging your jar, you can launch a Spark job on your cluster with `spark-submit`: point it at your assembled jar and your main class, e.g.:

```
spark-submit --master yarn --class com.mycompany.SparkWordCount my_cool_project-core_2.10-0.1.0-SNAPSHOT-jar-with-dependencies.jar
```
- With sbt: execute `sbt "project core" assembly`.
- With Maven: execute `mvn package`.

This will create a jar like `core/target/my_cool_project-core_2.10-0.1.0-SNAPSHOT-jar-with-dependencies.jar`. Pass that jar to `spark-submit`, as shown above, to run your code.