spark-scala-maven-boilerplate-project

This is a skeleton of a Scala project with maven to start using Spark


Instructions:

Follow this article for more detailed instructions.

Edit the class MainExample.scala, adding your Spark code, then build the project with the command:

mvn clean package
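As a concrete starting point, here is a minimal sketch of what MainExample.scala could look like. The word-count logic is an illustration only, not the repository's actual code; it assumes the two command-line arguments are the input and output paths, as in the spark-submit command below.

```scala
package com.examples

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative skeleton: a classic word count. The input and output
// paths arrive as the two arguments passed on spark-submit.
object MainExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MainExample")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))          // read from the input path
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // sum the counts per word
      .saveAsTextFile(args(1))    // write results to the output path

    sc.stop()
  }
}
```

Running this requires a Spark environment; the master and deploy mode are supplied by spark-submit, so the code itself does not hard-code them.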

Inside the /target folder you will find the resulting fat JAR, called spark-scala-maven-project-0.0.1-SNAPSHOT-jar-with-dependencies.jar. To launch the Spark job, use this command in a shell with a configured Spark environment:

spark-submit --class com.examples.MainExample \
  --master yarn-cluster \
  spark-scala-maven-project-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  inputhdfspath \
  outputhdfspath

The parameters inputhdfspath and outputhdfspath do not need the hdfs://path/to/your/file form; a plain /path/to/your/files/ is enough, because HDFS is the default file system when submitting a job. To retrieve the result locally:

hadoop fs -getmerge outputhdfspath resultSavedLocally
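For reference, the jar-with-dependencies suffix in the JAR name is produced by the Maven assembly plugin. A minimal sketch of the relevant pom.xml fragment follows; the plugin version and the mainClass manifest entry are assumptions about this project's build, not taken from its actual pom.xml.

```xml
<!-- Sketch: assembly configuration producing *-jar-with-dependencies.jar.
     Version number and manifest mainClass are assumptions. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>3.6.0</version>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>com.examples.MainExample</mainClass>
      </manifest>
    </archive>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

Binding the single goal to the package phase is what makes plain mvn clean package emit the fat JAR alongside the regular one.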