/geotrellis-spark-job.g8

sbt Template for Batch GeoTrellis Spark Jobs

Primary LanguageShellOtherNOASSERTION

geotrellis-spark-job.g8

This is a Giter8 template for performing ingests using GeoTrellis. In addition to loading, formatting, and saving data, this template also serves as a reference for how to perform such operations in GeoTrellis.

Requirements

Requirement Version
Spark >=2.4
Scala 2.11

Along with the above requirements, the environment variable, SPARK_HOME, must be set.

Setup

To setup the template, run the following command:

g8 geotrellis/geotrellis-spark-job.g8
# or
sbt new geotrellis/geotrellis-spark-job.g8

Once the command is run, a series of prompts regarding the new project will be presented. These values can be changed as needed.

Running

To run an ingest, first go to the root of the project and enter the sbt console with:

sbt

Then type, run followed by the parameters listed bellow to perform the ingest.

Command Description
--outputPath URI of input file
--inputPath URI of output file
--numPartitions Optional, the number of partitions to use during the ingest.

An example command would look like:

sbt:geotrellis-spark-job> test:runMain geotrellis.batch.Main --inputPath /tmp/cropped.tif --outputPath file:///tmp/test-catalog

EMR Configuration

This project uses sbt-lighter plugin to simplify EMR cluster configuration and Spark jobs deployment. It's important to remember, that some of the Amazon instances are EBS only instances, so check out instances types description carefully. By default, sbt-lighter plugin allocates only a 32Gb EBS volume, and as a consequence you may experience issues with Spark jobs completion. Consider allocating larger EBS volumes in such cases using the following configurations:

// size in GBs
// setting to None would disable EBS volumes provisioning
sparkMasterEbsSize := Some(64)
sparkCoreEbsSize := Some(64)

These should be placed into a build.sbt file. The other option would be to use a different instance type that doesn't require EBS volumes usage.