This code is meant to run through Amazon EMR as a two step process for tiling.
To run it as a standalone process, see run-emr.sh
below.
The chunk.py
is a pyspark application that runs code which takes a set of images and converts them into 1024 by 1024 tiled GeoTiffs
according to the mercator tile zoom level that matches the source image's resolution closest.
There is a single input argument, which is the local or s3 path to a JSON file that specifies the request. The request JSON looks like this:
{
"jobId": "emr-test-job-full",
"target": "s3://oam-tiles/oam-tiler-test-emr-full",
"images": [
"http://hotosm-oam.s3.amazonaws.com/356f564e3a0dc9d15553c17cf4583f21-12.tif",
"http://oin-astrodigital.s3.amazonaws.com/LC81420412015111LGN00_bands_432.TIF"
],
"workspace": "s3://workspace-oam-hotosm-org/emr-test-job-full"
}
In this input, we see that we have two input images. The order of the images in the images
array will determine the priority of the images
in the later mosaic step; images that have a lower index (higher on the list) will take priority over images that have a higher index (lower on the list). So in this
example, the pixels of 356f564e3a0dc9d15553c17cf4583f21-12.tif
would be overlayed on top of the pixels of LC81420412015111LGN00_bands_432.TIF
(356f564e3a0dc9d15553c17cf4583f21-12.tif
would be "above" LC81420412015111LGN00_bands_432.TIF
).
This set of chunked imagery is dumped into a working folder inside an S3 bucket, determined by the workspace
field in the input JSON.
Also placed in that folder is a step1_result.json
object that looks like the following:
{
"jobId": "emr-test-job-full",
"target": "s3://oam-tiles/oam-tiler-test-emr-full",
"tileSize": 1024,
"input": [
{
"gridBounds": {
"rowMax": 27449,
"colMax": 48278,
"colMin": 48269,
"rowMin": 27438
},
"tiles": "s3://workspace-oam-hotosm-org/emr-test-job-full/356f564e3a0dc9d15553c17cf4583f21-12",
"zoom": 16,
"extent": {
"xmin": 85.1512716462415,
"ymin": 28.028055072893892,
"ymax": 28.079387757442355,
"xmax": 85.20296835795914
}
},
{
"gridBounds": {
"rowMax": 434,
"colMax": 753,
"colMin": 746,
"rowMin": 427
},
"tiles": "s3://workspace-oam-hotosm-org/emr-test-job-full/LC81420412015111LGN00_bands_432.TIF",
"zoom": 10,
"extent": {
"xmin": 82.52879450382254,
"ymin": 26.359556467334755,
"ymax": 28.486769014205997,
"xmax": 84.86307565466431
}
}
]
}
The Mosaic step is a Scala Spark application that reads in the GeoTiffs and result information from the previous step and performs the mosaicing.
It outputs a set of web mercator z/x/y
256 x 256 PNG tiles to the target s3 or local directory.
To build the JAR that is to be uploaded and referenced in the EMR step, go into the mosaic
folder and run:
./sbt assembly
Then do something like:
aws s3 cp target/scala-2.10/... s3://oam-server-tiler/emr/mosaic.jar
There are shell scripts that will run this against EM in the root of the repository. It uses the awscli
tool to run EMR commands.
This will build the JAR file for the mosaic step and upload the chunk.py
and mosaic.jar
to the appropriate place in s3.
This will launch a long-running EMR cluster to run steps against. This is useful for testing, when you want to avoid waiting for a cluster to be bootstrapped while running jobs in a row.
This will output a cluster id, which should be set into the add-step*
scripts so that they run against the correct cluster.
This will run the tiling process against a long-running cluster deployed with launch-cluster.sh
. Just make sure to set the correct
CLUSTER_ID
at the top. Also, make sure to set the NUM_EXECUTORS
and EXECUTOR_MEMORY
correctly, based on the number of nodes in
the cluster (one executor per core per worker node). 2304m
is the amount of memory to give one executor for the 4-core m3.xlarge
nodes.
Also, you'll need to set REQUEST_URI
and WORKSPACE_URI
, which set the request for the chunk step and the location of the workspace for
the second step, respectively.
This is how to run a one-off, standalone tiling job. To change the number of
nodes in the cluster, the spot instance bid price, instance types, job id, and (most
importantly) input imagery, edit it before running. To see status, use the EMR
console. When steps are complete, imagery will have been pushed to the path
specified in target
(in the job configuration).
Follow these instructions: Namely set up the ssh tunnel using: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
ssh -i ~/mykeypair.pem -N -D 8157 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com
Use foxyproxy to set up proxy forwarding; instructions are found here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-proxy.html However, use the following foxyproxy settings instead of the one listed in that doc (this enables getting to the Spark UI)
<?xml version="1.0" encoding="UTF-8"?>
<foxyproxy>
<proxies>
<proxy name="emr-socks-proxy" id="56835462" notes="" enabled="true" color="#0055E5" mode="manual" autoconfMode="pac" lastresort="false">
<manualconf host="localhost" port="8157" socksversion="5" isSocks="true" />
<autoconf url="" reloadPac="false" loadNotification="true" errorNotification="true" autoReload="false" reloadFreqMins="60" disableOnBadPAC="true" />
<matches>
<match enabled="true" name="*ec2*.amazonaws.com*" isRegex="false" pattern="*ec2*.amazonaws.com*" reload="true" autoReload="false" isBlackList="false" isMultiLine="false" fromSubscription="false" caseSensitive="false" />
<match enabled="true" name="*ec2*.compute*" isRegex="false" pattern="*ec2*.compute*" reload="true" autoReload="false" isBlackList="false" isMultiLine="false" fromSubscription="false" caseSensitive="false" />
<match enabled="true" name="10.*" isRegex="false" pattern="http://10.*" reload="true" autoReload="false" isBlackList="false" isMultiLine="false" fromSubscription="false" caseSensitive="false" />
<match enabled="true" name="*ec2.internal*" isRegex="false" pattern="*ec2.internal*" reload="true" autoReload="false" isBlackList="false" isMultiLine="false" fromSubscription="false" caseSensitive="false" />
<match enabled="true" name="*compute.internal*" isRegex="false" pattern="*compute.internal*" reload="true" autoReload="false" isBlackList="false" isMultiLine="false" fromSubscription="false" caseSensitive="false" />
</matches>
</proxy>
</proxies>
</foxyproxy>
After foxyproxy is enabled and the SSH tunnel is set up, you can click on the Resource Manager
in the AWS UI for the cluster, which takes you to the YARN UI. For a running job, there will be an Tracking UI Application Master
link all the way to the right for that job. Click on that, and it should bring up the Spark UI.
10 m3.xlarge worker nodes and 1 m3.xlarge master at spot prices (around 5 cents an hour) About 4 cents, 40 cents for workers, 4 cents for master. Tiling job for under 50 cents. 9.5 G of imagery. (32 images) 13 minutes of processing ~15 minutes of spin up time for cluster