Download MODIS tiles, mosaic, calculate output data, reproject to WGS84 GeoTiffs
An as-yet-incomplete development of code to acquire global GeoTiffs of MODIS variables using Google Cloud Dataflow.
In MAP we need global 8-daily GeoTiffs with 30 arcsecond resolution and WGS84 projection for each of 5 MODIS-derived variables: Day and Night Land Surface Temperature, EVI (Enhanced Vegetation Index), TCB (Tasseled Cap Brightness), and TCW (Tasseled Cap Wetness). These are derived from two MODIS products: MOD11A2 and MCD43D*.
To create these there are several necessary steps:
- Identify the days and tiles for which data are needed. There are up to 317 tiles per global mosaic and these are available (for the products we use) at 8-daily intervals.
- Identify the URLs from which to download the necessary tiles, which are HDF files
- Download the files from the NASA data pool (authentication is required)
- Create a virtual mosaic of the relevant layers from each HDF file (each may contain several data layers)
- Read one or more of those mosaics as raster bands and calculate the output data (e.g. converting scaled integer values into floating-point temperatures in Celsius, or calculating EVI values from reflectance band data)
- Reproject those output data into WGS84 GeoTiffs in the standard format we use
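The "calculate output data" step can be sketched as plain per-pixel functions. This is an illustrative sketch only: the MOD11A2 LST scale factor (0.02, units Kelvin, fill value 0) should be verified against the current product documentation, and the nodata value shown here is an assumption, not necessarily the one used in MAP's standard format.

```python
# Sketch of the per-pixel calculations, shown on scalar values for clarity.
# In the real pipeline these would run over whole raster bands (e.g. numpy
# arrays read from the VRT mosaics).

LST_SCALE = 0.02      # MOD11A2 scale factor (verify against product docs)
LST_FILL = 0          # MOD11A2 fill value for the LST bands
KELVIN_OFFSET = 273.15

def lst_to_celsius(scaled_value, nodata=-9999.0):
    """Convert a scaled-integer MOD11A2 LST value to degrees Celsius."""
    if scaled_value == LST_FILL:
        return nodata
    return scaled_value * LST_SCALE - KELVIN_OFFSET

def evi(nir, red, blue):
    """Standard MODIS EVI formula from reflectance values in the 0-1 range."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
```

For example, a stored LST value of 14829 corresponds to 14829 × 0.02 = 296.58 K, i.e. about 23.4 °C.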
Historically I have done this in several stages:
- use a tool such as pyModis or getModis to download all the tiles, optionally running this in parallel (using `xargs -P`, or the Windows equivalent `ppx2`) to download in several streams
- for each output day, use a batch script to:
  - call `gdalbuildvrt` to generate the "input" global data bands
  - call a Python script (like `gdal_calc.py`) to calculate the necessary output values, saving to a TIFF file in the original (sinusoidal) projection
  - call `gdalwarp` to reproject this TIFF into the final WGS84 30-arcsecond compressed TIFFs
- run several instances of the batch script in parallel, again using `xargs` / `ppx2`

Input, temporary, and output locations in the batch script were chosen to maximise throughput on a well-specced desktop workstation, in particular using a RAM disk for the temporary unprojected TIFFs. This process worked OK but was still fairly manual. The code is in a separate, very out-of-date repo at https://github.com/harry-gibson/modis-acquisition. Other colleagues have developed similar processes using slightly different tools.
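The three per-day GDAL steps above can be sketched as command lines built in Python. The file names, the band calculation shown, and the exact warp options (compression, nodata) are illustrative assumptions, not the settings of the original batch script; 30 arcseconds is 1/120 of a degree.

```python
# Sketch of the per-day batch steps as GDAL command lines.
# File names and most option values here are hypothetical.
ARCSEC_30 = 1.0 / 120.0  # 30 arcseconds in degrees

def build_day_commands(vrt_path, calc_out, final_tif, tile_hdfs):
    """Return the three GDAL commands for one output day, as argv lists."""
    # 1. Build a virtual mosaic of the relevant layer from each tile's HDF
    buildvrt = ["gdalbuildvrt", vrt_path] + tile_hdfs
    # 2. Apply the band calculation (here via gdal_calc.py, using an
    #    LST-style scale/offset as an example), writing a TIFF that is
    #    still in the original sinusoidal projection
    calc = ["gdal_calc.py", "-A", vrt_path,
            "--calc=A*0.02-273.15", "--NoDataValue=-9999",
            "--outfile", calc_out]
    # 3. Reproject to WGS84 at 30 arcseconds, with compression
    warp = ["gdalwarp", "-t_srs", "EPSG:4326",
            "-tr", str(ARCSEC_30), str(ARCSEC_30),
            "-co", "COMPRESS=LZW", calc_out, final_tif]
    return [buildvrt, calc, warp]

# Each command could then be executed in turn, e.g. with
# subprocess.run(cmd, check=True), or written out as a batch script.
```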
The present repo contains an attempt to reproduce this pipeline using Google Dataflow (Apache Beam), in the hope that it can be run with less oversight and hassle, without relying on local disk, network, and compute resources, all of which the existing process uses heavily.
First, the various steps have been collated into a single place to work out what we need to do (see notebooks/initial-coalesce-workflow).
Next, these steps have been translated into Apache Beam code (PTransforms or DoFns, and a pipeline). This was done using Google Dataflow notebooks with the Beam Interactive Runner (see notebooks/dev-pipeline-on-dataflow-notebooks).
Finally, we will transfer this into a standard Dataflow submission script (TBD).
It is highly recommended to run the Dataflow pipeline within a virtual environment. Create a virtual env and install the dependencies inside it:

```shell
python3 -m virtualenv venv
venv/bin/pip install -r requirements.txt
```

or

```shell
python3 -m venv venv
source venv/bin/activate
pip3 install wheel
pip3 install 'apache-beam[gcp]'
```
You can customize how workers are set up by updating the `setup.py` file. You can add more commands by adding new lines to the `CUSTOM_COMMANDS` list. For example:

```python
CUSTOM_COMMANDS = [
    ["apt-get", "update"],
    ["apt-get", "--assume-yes", "install", "libjpeg62"],
]
```

You can also add more dependencies for each of your workers by filling in the `requirements-worker.txt` file. Lastly, you can edit the pipeline by updating `main.py`.
In order to run the Dataflow pipeline, execute the following command:

```shell
venv/bin/python3 main.py
```

You can pass multiple parameters such as the number of workers, machine type, and more. For example, to run 10 `n1-standard-1` workers:

```shell
venv/bin/python3 main.py --num-workers 10 --machine-type "n1-standard-1"
```

As best practice, we recommend first running your pipeline locally. You can do this by passing the `--local` flag:

```shell
venv/bin/python3 main.py --local
```
| Parameter | Type | Description |
|---|---|---|
| `-n`, `--num-workers` | int | Number of workers to run the Dataflow job. |
| `-m`, `--machine-type` | str | Machine type to run the jobs on. |
| `-s`, `--disk-size` | int | Disk size (in GB) for each worker when the job is run. |
| `--project` | str | Google Cloud Platform (GCP) project to run the Dataflow job. |
| `--region` | str | Google Cloud Platform (GCP) region to run the Dataflow job. |
| `--artifact-bucket` | str | Google Cloud Storage (GCS) bucket to store temp and staging files. |
This project was generated using the standard template from ljvmiranda921/dataflow-cookiecutter.
Have other templates in mind? Feel free to make a Pull Request!