Running pipelines on AWS?
rsignell-usgs opened this issue · 1 comment
rsignell-usgs commented
Hello! Newbie here. :)
Any tips for someone who would like to try an xarray-beam rechunking pipeline on AWS?
(USGS is wedded to AWS at the moment)
alxmrs commented
Sorry for the late reply. I'm not too familiar with how we can run this on AWS (I haven't done it myself), but here's how I would do it:
- Set up a (py)Spark cluster in AWS EMR; see the AWS docs or this step-by-step Medium post.
- Specify the `SparkRunner` in the `--runner` flag of your Python Beam pipeline.
  - See also: https://beam.apache.org/documentation/runners/spark/#running-on-a-pre-deployed-spark-cluster
- Likely, you'll need to install apache_beam with AWS's extra requirements: `pip install apache_beam[aws]`.
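To make the runner step above concrete, here is a minimal, untested sketch of passing the runner flags to a Beam pipeline from Python. The helper names `spark_pipeline_args` and `run`, the toy `Create`/`Map` pipeline, and the master URL are placeholders of mine. I'm also assuming `--spark_master_url` is the right option for your cluster; EMR setups may need something different, so check the Beam Spark runner docs linked above.

```python
def spark_pipeline_args(spark_master_url):
    """Build Beam command-line flags that select the SparkRunner.

    `spark_master_url` is whatever address your Spark master exposes
    (hypothetical here; it depends on how the EMR cluster is set up).
    """
    return [
        "--runner=SparkRunner",
        f"--spark_master_url={spark_master_url}",
    ]


def run(argv):
    """Run a toy pipeline with the given flags (stand-in for a real job)."""
    # Imported inside the function so the flag helper above can be used
    # (and inspected) on a machine that does not have Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Create" >> beam.Create([1, 2, 3])
            | "Double" >> beam.Map(lambda x: x * 2)
            | "Print" >> beam.Map(print)
        )
```

You would then call something like `run(spark_pipeline_args("spark://<your-master-host>:7077"))`, swapping the toy transforms for the xarray-beam rechunking steps once the cluster connection works.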
I hope that helps.