
Spark-EMR

Spark EMR (Spot) cluster manager: run a Python package on AWS EMR.

Install

Development install:

$ pip install -e .

Testing:

$ pip install tox
$ tox

Setup

The easiest way to get EMR up and running is to go through the web interface: create an SSH key and start a cluster by hand. This will create the needed subnet and default EMR roles, which you can then reference in the config below.
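If you prefer the command line, the default EMR roles and the SSH key can also be created with the AWS CLI (a sketch; XXXXX is a placeholder key name matching the ssh_key entry in the config below, and the subnet still comes from your VPC or a manually started cluster):

$ aws emr create-default-roles
$ aws ec2 create-key-pair --key-name XXXXX --query 'KeyMaterial' --output text > XXXXX.pem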

Config yaml file

Create a config.yaml per project, or place a default at ~/.config/spark-emr.yaml:

bootstrap_uri: s3://foo/bar
master: 
  instance_type: m4.large
  size_in_gb: 100
core: 
  instance_type: m4.large
  instance_count: 2
  size_in_gb: 100
ssh_key: XXXXX
subnet_id: subnet-XXXXXX
python_version: python36
emr_version: emr-5.20.0
consistent: false
optimization: false
region: eu-central-1
job_flow_role: EMR_EC2_DefaultRole 
service_role: EMR_DefaultRole

CLI-Interface

Start

To run Python code on EMR you need to build a proper Python package, i.e. with a setup.py that defines console_scripts. The installed script name needs to end in .py, or YARN won't be able to execute it.
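A minimal sketch of such a package, assuming a hypothetical etl_pypackage with an etl.main:main entry point (the names are illustrative; the console_scripts entry installs the etl.py script used in the examples below):

from setuptools import setup, find_packages

setup(
    name="etl_pypackage",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # the installed script name must end in .py so YARN can execute it
            "etl.py = etl.main:main",
        ]
    },
)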

Bootstrap a cluster, install the Python package, execute the task given in --cmdline, poll the cluster until finished, and stop the cluster:

$ spark-emr start \
[--config config.yaml] \
--name "Spark-ETL" \
--bid-master 0.04 \
--bid-core 0.04 \
--cmdline "etl.py --input s3://in/in.csv --output s3://out/out.csv" \
--tags foo 2 bar 4 \
--poll \
--yarn-log \
--package "../"

Running with a released package version (installed via pip):

$ spark-emr start \
... \
--package pip+etl_pypackage

Status

Return the status of a cluster (including terminated ones):

$ spark-emr status --cluster-id j-XXXXX

List

List all clusters, optionally filtered by tag:

$ spark-emr list [--config config.yaml] [--filter somekey somevalue]

Stop

Stop a running cluster:

$ spark-emr stop --cluster-id j-XXXXX

Spot price check

This call returns the spot price for all regions and the configured instance types:

$ spark-emr spot

Appendix

Running commands on EMR

The generated command can also be run directly on the master node:

$ /usr/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python35 \
--conf spark.executorEnv.PYSPARK_PYTHON=python35 \
/usr/local/bin/etl.py --input s3://in/in.csv --output s3://out/out.csv
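
For reference, the etl.py entry point used in these examples might look like the following (a sketch; the module path etl/main.py and the pass-through CSV logic are assumptions, not code from this repository):

# etl/main.py -- hypothetical module behind the etl.py console script
import argparse

from pyspark.sql import SparkSession


def main():
    parser = argparse.ArgumentParser(description="Copy a CSV from input to output")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("etl").getOrCreate()
    # read the input CSV and write it back out unchanged
    df = spark.read.csv(args.input, header=True)
    df.write.csv(args.output, header=True, mode="overwrite")
    spark.stop()


if __name__ == "__main__":
    main()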

Running commands on docker

To test whether our Spark job runs as expected, we can run it locally in Docker:

$ git clone https://github.com/delijati/spark-docker
$ cd spark-docker
$ docker build . --pull -t spark

Now we can run our Spark job locally:

$ docker run --rm -ti -v `pwd`/test/dummy:/app/work spark \
bash -c "cd /app/work && pip3 install -e . && spark_emr_dummy.py 10"