An AWS EMR cluster with Apache Livy and Apache Spark running. Access to S3 buckets where input files can be stored and Spark job output can be written. The code in this repository has been tested on an AWS EMR cluster with jobs running from 30 minutes to 24 hours.
Examples of how to submit Spark jobs through Apache Livy
This is the main piece of code, which creates, monitors, and deletes a Livy session
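The session lifecycle can be sketched against Livy's REST API using only the standard library. This is a minimal illustration, not the repository's code: the host name is hypothetical (8998 is Livy's default port), and the payload fields shown are a small subset of what `POST /sessions` accepts.

```python
import json
import time
import urllib.request

LIVY_URL = "http://emr-master:8998"  # hypothetical EMR master node; 8998 is Livy's default port

def session_payload(kind="pyspark", driver_memory="2g"):
    # Body for POST /sessions; "kind" selects the interpreter (pyspark, spark, sparkr).
    return {"kind": kind, "driverMemory": driver_memory}

def _request(method, path, body=None):
    # Small helper around urllib for JSON requests to the Livy endpoint.
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(LIVY_URL + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
        return json.loads(raw) if raw else None

def create_session():
    # POST /sessions returns the new session's id.
    return _request("POST", "/sessions", session_payload())["id"]

def wait_until_done(session_id, poll_secs=10):
    # Poll GET /sessions/{id} until the session leaves its transitional states.
    while True:
        state = _request("GET", f"/sessions/{session_id}")["state"]
        if state in ("idle", "error", "dead"):
            return state
        time.sleep(poll_secs)

def delete_session(session_id):
    # DELETE /sessions/{id} tears the session down.
    _request("DELETE", f"/sessions/{session_id}")
```

For long-running jobs (the 30 min to 24 hr range mentioned above), the polling loop is the piece that matters: the session is created once and monitored until it reaches a terminal state, then deleted.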
This code creates a Presto connection and then executes queries
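As a rough sketch of how a query round-trip works, the example below talks to Presto's `/v1/statement` REST protocol directly with the standard library, following `nextUri` pages until the result set is exhausted. The coordinator address and user name are hypothetical; the repository's code may instead use a Presto client library.

```python
import json
import urllib.request

PRESTO_URL = "http://presto-coordinator:8080"  # hypothetical coordinator address

def presto_headers(user, catalog="hive", schema="default"):
    # Headers required by Presto's /v1/statement protocol; catalog/schema
    # set the default namespace for unqualified table names.
    return {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }

def run_query(sql, user="airflow"):
    # POST the SQL text, then follow nextUri until all result pages are read.
    req = urllib.request.Request(PRESTO_URL + "/v1/statement",
                                 data=sql.encode(),
                                 headers=presto_headers(user))
    rows = []
    result = json.load(urllib.request.urlopen(req))
    while True:
        rows.extend(result.get("data", []))
        next_uri = result.get("nextUri")
        if not next_uri:
            break
        result = json.load(urllib.request.urlopen(next_uri))
    return rows
```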
Stores user credentials
Shows an example of how to connect to Livy and submit a job
This is the PySpark code that is referenced in pyspark_submitjob.py and submitted to Spark via Livy. The file is stored in an S3 bucket, and the output of the job is also written to a bucket.
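Submitting an S3-hosted script like this one is typically done through Livy's batch API: a `POST /batches` whose `file` field points at the script in S3. A minimal sketch, with a hypothetical endpoint and bucket path:

```python
import json
import urllib.request

LIVY_URL = "http://emr-master:8998"  # hypothetical EMR master node

def batch_payload(file_uri, args=None, py_files=None):
    # Body for POST /batches; "file" is the S3 URI of the PySpark script,
    # "args" are forwarded to it as command-line arguments.
    payload = {"file": file_uri}
    if args:
        payload["args"] = args
    if py_files:
        payload["pyFiles"] = py_files
    return payload

def submit_batch(file_uri, args=None):
    # Returns the batch id, which can then be polled via GET /batches/{id}.
    body = json.dumps(batch_payload(file_uri, args)).encode()
    req = urllib.request.Request(LIVY_URL + "/batches", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]
```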
Example of an Airflow DAG showing an ETL flow: a query is executed against Presto and a date is returned. The date is passed to another task via XCom, where it becomes an argument to a Spark job that is then submitted to Livy.
This is a PySpark job which is submitted to Livy via the Airflow DAG
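The XCom hand-off in the DAG above can be sketched as two `python_callable` functions. The DAG wiring itself (the `DAG` object, `PythonOperator`s, and dependencies) is elided here so the sketch stays importable without Airflow; the task id, bucket, and job path are hypothetical stand-ins, not the repository's actual names.

```python
def livy_batch_payload(job_uri, run_date):
    # The date pulled from XCom travels to the PySpark job as a command-line argument.
    return {"file": job_uri, "args": [run_date]}

def get_presto_date(**context):
    """First task: execute the Presto query and return the date.
    Airflow pushes a python_callable's return value to XCom automatically."""
    raise NotImplementedError("Presto query elided; see the Presto connection code above")

def submit_spark_job(**context):
    """Second task: pull the date from XCom and build the Livy batch request.
    POSTing this payload to Livy's /batches endpoint launches the job."""
    run_date = context["ti"].xcom_pull(task_ids="get_presto_date")  # task id is hypothetical
    return livy_batch_payload("s3://my-bucket/jobs/etl_job.py", run_date)
```

The design point here is that XCom carries only the small date string between tasks; the heavy lifting stays in the Spark job, which receives the date as an ordinary argument.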