Build complex workflows with Amazon MWAA, AWS Step Functions, AWS Glue, and Amazon EMR

Important: this application uses various AWS services, and costs are associated with these services beyond Free Tier usage. See the AWS Pricing page for details. You are responsible for any AWS costs incurred. No warranty is implied in this example.

Code repo structure

├── README.MD                   <-- The instructions file
├── dags/mwaalib                <-- Reusable code for Amazon EMR and AWS Step Functions
├── setup                       <-- Source code for initial setup
│   ├── transform/              <-- Preprocessing PySpark code and reusable code
│   ├── template.yaml           <-- Template for basic application setup
│   └──               <-- Deploy Script 


  • AWS CLI installed and configured with administrator permissions




  1. AWS account. Create an AWS account if you do not already have one, and log in.

  2. Amazon Managed Workflows for Apache Airflow (MWAA) environment in a supported region. Create an environment if you do not have one. Note that us-west-2 is selected; change the region if required.

  3. IAM permissions for the MWAA execution role covering S3, EMR, Step Functions, and AWS Systems Manager Parameter Store.

    iam:PassRole on EMR_DEFAULT_ROLE
    iam:PassRole on EMR_EC2_ROLE

A sample policy is provided. Replace the account number in it with your AWS account number, then create the policy and attach it to the Amazon MWAA execution role.

For details, see "Adding and removing IAM identity permissions" in the IAM User Guide.
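As a rough illustration of the iam:PassRole requirement above, the following sketch builds a policy document for the two EMR roles. It is not the repo's sample policy, which also covers S3, EMR, Step Functions, and Parameter Store; the function name `build_passrole_policy` and the account number are placeholders introduced here.

```python
import json


def build_passrole_policy(account_id: str) -> dict:
    """Return an IAM policy document allowing iam:PassRole on the two EMR roles."""
    # Role names are the ones this README mentions; adjust them if yours differ.
    role_names = ["EMR_DEFAULT_ROLE", "EMR_EC2_ROLE"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": [
                    f"arn:aws:iam::{account_id}:role/{name}" for name in role_names
                ],
            }
        ],
    }


if __name__ == "__main__":
    # 111122223333 is a placeholder account number, not a real one.
    print(json.dumps(build_passrole_policy("111122223333"), indent=2))
```

The printed JSON could be pasted into the IAM console as a starting point, but the sample policy shipped with the repo remains the reference.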

A sample role YAML template is also provided in case you have not already created EMR_DEFAULT_ROLE and EMR_EC2_ROLE. Run the CloudFormation template to create the EMR roles.

Installation Instructions

  2. Clone the repo onto your local development machine using git clone.

  3. From the command line, change directory into the setup folder, then run:

    ./ -s <MWAA Airflow Dag Bucket Name> -d <Demo Data Bucket Name>

    Replace <MWAA Airflow Dag Bucket Name> with the name of the S3 bucket that your MWAA environment uses for DAGs.

    Replace <Demo Data Bucket Name> with the name of any bucket you want to use for the demo data.

    Modify the stack-name or bucket parameters as needed. Wait for the stack to complete.

  4. Wait for the script to complete. You should see output similar to the following:

    Waiting for stack update to complete ...
    Finished create/update successfully!
    upload: ./ to s3://mwaa-dl-demo-us-east-1/scripts/glue_jobs/movielens/
    upload: transform/ to s3://mwaa-dl-demo-us-east-1/scripts/
    upload: transform/ to s3://mwaa-dl-demo-us-east-1/scripts/
    upload: transform/ to s3://mwaa-dl-demo-us-east-1/scripts/

Post Installation Checks

  1. Verify the resources created by the CloudFormation template.
  2. Verify that the Amazon MWAA execution role has the additional policy attached.
  3. The deploy script creates a Glue database and two crawlers. If you have Lake Formation enabled, make sure to grant the crawlers Lake Formation permissions on the database.
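For the Lake Formation check, a grant could look roughly like the sketch below. It assumes boto3 is installed and credentials are configured. The function names, the crawler role ARN you pass in, and the chosen permission set are assumptions for illustration, not something the deploy script defines.

```python
def build_crawler_grant(crawler_role_arn: str, database_name: str) -> dict:
    """Build keyword arguments for Lake Formation's GrantPermissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": crawler_role_arn},
        "Resource": {"Database": {"Name": database_name}},
        # Assumed permission set; narrow or widen it to match your setup.
        "Permissions": ["CREATE_TABLE", "ALTER", "DESCRIBE"],
    }


def grant_to_crawler(crawler_role_arn: str, database_name: str) -> None:
    """Apply the grant. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("lakeformation")
    client.grant_permissions(**build_crawler_grant(crawler_role_arn, database_name))
```

Calling `grant_to_crawler("<crawler role ARN>", "mwaa-movielens-demo-db")` would apply the grant to the demo database; look up the crawler role ARN in your own account first.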

AWS resources

The following stacks are created by the process above:

  1. mwaa-demo-foundations - Contains the foundational resources and services
    • Glue Database - mwaa-movielens-demo-db
    • Glue Crawlers - Crawlers to catalog the data.
    • Lambda Functions - To invoke Glue jobs and check status from Step Functions
    • LambdaRole - Lambda role for Step1 and Step2
    • SSM Parameters - SSM parameters for resources to be used by all services.
    • Step Functions - MovieLens state machine
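One way to confirm that the mwaa-demo-foundations stack finished is to read its status through the CloudFormation API. This is a minimal sketch: the helper name `stack_succeeded` is hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
# Terminal states that indicate a successful create or update.
SUCCESS_STATES = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}


def stack_succeeded(cfn_client, stack_name: str = "mwaa-demo-foundations") -> bool:
    """Return True if the stack's current status is a successful terminal state."""
    response = cfn_client.describe_stacks(StackName=stack_name)
    status = response["Stacks"][0]["StackStatus"]
    return status in SUCCESS_STATES
```

With boto3 installed and credentials configured, `stack_succeeded(boto3.client("cloudformation"))` would query the real stack. If you changed the stack name during deployment, pass that name instead.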

AWS resources created by a DAG run:

  1. EMR Cluster
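Because an EMR cluster is created on each DAG run and incurs cost while it is up, it can be useful to see which clusters are still active. The sketch below assumes a boto3 EMR client is passed in. The helper name `active_clusters` is hypothetical, and only the first page of results is read.

```python
# EMR cluster states that still consume resources.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def active_clusters(emr_client) -> list:
    """Return (id, name, state) tuples for clusters that are not terminated."""
    # Reads only the first page; follow the Marker field for long listings.
    response = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)
    return [
        (cluster["Id"], cluster["Name"], cluster["Status"]["State"])
        for cluster in response.get("Clusters", [])
    ]
```

For example, `active_clusters(boto3.client("emr"))` lists clusters in the client's configured region.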


