Introduction

This workshop is for data engineers, data scientists, and data analysts whose work includes developing and managing ETL (extract, transform, and load) processes. Participants should have some knowledge of shell scripting, ETL, Python, and data management.

The workshop takes about 3-5 hours to complete end to end. After the workshop, you will have a high-level understanding of AWS Glue, AWS Step Functions, and Amazon MWAA (Amazon Managed Workflows for Apache Airflow). You will also understand the capabilities and target use cases of each service, and the development process involved in using it.

The workshop contains the following sections:

  • How to Start?: Set up an AWS environment for the workshop.

  • Lab 01: An introduction to Apache Spark. You will learn how to use PySpark and Glue-flavored PySpark to develop AWS Glue ETL code, and how to use third-party Python libraries in Glue.

  • Lab 02: An introduction to AWS Step Functions, a low-code visual workflow service used to orchestrate AWS services. You will also learn how to create a simple event-driven data processing pipeline.

  • Lab 03: An introduction to Apache Airflow. You will learn the basics by answering a few questions: what Airflow is, why it is needed, and what its key concepts and components are.