This sample project shows how to develop and test AWS Glue jobs on a local machine, which keeps costs down and gives fast feedback on whether the code still behaves correctly after each change.
We will analyze movie ratings data, calculating a weighted average rating per movie and selecting the top 10 most popular movies.
You can download the input datasets, movies.csv and ratings.csv, from this location.
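A weighted average guards against movies with very few ratings dominating the top-10 list. The exact computation lives in tasks/jobs.py; purely as an illustration, an IMDB-style weighted rating could be computed in PySpark like this (the column names and the vote threshold `m` are assumptions based on the sample CSVs, not this repo's actual values):

```python
from pyspark.sql import functions as F

def weighted_rating(ratings_df, m=500):
    """Illustrative IMDB-style weighted rating; not necessarily the
    formula used by this repo's jobs (see tasks/jobs.py for that)."""
    # C: mean rating across all ratings; m: assumed minimum-votes threshold.
    C = ratings_df.agg(F.avg("rating")).first()[0]
    per_movie = ratings_df.groupBy("movieId").agg(
        F.avg("rating").alias("R"),
        F.count("rating").alias("v"),
    )
    # WR = v/(v+m) * R + m/(v+m) * C
    return per_movie.withColumn(
        "weighted_rating",
        (F.col("v") / (F.col("v") + m)) * F.col("R")
        + (m / (F.col("v") + m)) * C,
    )
```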
It demonstrates:
- Project structure
- Unit testing
- Packaging and versioning of the source code, including dependent libraries
The project contains two Spark jobs that share a generic Spark driver implementation (a sketch of such a driver follows the tree below).
├── driver                  <-- Directory containing the Spark driver file
│   ├── __init__.py
│   └── main.py             <-- Generic Spark driver implementation
├── tasks                   <-- Directory containing the Spark job implementations
│   ├── __init__.py
│   ├── common.py
│   └── jobs.py
├── tests                   <-- Directory containing unit tests and the subset of input data used for testing
│   ├── __init__.py
│   ├── samples
│   │   ├── movies.csv
│   │   └── ratings.csv
│   ├── test_average_job.py <-- Unit test of the movie ratings average calculation
│   └── test_top10_job.py   <-- Unit test of the top 10 movies calculation
├── setup.cfg
├── setup.py
├── pytest.ini              <-- Environment variables used by pytest
├── requirements.txt
├── test-requirements.txt
└── README.md
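A generic driver typically resolves which job to run from a job argument and hands control to the corresponding function in tasks/jobs.py. Below is a minimal sketch of what driver/main.py could look like; the `task` parameter and the dispatch-by-name approach are assumptions for illustration, not the repo's actual implementation:

```python
# Hypothetical sketch of a generic driver (driver/main.py).
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

from tasks import jobs  # job implementations live in tasks/jobs.py


def main():
    # Glue passes job parameters as command-line arguments.
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "task"])
    spark = SparkSession.builder.appName(args["JOB_NAME"]).getOrCreate()

    # Dispatch to the requested job implementation by name,
    # e.g. --task create_and_stage_top_movies.
    task = getattr(jobs, args["task"])
    task(spark)


if __name__ == "__main__":
    main()
```

Keeping the driver generic means a new job only needs a new function in tasks/jobs.py, not a new entry point.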
The idea of a unit test is to validate the correctness of the business logic by running it against pre-generated input data and comparing the output to an expected result.
See tests/test_average_job.py for an example of such a unit test.
To make unit tests possible, the business logic of a Spark job has to be isolated from the read/write operations; you can clearly see this isolation in tasks.jobs.create_and_stage_top_movies, for example.
Testing the read/write operations and the integration with the Glue runtime environment is usually covered by integration tests implemented as part of a CI/CD pipeline.
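A minimal sketch of this style of test, assuming a local SparkSession and a pure transformation function `calculate_average_ratings` in tasks.jobs (the function name and output column here are illustrative, not the repo's actual API):

```python
import pytest
from pyspark.sql import SparkSession

from tasks import jobs  # adjust the import to the real module layout


@pytest.fixture(scope="session")
def spark():
    # A local SparkSession is enough for unit tests; no Glue endpoint needed.
    return SparkSession.builder.master("local[*]").appName("unit-tests").getOrCreate()


def test_average_ratings(spark):
    # Read the pre-generated sample data that ships with the tests.
    movies = spark.read.csv("tests/samples/movies.csv", header=True, inferSchema=True)
    ratings = spark.read.csv("tests/samples/ratings.csv", header=True, inferSchema=True)

    # Exercise only the isolated business logic -- no writes, no Glue APIs.
    result = jobs.calculate_average_ratings(movies, ratings)  # hypothetical function

    # Compare the output to the expected result.
    assert result.count() > 0
    assert "avg_rating" in result.columns  # hypothetical output column
```

Because the function under test neither reads from nor writes to external storage, the test runs entirely on the local machine.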
- Install Python 3.8.2 or later
- Install Java
Follow the instructions in the Glue documentation, Developing Locally with Python:
cd ~
mkdir glue-libs
cd glue-libs
wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
tar -xvzf spark-2.4.3-bin-hadoop2.8.tgz
Check out the glue-1.0 branch of the aws-glue-libs repo:
cd ~/glue-libs
git clone https://github.com/awslabs/aws-glue-libs.git -b glue-1.0
cd aws-glue-libs
mvn package
Once the Maven build finishes, the jar is created at ~/glue-libs/aws-glue-libs/target/AWSGlueETLPython-1.0.0.jar
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -U -e .
pip install -r test-requirements.txt
Some important configuration options:
PYTHONPATH
: Set to <YOUR HOME DIR>/glue-libs/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python:<YOUR HOME DIR>/glue-libs/aws-glue-libs/target/AWSGlueETLPython-1.0.0.jar:$PYTHONPATH (note the doubled directory name in the extracted Spark distribution)
JAVA_HOME
: Spark currently requires Java 8 (on Linux typically /usr/lib/jvm/java-1.8.0-openjdk-amd64)
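For example, you could export both variables in your shell before running the tests (paths assume the layout created in the steps above; adjust for your machine):
export PYTHONPATH="$HOME/glue-libs/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python:$HOME/glue-libs/aws-glue-libs/target/AWSGlueETLPython-1.0.0.jar:$PYTHONPATH"
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64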
Run the unit tests, excluding any tests marked as integration tests:
pytest -s -m 'not integration'
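For reference, an integration test would carry the marker that the filter above excludes (the marker itself would normally be registered in pytest.ini). An illustrative example, not a test that exists in this repo:

```python
import pytest

# `pytest -m 'not integration'` skips tests carrying this marker;
# a CI/CD pipeline would run them with `-m integration`.
@pytest.mark.integration
def test_glue_job_end_to_end():
    ...  # would exercise real read/write operations and the Glue runtime
```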
Build and package the source code into a wheel:
python setup.py build -vf && python setup.py bdist_wheel
Package the dependencies separately so that the "Referenced files path" can be configured as part of the Glue job deployment:
pip wheel -r requirements.txt -w dist/dependencies
To clean the build artifacts (clean_all is a project-specific command defined in setup.py):
python setup.py clean_all
This library is licensed under the MIT-0 License. See the LICENSE file.