DenysGonzaga/glue-athena-cdk-example

A small walkthrough on how to create an AWS Glue job pipeline with AWS CDK

Language: Python. License: MIT.

Simple AWS Glue, Athena and CDK Example

A walkthrough for deploying an example pipeline built from an AWS Glue/Spark job, S3, and Athena.

The infra folder contains a Python/CDK project that creates everything you need to run this example.

The infra/assets folder contains the Glue script and the Chicago crimes data.

How To

Prerequisites

AWS Account
AWS CLI v2
Python 3.8+
CDK version 2.89.0

Architecture Design

Example Setup

Configure your AWS credentials.
Change to the infra folder:
```
$ cd glue-athena-cdk-example/infra
```

Create and activate a virtual environment, then install the dependencies:

```
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip3 install -r requirements.txt
```

Bootstrap CDK (skip this step if you have already bootstrapped the target account and region):
```
$ cdk bootstrap
```
Synthesize and deploy the stack:
```
$ cdk synth
$ cdk deploy
```

After the deployment finishes, you can start the Glue job. It takes about 2 minutes to run.
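
The job can be started from the Glue console. As an alternative, here is a minimal sketch that starts it from Python and waits for it to finish, assuming boto3 is installed and your credentials are configured. The helper name `run_glue_job` is hypothetical, and this README does not state the job name, so you need to look it up in the Glue console or in the CDK stack.

```python
import time

# Glue run states after which polling should stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue, job_name, poll_seconds=15, sleep=time.sleep):
    """Start a Glue job run and block until it reaches a terminal state.

    `glue` is a boto3 Glue client, e.g. boto3.client("glue").
    `job_name` is the name of the job created by the stack (an assumption:
    it is not given in this README).
    Returns a (run_id, final_state) tuple.
    """
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        # Still starting or running: wait before asking again.
        sleep(poll_seconds)
```

With credentials in place, a call such as `run_glue_job(boto3.client("glue"), "<your-job-name>")` should return once the run has succeeded or failed. Passing the client in as an argument keeps the helper easy to exercise with a stub.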

A table will be created in the default database from the processed Parquet file in the S3 stage bucket. You can query this table using Athena.
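
Queries can be run from the Athena console. For a scripted check, the following is a rough sketch using the boto3 Athena client, assuming boto3 is installed. The helper name `query_athena` is hypothetical, and the table name and the S3 location for query results are not given in this README, so both must be supplied by you.

```python
import time


def query_athena(athena, sql, output_location, database="default",
                 poll_seconds=2, sleep=time.sleep):
    """Run `sql` in Athena and return the first page of rows as lists of strings.

    `athena` is a boto3 Athena client, e.g. boto3.client("athena").
    `output_location` is an S3 URI where Athena may write query results
    (an assumption: choose a bucket you own).
    """
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until Athena reports a final state for the query.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Athena query ended in state {status['State']}: {reason}")

    result = athena.get_query_results(QueryExecutionId=query_id)
    # For SELECT queries the first returned row holds the column names.
    return [[cell.get("VarCharValue") for cell in row["Data"]]
            for row in result["ResultSet"]["Rows"]]
```

A call might look like `query_athena(boto3.client("athena"), "SELECT * FROM <your_table> LIMIT 10", "s3://<your-results-bucket>/")`, where both placeholders stand for names from your own deployment.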

Resources

More information about CDK setup.
If you don't know how to configure the AWS CLI, check here.
The 2023 Chicago crime dataset was downloaded from here.