A walkthrough for deploying an example pipeline built on an AWS Glue/Spark job, S3, and Athena.
The infra folder has a Python/CDK project that creates everything you need to run this example.
The infra/assets folder has the Glue script and the Chicago Crimes data.
- AWS Account
- AWS CLI v2
- Python 3.8+
- CDK version 2.89.0
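The requirements above can be partly checked with a short script before you start. This is a minimal sketch, not part of the project: it assumes the AWS CLI is installed as `aws` and the CDK CLI as `cdk`, and it only checks that they are on your PATH, not which versions they are. The function name `missing_requirements` is made up for illustration.

```python
import shutil
import sys

# Executable names are assumptions about a typical install:
# the AWS CLI as "aws" and the CDK CLI as "cdk".
REQUIRED_TOOLS = ("aws", "cdk")
MIN_PYTHON = (3, 8)


def missing_requirements(which=shutil.which, python_version=sys.version_info):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Compare only (major, minor) against the minimum Python version.
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.8+ is required")
    # A tool counts as present if it can be resolved on PATH.
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```

The script prints nothing when all checks pass. The `which` and `python_version` parameters exist only so the check can be exercised without touching the real environment.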
- Configure your AWS Credentials.
- Browse to the infra folder:
$ cd glue-athena-cdk-example/glue
- Create and activate a virtual env:
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip3 install -r requirements.txt
- Bootstrap CDK (optional if you have already done it):
$ cdk bootstrap
- Synthesize and deploy the stack:
$ cdk synth
$ cdk deploy
After the deploy, you can start the Glue job; it takes about 2 minutes to run.
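If you prefer to start the job from code rather than the console, the following is a minimal sketch using boto3's Glue client (`start_job_run` and `get_job_run`). It is not part of this project. The job name `my-glue-job` is a placeholder, since the name depends on what the stack created, and the helper `run_glue_job` is made up for illustration.

```python
import time

# Job run states after which Glue will not change the run any further.
FINAL_STATES = ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED")


def run_glue_job(glue, job_name, poll_seconds=15):
    """Start a Glue job run and block until it reaches a final state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in FINAL_STATES:
            return state
        time.sleep(poll_seconds)


if __name__ == "__main__":
    import boto3  # third-party: pip install boto3

    # "my-glue-job" is a placeholder; use the job name created by the stack.
    print(run_glue_job(boto3.client("glue"), "my-glue-job"))
```

The client is passed in as a parameter so that the polling logic can be tried against a stub without AWS access.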
A table is created in the default database from the processed Parquet file in the S3 stage bucket, and you can query it with Athena.
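Querying the table from code could look like the sketch below, which uses boto3's Athena client (`start_query_execution`, `get_query_execution`, `get_query_results`). It is an illustration under assumptions, not part of this project: the table name `my_table` and the results location `s3://my-athena-results/` are placeholders, and the helper `run_athena_query` is hypothetical.

```python
import time


def run_athena_query(athena, sql, output_location, database="default", poll_seconds=2):
    """Run a query, wait for it to finish, and return the first page of rows."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until Athena reports a final state for the query.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query ended in state {status['State']}: {reason}")

    # Only the first page of results is fetched; NULL cells come back as None.
    result = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]


if __name__ == "__main__":
    import boto3  # third-party: pip install boto3

    # Table name and results bucket are placeholders; replace them with yours.
    rows = run_athena_query(
        boto3.client("athena"),
        "SELECT * FROM my_table LIMIT 10",
        "s3://my-athena-results/",
    )
    for row in rows:
        print(row)
```

For a `SELECT` query, the first returned row holds the column names.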