When an input file in .csv or .json format is uploaded to the input S3 bucket, it triggers a Lambda function running as a Docker container that reads the file into a DataFrame. The function uses the awswrangler (AWS SDK for pandas) Python library to transform the data, convert it to Parquet format, and write the output Parquet file to an output S3 bucket. It then triggers a Glue crawler to update the Glue Data Catalog with the metadata. Athena tables are created on top of the processed S3 files so that users can run analytical queries on the dataset.
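For orientation, here is a minimal sketch of what a Lambda handler following this flow could look like. It assumes the output bucket and crawler name arrive via hypothetical `OUTPUT_BUCKET` and `CRAWLER_NAME` environment variables, and that JSON inputs are in JSON Lines format; this is illustrative, not the project's actual code.

```python
import os

import awswrangler as wr
import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Locate the uploaded object from the S3 event notification
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    path = f"s3://{bucket}/{key}"

    # Read the raw file into a pandas DataFrame
    if key.endswith(".csv"):
        df = wr.s3.read_csv(path)
    else:
        # Assumption: JSON inputs are newline-delimited (JSON Lines)
        df = wr.s3.read_json(path, lines=True)

    # ... apply any transformations to df here ...

    # Write the result as Parquet to the output bucket
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{os.environ['OUTPUT_BUCKET']}/processed/",
        dataset=True,
    )

    # Start the Glue crawler so the Data Catalog picks up the new metadata
    glue.start_crawler(Name=os.environ["CRAWLER_NAME"])
```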
- AWS CLI
- Valid AWS access credentials
- Python
- Docker
- Clone the project repository.
- Navigate to the project directory.
- pip install -r requirements.txt
- cdk bootstrap
- cdk synth
- cdk deploy
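For context, `cdk deploy` stands up a stack along these lines. This is a hedged sketch under CDK v2; the construct names, Docker image path, and the omitted Glue crawler wiring are illustrative, not the project's actual code.

```python
import aws_cdk as cdk
from aws_cdk import (
    Stack,
    aws_lambda as _lambda,
    aws_s3 as s3,
    aws_s3_notifications as s3n,
)
from constructs import Construct

class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        input_bucket = s3.Bucket(self, "InputBucket")
        output_bucket = s3.Bucket(self, "OutputBucket")

        # The transformation Lambda, packaged as a Docker image
        # (image directory "lambda" is a placeholder)
        fn = _lambda.DockerImageFunction(
            self,
            "TransformFn",
            code=_lambda.DockerImageCode.from_image_asset("lambda"),
            environment={"OUTPUT_BUCKET": output_bucket.bucket_name},
        )
        input_bucket.grant_read(fn)
        output_bucket.grant_write(fn)

        # Invoke the function whenever a new object lands in the input bucket
        input_bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED, s3n.LambdaDestination(fn)
        )

app = cdk.App()
PipelineStack(app, "PipelineStack")
app.synth()
```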
- Use `python scripts/upload_to_s3.py` to upload your raw data to the input S3 bucket.
- Wait around 2 minutes for the pipeline to process the file.
- Use `python scripts/run_athena_query.py` to query the processed data in the output S3 bucket through Athena. Feel free to modify the SQL query in `run_athena_query.py` to suit your needs. (Hedged sketches of both scripts follow this list.)
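A minimal sketch of what `scripts/upload_to_s3.py` likely boils down to; the bucket name, local file path, and object key are placeholders, not the project's actual values:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local raw file to the input bucket (placeholder names)
s3.upload_file("data/sample.csv", "my-input-bucket", "raw/sample.csv")
print("Uploaded data/sample.csv to the input bucket")
```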
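And a minimal sketch of what `scripts/run_athena_query.py` might do, using awswrangler's Athena integration; the database and table names are placeholders:

```python
import awswrangler as wr

# Run a query against the Glue catalog table and load the result
# into a DataFrame (edit the SQL to suit your needs)
df = wr.athena.read_sql_query(
    sql="SELECT * FROM processed_data LIMIT 10",  # placeholder table
    database="my_glue_database",  # placeholder database
)
print(df)
```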
This will remove all resources created by this project from your AWS account:
- cdk destroy
This project is licensed under the MIT License. See the LICENSE file for details.