Build batch processing pipelines for behavior analytics on AWS.
Apache Airflow, AWS (S3, EMR, IAM, Redshift), Docker, and PySpark
- Docker
- PostgreSQL
- AWS Account
- AWS CLI configured (a quick check follows this list)
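To confirm the AWS CLI is configured with working credentials before running any scripts:

# Should print your account ID, user ARN, and user ID if credentials work
aws sts get-caller-identity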
This code runs on Windows 10 (PowerShell 5.1), while the reference implementation [1] runs on Linux/macOS (bash shell).
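If script output appears garbled, note that PowerShell 5.1 does not default to UTF-8; one way to switch the current session to UTF-8, following [3], is:

# Use UTF-8 when exchanging text with external programs in this session (see [3])
$OutputEncoding = [Console]::OutputEncoding = [System.Text.Encoding]::UTF8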
To set up the required infrastructure, run infra_setup.ps1 with a bucket name
as its argument.
Example: .\infra_setup.ps1 name-of-bucket
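As a rough sketch of the kind of AWS CLI calls a setup script like this typically wraps (the region, port, and security group ID below are placeholders, not values taken from infra_setup.ps1):

# Create the S3 bucket for the pipeline data (placeholder region)
aws s3api create-bucket --bucket name-of-bucket --region us-east-1
# Open the default Redshift port so clients can connect (see [4]); sg ID is hypothetical
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 `
    --protocol tcp --port 5439 --cidr 0.0.0.0/0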
Create a Python virtual environment with venv and install the required packages:
# Create the virtual environment
python -m venv bp-env
# Activate it (PowerShell)
.\bp-env\Scripts\Activate.ps1
# Install the required packages
pip install -r requirements.txt
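If PowerShell refuses to run Activate.ps1 because script execution is disabled, relaxing the execution policy for the current session only is a common workaround:

# Allow scripts for this PowerShell session only (no machine-wide change)
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass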
To destroy the infrastructure, run infra_teardown.ps1 with the same bucket name
as its argument.
Example: .\infra_teardown.ps1 name-of-bucket
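A rough sketch of the teardown calls such a script might wrap (again, the security group ID is a placeholder):

# Empty and delete the bucket
aws s3 rb s3://name-of-bucket --force
# Revoke the ingress rule opened during setup (see [4]); sg ID is hypothetical
aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 `
    --protocol tcp --port 5439 --cidr 0.0.0.0/0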
[1] Data Engineering Project for Beginners - Batch edition
[2] Getting started with Amazon Redshift Spectrum
[3] Changing PowerShell's default output encoding to UTF-8
[4] authorize-security-group-ingress
[6] StepConfig