This is an example of a big data PySpark pipeline deployment.
-
Build the Docker Compose images
docker-compose build
-
Run Docker Compose
docker-compose up
-
Open Airflow at localhost:8080, go to Admin > Connections, and create a new connection (a scripted alternative is sketched after this list):
- Conn Id: s3_default
- Conn Type: S3
- Login: your-access-key
- Password: your-secret-key
- Extra: {"region_name":"us-east-1"}
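The same connection can also be created from code instead of the UI. The snippet below is a minimal sketch, assuming it is run inside the Airflow container (the `webserver` service name in the exec command is a guess, adjust it to your compose file) and that the placeholder credentials are replaced with real ones:

```python
# create_s3_connection.py - run inside the Airflow container, e.g.:
#   docker-compose exec webserver python create_s3_connection.py
from airflow import settings
from airflow.models import Connection

# Same values as in the Admin > Connections form above.
conn = Connection(
    conn_id="s3_default",
    conn_type="s3",
    login="your-access-key",        # AWS access key id
    password="your-secret-key",     # AWS secret access key
    extra='{"region_name": "us-east-1"}',
)

session = settings.Session()
# Only add the connection if it does not already exist.
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
session.close()
```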
-
Assuming you've already spun up your EMR cluster, go to the 'titanic_training_emr' DAG, switch it on, and trigger a run (a sketch of the kind of EMR step submission such a DAG performs follows below)
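For reference, a DAG like 'titanic_training_emr' typically adds a spark-submit step to the running EMR cluster and waits for it to complete. The sketch below only illustrates that pattern with the Amazon provider's EmrAddStepsOperator and EmrStepSensor (provider 3+ import paths); the cluster id, the S3 script path, the DAG id and the connection id are placeholders, not values taken from this repository:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Placeholder values -- replace with your own cluster id and script location.
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SPARK_STEP = [
    {
        "Name": "titanic_training",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://your-bucket/scripts/titanic_training.py"],
        },
    }
]

with DAG(
    dag_id="titanic_training_emr_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered manually from the UI
    catchup=False,
) as dag:
    # Submit the Spark step to the already-running EMR cluster.
    add_step = EmrAddStepsOperator(
        task_id="add_training_step",
        job_flow_id=CLUSTER_ID,
        aws_conn_id="s3_default",  # the connection created above
        steps=SPARK_STEP,
    )

    # Block until the submitted step has finished on the cluster.
    watch_step = EmrStepSensor(
        task_id="watch_training_step",
        job_flow_id=CLUSTER_ID,
        step_id="{{ task_instance.xcom_pull(task_ids='add_training_step', key='return_value')[0] }}",
        aws_conn_id="s3_default",
    )

    add_step >> watch_step
```

The sensor reads the step id that EmrAddStepsOperator pushes to XCom, which is the usual way to make the DAG wait for the Spark job to finish on EMR.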
-
Then, go to the 'titanic_prediction_emr' DAG, switch it on, and trigger a run
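If you prefer to unpause and trigger the DAGs without the UI, something along these lines works against Airflow 2's stable REST API; the basic-auth setup and the default admin credentials (airflow/airflow) are assumptions, adjust them to your installation:

```python
import requests

BASE_URL = "http://localhost:8080/api/v1"
AUTH = ("airflow", "airflow")  # assumed default credentials; change as needed


def unpause_and_trigger(dag_id: str) -> None:
    """Unpause a DAG and start a new run, mirroring the UI toggle and trigger button."""
    requests.patch(f"{BASE_URL}/dags/{dag_id}", json={"is_paused": False}, auth=AUTH).raise_for_status()
    requests.post(f"{BASE_URL}/dags/{dag_id}/dagRuns", json={}, auth=AUTH).raise_for_status()


# Run training first; trigger prediction only after the training run has finished.
unpause_and_trigger("titanic_training_emr")
# unpause_and_trigger("titanic_prediction_emr")
```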
-
Stop the Docker Compose services and free resources
docker-compose down