
POCing the MOOC stack


IMDB Dataset

The raw dataset from Kaggle is pre-processed into a more lightweight version using references/imdb_dataset/prepare_from_raw.py. Before running the script, update its constants to match the path where you stored the raw dataset. By default, it expects a data folder located alongside the script.
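
As a minimal sketch of what "a data folder located alongside the script" can look like in code, the snippet below resolves the raw dataset path relative to the script's own location and fails early if the file is missing. The names SCRIPT_DIR, DATA_DIR, RAW_FILENAME and resolve_raw_path, as well as the file name, are hypothetical and not taken from prepare_from_raw.py.

```python
from pathlib import Path

# Hypothetical constants: the real names in prepare_from_raw.py may differ.
# Fall back to the current directory when __file__ is not defined (e.g. in a REPL).
SCRIPT_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()
DATA_DIR = SCRIPT_DIR / "data"      # default: a data folder next to the script
RAW_FILENAME = "raw_dataset.csv"    # placeholder file name (assumption)


def resolve_raw_path(data_dir: Path = DATA_DIR, filename: str = RAW_FILENAME) -> Path:
    """Return the expected raw dataset path, or raise if the file is missing."""
    path = data_dir / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Raw dataset not found at {path}; update the path constants."
        )
    return path
```

Failing before any processing starts makes a wrong path obvious, rather than surfacing later as an empty or partial output.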

Steps followed in this POC

  • Configure an AWS CLI profile:
aws configure --profile my_aws_account
  • Create an SSH key in .pem format:
ssh-keygen -m PEM
  • Install Terraform with tfswitch
  • Add Terraform files to .gitignore
  • Deploy the Terraform config (input your public SSH key and current IP)

  • Connect to the instance using the EC2 public IP shown in the Terraform output:

ssh -i <path to your .pem key> ec2-user@<public_ip>
  • Install prerequisites with the following script:
curl -sSf 'https://raw.githubusercontent.com/datamindedbe/academy_mooc_poc/main/references/airflow/pre-setup.sh' | bash -
  • Log out and log back in so that the new permissions take effect
  • Run docker info to make sure you can run Docker commands without sudo, then run docker-compose --version to check that Docker Compose is installed
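
The last check in the list above can also be scripted. The following is a rough sketch, not part of the repository, that runs both commands and reports whether each exits successfully. The helper name command_works is hypothetical.

```python
import shutil
import subprocess
from typing import Sequence


def command_works(args: Sequence[str], timeout: float = 30.0) -> bool:
    """Return True if the command exists on PATH and exits with status 0."""
    if shutil.which(args[0]) is None:
        return False
    try:
        result = subprocess.run(
            list(args), capture_output=True, timeout=timeout, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Run as the regular user (no sudo) to confirm the permissions are active.
    for cmd in (["docker", "info"], ["docker-compose", "--version"]):
        status = "ok" if command_works(cmd) else "FAILED"
        print(f"{' '.join(cmd)}: {status}")
```

If docker info fails while sudo docker info succeeds, the most likely cause is that the session was not restarted after the prerequisites script ran.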

Set up Airflow:

  • Run the Airflow stack:
docker-compose up airflow-init && docker-compose up
  • Wait a few minutes, then open <EC2 public IP>:8080 in a browser. Log in with airflow as both the username and the password.
  • Trigger the smoke_test_dag to make sure everything has been set up correctly.
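
Triggering the DAG can also be done without the web UI. The sketch below assumes an Airflow 2.x deployment with the stable REST API reachable under /api/v1 and basic authentication enabled. Neither assumption is confirmed by this README, so check the docker-compose configuration first. The function name build_trigger_request is hypothetical.

```python
import base64
import json
import sys
import urllib.request


def build_trigger_request(
    base_url: str, dag_id: str, user: str, password: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request asking Airflow to start a DAG run."""
    # Assumed endpoint: Airflow 2.x stable REST API.
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"conf": {}}).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Usage: python trigger_smoke_test.py http://<EC2 public IP>:8080
    request = build_trigger_request(sys.argv[1], "smoke_test_dag", "airflow", "airflow")
    with urllib.request.urlopen(request, timeout=30) as response:
        print(response.status)
```

Keeping request construction separate from sending makes it possible to inspect the URL and headers before anything reaches the server.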