Project 3

Run pipeline

  1. Install dependencies

    pip3 install -r requirement
  2. Start the cluster

    docker-compose up -d
  3. Install Airflow

    export AIRFLOW_HOME=/working/dir/airflow/dags
    ./install-airflow.sh
  4. Run the pipeline

    airflow standalone
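The four steps above can be collected into a single setup script. This is only a sketch: the file name setup.sh is hypothetical, and the paths (`requirement`, `/working/dir/airflow/dags`, `install-airflow.sh`) are taken from the README as-is and should be adjusted to your working directory.

```shell
# Write the setup steps into one script (setup.sh is a hypothetical name).
cat > setup.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# 1. Install dependencies
pip3 install -r requirement
# 2. Start the cluster
docker-compose up -d
# 3. Install Airflow (AIRFLOW_HOME must point at the dags directory; adjust the path)
export AIRFLOW_HOME=/working/dir/airflow/dags
./install-airflow.sh
# 4. Run the pipeline
airflow standalone
EOF
chmod +x setup.sh
# Syntax-check the generated script without executing it
bash -n setup.sh && echo "setup.sh syntax OK"
```

Running `bash -n` first catches quoting or syntax mistakes before any containers are started.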

Crawl data

  1. Change NAME_NODE_ID in /crawler/constants.py to the NameNode container ID.

  2. Run the crawler

    cd crawler
    python3 crawler.py
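Editing constants.py by hand can be scripted. The sketch below uses a placeholder container ID and a sample constants.py; in a real run the ID would come from `docker ps` (e.g. `docker ps -qf "name=namenode"` — the container name "namenode" is an assumption about the compose file, not taken from the source).

```shell
# Placeholder NameNode container ID for illustration only.
NAME_NODE_ID="abc123def456"

# A sample constants.py standing in for crawler/constants.py:
printf 'NAME_NODE_ID = "CHANGE_ME"\n' > constants.py

# Rewrite the NAME_NODE_ID assignment in place with the container ID.
sed -i "s/^NAME_NODE_ID = .*/NAME_NODE_ID = \"$NAME_NODE_ID\"/" constants.py
cat constants.py
```

Note that `sed -i` as written is GNU sed; on macOS/BSD sed the in-place flag needs an explicit suffix argument (`sed -i ''`).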

Run the Spark application

  1. Change containerId in /client/run.sh to the Spark container ID.

  2. Change fileName in /client/run.sh to the name of the application you want to run.

  3. Run the script

    /client/run.sh
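The two edits to run.sh can likewise be scripted. Both values below are placeholders: the Spark container ID would come from `docker ps` in a real run, and `wordcount.py` is only a hypothetical application name; the sample run.sh stands in for client/run.sh, whose real contents are not shown in the source.

```shell
# Placeholder values for illustration only.
SPARK_ID="fedcba987654"     # Spark container ID (normally from `docker ps`)
APP="wordcount.py"          # application file to run

# A minimal stand-in for client/run.sh:
cat > run.sh <<'EOF'
containerId=CHANGE_ME
fileName=CHANGE_ME
EOF

# Rewrite both assignments in place.
sed -i "s/^containerId=.*/containerId=$SPARK_ID/" run.sh
sed -i "s/^fileName=.*/fileName=$APP/" run.sh
cat run.sh
```

Parameterizing the script this way avoids editing run.sh manually each time the cluster is recreated and the container ID changes.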