datopian/aircan

Create a multi-node DAG for GCP

hannelita opened this issue · 4 comments

Using the same structure as the multi-node DAG that assumes local files, create a DAG that handles resources on the cloud (CSV and JSON).

Acceptance

  • DAG with multiple nodes doing the entire pipeline on GCP.

Tasks

  • Use aircan from PyPI
    - [ ] Refactor code to reuse existing nodes (moved to #60)

DAG tasks (a sketch of the wiring follows this list):

  1. Read remote CSV (originally: upload the CSV from the CKAN instance to the bucket)
  2. delete_datastore
  3. create_datastore
  4. Create JSON file on bucket
  5. convert_csv_to_json
  6. Send converted JSON file to CKAN
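A minimal sketch of that six-node DAG, assuming Airflow 1.10-style imports. The task callables are empty placeholders (the real node logic lives in aircan), and the `dag_id` is hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def read_remote_csv(**context):
    """1. Fetch the CSV from its remote location (bucket or CKAN URL)."""


def delete_datastore(**context):
    """2. Drop the existing DataStore table for the resource."""


def create_datastore(**context):
    """3. Recreate the DataStore table with the expected schema."""


def create_json_on_bucket(**context):
    """4. Create the JSON file on the GCP bucket."""


def convert_csv_to_json(**context):
    """5. Convert the fetched CSV and write the records into the JSON file."""


def send_json_to_ckan(**context):
    """6. Push the converted JSON to the CKAN DataStore API."""


with DAG(
    dag_id="ckan_gcp_multinode",      # hypothetical name
    start_date=datetime(2020, 7, 1),
    schedule_interval=None,           # triggered by ckanext-aircan, not on a schedule
) as dag:
    tasks = [
        PythonOperator(task_id=fn.__name__, python_callable=fn, provide_context=True)
        for fn in (
            read_remote_csv,
            delete_datastore,
            create_datastore,
            create_json_on_bucket,
            convert_csv_to_json,
            send_json_to_ckan,
        )
    ]

    # Chain the nodes in the order listed above.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```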

ckanext-aircan (connector) tasks:
- [x] 1. Create an endpoint to receive the Airflow response after processing
- [x] 2. Handle the Airflow response
- [ ] 3. If successful, download the processed JSON file from the bucket

NOTE: This is not the strategy we will follow; we will send the processed JSON via the API instead (sketched below).
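A rough sketch of what that connector endpoint could look like, written as plain Flask for brevity rather than the actual CKAN plugin wiring. The route, payload fields, CKAN URL, and API key are all assumptions; `datastore_upsert` is a real CKAN API action, but the call shape here is illustrative only.

```python
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

CKAN_URL = "http://localhost:5000"   # hypothetical CKAN instance
CKAN_API_KEY = "change-me"           # hypothetical API token


@app.route("/aircan/dag-status", methods=["POST"])   # hypothetical route
def dag_status():
    payload = request.get_json(force=True)

    # 1./2. Receive and handle the Airflow response.
    if payload.get("state") != "success":
        return jsonify({"success": False, "error": payload.get("error")}), 400

    # 3. Per the NOTE above: instead of downloading the JSON from the bucket,
    #    the processed records arrive in the callback body and are pushed to
    #    the DataStore through CKAN's API.
    resp = requests.post(
        f"{CKAN_URL}/api/3/action/datastore_upsert",
        headers={"Authorization": CKAN_API_KEY},
        json={
            "resource_id": payload["resource_id"],  # assumed payload key
            "records": payload["records"],          # assumed payload key
            "method": "insert",
            "force": True,
        },
    )
    return jsonify({"success": resp.ok}), (200 if resp.ok else 502)
```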


Analysis

After this long task is complete, we still need to:
- [ ] Handle errors (next milestone; a failure-callback sketch follows this list)
- [ ] Handle the absence of a response from Airflow (next milestone)
- [ ] Delete the remote file (create a separate DAG for that; next milestone)
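For the first two items, one plausible approach (a sketch only; the callback URL and payload are assumptions) is to attach an Airflow `on_failure_callback` so CKAN is notified even when a run dies instead of reporting back on its own.

```python
import requests

AIRCAN_CALLBACK_URL = "http://localhost:5000/aircan/dag-status"   # hypothetical endpoint


def report_failure_to_ckan(context):
    """Failure callback: tell CKAN the run failed instead of leaving it waiting."""
    ti = context["task_instance"]
    requests.post(
        AIRCAN_CALLBACK_URL,
        json={
            "state": "failed",
            "dag_id": ti.dag_id,
            "task_id": ti.task_id,
            "error": str(context.get("exception")),
        },
        timeout=10,
    )


# Applied to every task in the DAG via default_args.
default_args = {
    "on_failure_callback": report_failure_to_ckan,
    "retries": 1,
}
```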

Before anything else, try to specify the remote file locations for the CSV and JSON files (one way to pass them in is sketched below).
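One way to avoid hard-coding those locations (a sketch; the conf keys and bucket paths are made up) is to pass them in when the DAG is triggered and read them from `dag_run.conf` inside each node.

```python
def get_remote_paths(**context):
    """Read the CSV/JSON locations passed in when the DAG run was triggered."""
    conf = context["dag_run"].conf or {}
    csv_path = conf.get("csv_path")     # e.g. "gs://example-bucket/resource.csv"
    json_path = conf.get("json_path")   # e.g. "gs://example-bucket/resource.json"
    return csv_path, json_path


# Triggering with the paths, using the Airflow 1.x CLI:
#   airflow trigger_dag ckan_gcp_multinode \
#       --conf '{"csv_path": "gs://example-bucket/r.csv", "json_path": "gs://example-bucket/r.json"}'
```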

Remote CSV fetching works

Problem with encoding when reading the remote file in the Airflow DAG

Resolved with #53
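For reference, the kind of fix involved (a sketch; #53 may have resolved it differently): decode the remote file with an explicit encoding rather than relying on whatever `requests` guesses from the response headers. The URL is a placeholder.

```python
import csv
import io

import requests


def fetch_remote_csv(url, encoding="utf-8"):
    """Download a remote CSV and parse it, decoding with an explicit encoding."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    text = response.content.decode(encoding, errors="replace")
    return list(csv.DictReader(io.StringIO(text)))


rows = fetch_remote_csv("https://example.com/resource.csv")
```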