Create a multi-node DAG for GCP
hannelita opened this issue · 4 comments
Using the same structure as the multi-node DAG that assumes local files, create a DAG that handles resources in the cloud (CSV and JSON).
Acceptance
- A DAG with multiple nodes running the entire pipeline on GCP.
Tasks
- [ ] Use aircan from PyPI
- [ ] Refactor code to reuse existing nodes (moved to #60)
DAG tasks (see the sketch after this list):
- 1. ~~Upload CSV from CKAN instance to bucket~~ Read remote CSV
- 2. delete_datastore
- 3. create_datastore
- 4. Create JSON file on bucket
- 5. convert_csv_to_json
- 6. Send converted JSON file to CKAN
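A minimal sketch of how these six nodes might be chained in Airflow. The task bodies below are placeholders standing in for the real aircan node functions, and the DAG id, owner, and scheduling settings are assumptions, not the actual implementation.

```python
# Sketch only: task bodies are placeholders for the real aircan node
# functions; bucket/CKAN details are illustrative, not the actual config.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10 import path


def read_remote_csv(**_):
    """Fetch the source CSV from its remote location (e.g. a GCS bucket)."""


def delete_datastore(**_):
    """Drop the existing DataStore table for the resource via the CKAN API."""


def create_datastore(**_):
    """Recreate the DataStore table with the inferred schema."""


def create_json_on_bucket(**_):
    """Create the JSON object on the bucket."""


def convert_csv_to_json(**_):
    """Convert the fetched CSV into the JSON expected by the DataStore."""


def send_json_to_ckan(**_):
    """Push the converted JSON to the CKAN instance."""


default_args = {"owner": "airflow", "start_date": datetime(2020, 1, 1)}

with DAG(
    dag_id="ckan_gcp_pipeline",
    default_args=default_args,
    schedule_interval=None,  # triggered on demand, e.g. from the connector
) as dag:
    tasks = [
        PythonOperator(task_id=name, python_callable=fn, provide_context=True)
        for name, fn in [
            ("read_remote_csv", read_remote_csv),
            ("delete_datastore", delete_datastore),
            ("create_datastore", create_datastore),
            ("create_json_on_bucket", create_json_on_bucket),
            ("convert_csv_to_json", convert_csv_to_json),
            ("send_json_to_ckan", send_json_to_ckan),
        ]
    ]

    # Chain the nodes in the order listed above.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```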
ckanext-aircan (connector) tasks (see the sketch after the note below):
- [x] 1. Create endpoint to receive Airflow response after processing
- [x] 2. Handle Airflow response
- [ ] 3. If successful, download the processed JSON file from the bucket
NOTE: This is not the strategy we will follow; we will send the processed JSON via the API instead.
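For connector tasks 1–2, a rough sketch of what the endpoint could look like, assuming a CKAN 2.8+ plugin exposing a Flask blueprint through IBlueprint. The route, payload fields, and success handling here are guesses, not the actual ckanext-aircan code.

```python
# Sketch only: route, payload shape, and handling are assumptions, not the
# actual ckanext-aircan implementation.
from flask import Blueprint, jsonify, request

import ckan.plugins as plugins

aircan_blueprint = Blueprint("aircan", __name__)


@aircan_blueprint.route("/aircan/dag-status", methods=["POST"])
def dag_status():
    """Receive Airflow's callback after a DAG run finishes (task 1)."""
    payload = request.get_json(force=True) or {}

    # Task 2: handle the Airflow response. Per the note above, on success the
    # processed JSON is expected to arrive via the API rather than being
    # downloaded from the bucket.
    if payload.get("state") == "success":
        records = payload.get("records", [])
        # e.g. upsert `records` into the DataStore here
        return jsonify({"success": True, "received": len(records)})

    return jsonify({"success": False, "error": payload.get("error")}), 400


class AircanConnectorPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IBlueprint)

    def get_blueprint(self):
        return aircan_blueprint
```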
Analysis
After this long task is complete, we still need to:
- [ ] Handle errors (next milestone; a possible approach is sketched after this list)
- [ ] Handle absence of a response from Airflow (next milestone)
- [ ] Delete the remote file (create a separate DAG for that; next milestone)
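Even though these items are deferred, the first two could plausibly be covered with standard Airflow settings (retries plus an on_failure_callback); the numbers and the callback body below are illustrative only, not a committed design.

```python
# Sketch only: retry counts and the callback body are illustrative.
from datetime import datetime, timedelta

from airflow import DAG


def notify_ckan_of_failure(context):
    """on_failure_callback: report the failed task back to the connector."""
    task_id = context["task_instance"].task_id
    print("Task %s failed; CKAN should be notified here." % task_id)


default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 1, 1),
    "retries": 2,                         # retry transient GCP/CKAN errors
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_ckan_of_failure,
}

dag = DAG(
    dag_id="ckan_gcp_pipeline_with_error_handling",
    default_args=default_args,
    schedule_interval=None,
)
```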
Before anything else, try to specify the remote file locations for the CSV and JSON.
Remote CSV fetching works
Problem with encoding when reading the remote file in the Airflow DAG.
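One likely culprit is that the remote object comes back as raw bytes and gets decoded with an implicit (wrong) codec. A sketch of fetching the object from GCS with an explicit encoding, assuming the google-cloud-storage client and placeholder bucket/object names:

```python
# Sketch only: bucket/object names are placeholders; assumes the source CSV
# is UTF-8 (swap the codec if the upstream file uses something else).
import csv
import io

from google.cloud import storage


def read_remote_csv(bucket_name, object_name, encoding="utf-8"):
    """Download a CSV from GCS and decode it explicitly before parsing."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)

    # download_as_bytes() returns raw bytes; decoding explicitly avoids the
    # implicit-encoding problems seen when the DAG reads the remote file.
    raw = blob.download_as_bytes()
    text = raw.decode(encoding, errors="replace")

    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    rows = read_remote_csv("my-aircan-bucket", "resources/example.csv")
    print(len(rows), "rows read")
```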