Create a github action that processes a partition of data and loads to GCP
bendnorman opened this issue · 0 comments
bendnorman commented
Create a GitHub Action that processes a partition of data and loads the new data to a Cloud SQL instance or a BigQuery dataset.
There are a couple of parts to this:
- Create a GCP resource to store the data. A Postgres Cloud SQL instance gives us all the benefits of a relational database but is more expensive than BigQuery ($10 - $60 a month depending on the CPU). There are various Cloud SQL pricing options.
- Create a Cloud SQL dagster resource for the GitHub Action to use (a sketch follows this list).
- Create a GitHub Action that runs the datasette ETL for a single partition. Ideally, we would deploy dagster on GCP using GCE or Kubernetes, but we could also run the ETL in GitHub Actions. I think deploying on Kubernetes is the best way to go, but it's a bit daunting: we'd need to deploy a couple of long-running services (dagit, the daemon, and a database for dagster) plus run launchers that are responsible for allocating compute resources for jobs.
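
For the Cloud SQL resource, here's a minimal sketch of what I'm imagining (the resource name and env var names are placeholders, not existing code; assumes SQLAlchemy plus a Postgres driver are installed):

```python
import os

from dagster import resource
from sqlalchemy import create_engine


@resource
def cloud_sql_engine(init_context):
    """Hypothetical resource that hands ops a SQLAlchemy engine for a
    Postgres Cloud SQL instance; connection info comes from env vars."""
    return create_engine(
        "postgresql://{user}:{password}@{host}:5432/{db}".format(
            user=os.environ["CLOUD_SQL_USER"],  # placeholder env var names
            password=os.environ["CLOUD_SQL_PASSWORD"],
            host=os.environ["CLOUD_SQL_HOST"],  # e.g. the instance's IP
            db=os.environ["CLOUD_SQL_DB"],
        )
    )
```

Ops that load data would then declare the resource in `required_resource_keys` and grab the engine from `context.resources`.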
For the sake of time, I think I'm going to start with just running dagster commands in a GitHub Action. Example command:
```bash
dagster job execute -m usage_metrics -c weekly_run.yml
```

where `weekly_run.yml` looks something like this:
```yaml
ops:
  extract:
    config:
      end_date: "2022-02-06"
      start_date: "2022-01-30"
```
The start and end dates would have to be determined in bash and piped into the YAML, which is kind of janky:
```bash
CURRENT_DATE=$(date "+%Y-%m-%d")
# escape the inner quotes and match the config block's indentation
echo "      start_date: \"$CURRENT_DATE\"" >> month_run.yml
```
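
Putting the pieces together, here's a rough sketch of what the workflow could look like (the workflow name, cron schedule, secret names, and install step are all assumptions, not settled decisions):

```yaml
name: weekly-usage-metrics  # placeholder name
on:
  schedule:
    - cron: "0 6 * * 1"  # assumed weekly cadence: Mondays at 06:00 UTC

jobs:
  run-etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.10"
      - name: Install the package
        run: pip install .  # assumes usage_metrics is pip-installable
      - name: Build the run config and execute the job
        env:
          # hypothetical secrets for the Cloud SQL resource above
          CLOUD_SQL_USER: ${{ secrets.CLOUD_SQL_USER }}
          CLOUD_SQL_PASSWORD: ${{ secrets.CLOUD_SQL_PASSWORD }}
        run: |
          # GNU date on the ubuntu runner supports relative dates
          START_DATE=$(date -d "-7 days" "+%Y-%m-%d")
          END_DATE=$(date "+%Y-%m-%d")
          cat > weekly_run.yml <<EOF
          ops:
            extract:
              config:
                end_date: "$END_DATE"
                start_date: "$START_DATE"
          EOF
          dagster job execute -m usage_metrics -c weekly_run.yml
```

Writing the run config with a heredoc would sidestep the quote-escaping jankiness of the `echo` approach above.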