
Create a GitHub Action that processes a partition of data and loads to GCP

bendnorman opened this issue

Create a GitHub Action that processes a partition of data and loads the new data to a Cloud SQL instance or a BigQuery dataset.

There are a couple of parts to this:

  • Create a GCP resource to store the data. A Postgres Cloud SQL instance gives us all the benefits of a relational database but is more expensive than BigQuery ($10 - $60 a month depending on the CPU). There are various Cloud SQL pricing options to choose from.
  • Create a Cloud SQL dagster resource for the GitHub Action to use (see the sketch after this list).
  • Create a GitHub Action that runs the datasette ETL for a single partition. Ideally, we would deploy dagster on GCP using GCE or Kubernetes, but we could also run the ETL in GitHub Actions. I think deploying on Kubernetes is the best way to go, but it's a bit daunting: we would need to deploy a couple of long-running services (dagit, the daemon, a database for dagster) plus run launchers, which are responsible for allocating compute resources for jobs.
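
The Cloud SQL resource could be as simple as a SQLAlchemy engine factory. Here is a minimal sketch, assuming connection details come from environment variables; the variable names and the postgres_engine identifier are placeholders rather than anything that exists in the repo yet:

import os

import sqlalchemy as sa
from dagster import resource


@resource
def postgres_engine(init_context):
    """Build a SQLAlchemy engine for the Cloud SQL Postgres instance."""
    user = os.environ["POSTGRES_USER"]
    password = os.environ["POSTGRES_PASSWORD"]
    host = os.environ["POSTGRES_HOST"]  # e.g. the instance's public IP or proxy address
    db = os.environ["POSTGRES_DB"]
    return sa.create_engine(f"postgresql://{user}:{password}@{host}:5432/{db}")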

For the sake of time, I think I'm going to start by just running dagster commands in a GitHub Action. Example command:

dagster job execute -m usage_metrics -c weekly_run.yml

where weekly_run.yml looks something like this:

ops:
  extract:
    config:
      end_date: "2022-02-06"
      start_date: "2022-01-30"
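
For reference, that YAML targets an op config schema roughly like the one below. This is only a sketch of the shape of the config, not the actual extract op in usage_metrics:

from dagster import op


@op(config_schema={"start_date": str, "end_date": str})
def extract(context):
    """Pull the raw logs for the configured date range."""
    start = context.op_config["start_date"]
    end = context.op_config["end_date"]
    context.log.info(f"Extracting usage logs from {start} to {end}")
    # ... download and parse the logs for this partition ...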

The start and end dates would have to be determined in bash and piped into the YAML, which is kind of janky:

CURRENT_DATE=$(date "+%Y-%m-%d")
echo "      start_date: \"$CURRENT_DATE\"" >> weekly_run.yml