cal-itp/data-infra

Schedule weekly full refresh for non-RT data

Closed · 1 comment

While re-running recent data after #2562, I found that a few DAG parsing tasks had been re-run in early April with code changes that altered the parse job's outputs, but no full refresh had been run afterwards. As a result, the parsed data in GCS was broken and had diverged from the associated mart table.

This led @atvaccaro and me to discuss whether we should run full refreshes of schedule data more often, so that these kinds of gaps are less likely to emerge and fester silently. We change code that affects incremental schedule tables often enough that a weekly full refresh (perhaps on Sunday morning) would probably be beneficial. It would cover schedule data only; RT would be excluded for cost reasons.

AC:

  • Sunday morning dbt run is a full refresh (excluding GTFS RT data) instead of just a normal run
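One possible shape for this, as a rough sketch rather than a description of how our DAG is actually wired: a small helper that picks the dbt invocations for a given run date. `--full-refresh`, `--select`, and `--exclude` are standard dbt CLI flags. The `tag:gtfs_rt` selector, the function name, and the choice to still run RT models incrementally on Sundays are assumptions made for illustration.

```python
from datetime import date

# Hypothetical selector for GTFS RT models; the real project may tag or
# organize these models differently.
RT_SELECTOR = "tag:gtfs_rt"


def build_dbt_commands(run_date: date) -> list[list[str]]:
    """Return the dbt invocations (as argv lists) for the given run date."""
    if run_date.weekday() == 6:  # Monday is 0, so Sunday is 6
        return [
            # Rebuild everything except RT from scratch.
            ["dbt", "run", "--full-refresh", "--exclude", RT_SELECTOR],
            # Assumption: RT models still get their normal incremental run.
            ["dbt", "run", "--select", RT_SELECTOR],
        ]
    # Every other day: a normal (incremental) run of all models.
    return [["dbt", "run"]]


if __name__ == "__main__":
    for cmd in build_dbt_commands(date.today()):
        print(" ".join(cmd))
```

The scheduler would then execute each returned command in order, so on Sundays the full refresh of non-RT models finishes before the incremental RT run starts.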

This would probably still be useful. Leaving open.