cal-itp/data-infra

break apart transform_warehouse DAG to better reflect cadence needs


After Littlepay recently adjusted their publishing cadence to better suit our analytics needs, we found that the new publishing time fell after our transform_warehouse DAG start time, which left that data stale. In #3290, we moved the transform_warehouse DAG start time 4 hours later (from 10:00 to 14:00 UTC) to improve data freshness, but this makes all data transformations happen later in the morning, which is not ideal.

We need to break apart the transform_warehouse DAG so that models that need to be run later in the morning (payments) are run at 14:00, and all of the other models run at the previous time (10:00 UTC).
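As a rough illustration of the split described above, the two cadences could be captured as plain configuration that a DAG factory reads. This is only a sketch: the `transform_warehouse_payments` DAG id and the `payments` dbt tag are hypothetical names, and it assumes the payments models can be isolated with dbt's `--select` / `--exclude` node selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformSchedule:
    dag_id: str      # Airflow DAG id (the payments id is a hypothetical name)
    cron: str        # cron expression, interpreted as UTC
    dbt_args: tuple  # dbt node-selection arguments for this cadence


SCHEDULES = [
    # Everything except payments goes back to the earlier 10:00 UTC start.
    TransformSchedule("transform_warehouse", "0 10 * * *",
                      ("--exclude", "tag:payments")),
    # Payments models wait for the later Littlepay publish and run at 14:00 UTC.
    TransformSchedule("transform_warehouse_payments", "0 14 * * *",
                      ("--select", "tag:payments")),
]


def dbt_command(schedule: TransformSchedule) -> list:
    """Build the dbt invocation that the DAG for this schedule would run."""
    return ["dbt", "run", *schedule.dbt_args]


for s in SCHEDULES:
    print(s.dag_id, s.cron, " ".join(dbt_command(s)))
```

Keeping the selection logic in one place would make it easier to add further cadences later without touching the DAG code itself.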

A notes doc for an initial meeting about this effort is available here, but the project was deprioritized in favor of handoff tasks following that first meeting.

Larger Job overview:
Break up jobs into buckets:

  • GTFS-RT
  • GTFS Static/schedule
  • GTFS quality - run it after everything else
  • Payments (perhaps earliest in the morning)
  • Amplitude

Harder

  • Sequence of dependencies
  • Determine the timings
    -> Need to research, figure out what time consumers need the data by
    E.g. Littlepay -> 6am EST
    Look at when we receive source files each morning (Elavon, etc.)
    Some come from APIs but most are file drops
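One way to approach the timing research above is to collect the observed arrival times of each source's file drops and take the latest time of day plus a safety buffer. The sketch below assumes the arrival timestamps are already available as timezone-aware datetimes; the function name, the 30-minute buffer, and the sample timestamps are illustrative only, not real observations.

```python
from datetime import date, datetime, timedelta, timezone


def suggest_start(arrivals, buffer=timedelta(minutes=30)):
    """Return (hour, minute) in UTC: the latest observed time of day plus a buffer.

    `arrivals` must be timezone-aware datetimes. Wrap-around past midnight is
    not handled in this sketch.
    """
    latest = max(a.astimezone(timezone.utc).time() for a in arrivals)
    # Attach the time of day to an arbitrary date so the buffer can be added.
    start = datetime.combine(date(2000, 1, 1), latest) + buffer
    return start.hour, start.minute


# Made-up sample arrival times for one source over three mornings.
sample = [
    datetime(2024, 1, 1, 13, 5, tzinfo=timezone.utc),
    datetime(2024, 1, 2, 13, 20, tzinfo=timezone.utc),
    datetime(2024, 1, 3, 12, 50, tzinfo=timezone.utc),
]
print(suggest_start(sample))  # (13, 50)
```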

Breaking up of tasks is not a big deal
-> Create new transform models
-> Modify the daily DAG: turn the single whole-warehouse run into multiple transform tasks, which then run sequentially rather than simultaneously
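To make the sequencing concrete, the buckets from the notes can be treated as nodes in a small dependency graph and ordered with a topological sort. This is a sketch under assumptions: the bucket identifiers are hypothetical, and the only ordering rule taken from the notes is that GTFS quality runs after everything else. The GTFS-RT dependency on GTFS schedule is a guess for illustration.

```python
from graphlib import TopologicalSorter

# bucket -> set of buckets that must finish first (names are hypothetical)
DEPENDENCIES = {
    "gtfs_schedule": set(),
    "gtfs_rt": {"gtfs_schedule"},  # assumed dependency, not confirmed
    "amplitude": set(),
    "payments": set(),
}
# From the notes: GTFS quality runs after every other bucket.
DEPENDENCIES["gtfs_quality"] = set(DEPENDENCIES)

# static_order() yields each bucket only after all of its predecessors.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(order)
```

The resulting list is one valid sequential order; each entry could become its own transform task chained to the next.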