break apart transform_warehouse DAG to better reflect cadence needs
After Littlepay recently adjusted their publishing cadence to better suit our analytics needs, we found that the new publishing time fell after the `transform_warehouse` DAG start time, which made the data stale. In #3290, we moved the `transform_warehouse` DAG start time 4 hours later (from 10:00 to 14:00 UTC) to improve data freshness, but this makes all data transformations happen later in the morning, which is not ideal.
We need to break apart the `transform_warehouse` DAG so that models that need to run later in the morning (payments) run at 14:00 UTC, and all other models run at the previous time (10:00 UTC).
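A minimal sketch of the split described above, written in plain Python rather than as Airflow code. The two cron expressions correspond to the 10:00 and 14:00 UTC start times; the `TransformSchedule` type, the `transform_warehouse_payments` name, and the bucket names are hypothetical and do not reflect the actual warehouse layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformSchedule:
    """Hypothetical description of one piece of the split-up DAG."""
    name: str
    cron_utc: str   # standard 5-field cron expression, interpreted in UTC
    buckets: tuple  # names of the model buckets this schedule runs


# Assumed names; only payments is identified in this issue as needing the later run.
SCHEDULES = [
    TransformSchedule("transform_warehouse", "0 10 * * *", ("everything_else",)),
    TransformSchedule("transform_warehouse_payments", "0 14 * * *", ("payments",)),
]


def schedule_for(bucket: str) -> TransformSchedule:
    """Return the schedule that owns a given bucket."""
    for schedule in SCHEDULES:
        if bucket in schedule.buckets:
            return schedule
    raise KeyError(f"no schedule defined for bucket {bucket!r}")


def start_hour_utc(schedule: TransformSchedule) -> int:
    """Extract the hour field from the cron expression."""
    _minute, hour, *_rest = schedule.cron_utc.split()
    return int(hour)
```

Each entry could back its own DAG definition, so payments freshness no longer dictates when the rest of the warehouse runs.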
A notes doc for an initial meeting about this effort is available here, but the project was deprioritized in favor of handoff tasks following that first meeting.
Larger Job overview:
Break up jobs into buckets:
- GTFS-RT
- GTFS Static/schedule
- GTFS quality - run it after everything else
- Payments (perhaps earliest in the morning)
- Amplitude
Harder
- Sequence of dependencies
- Determine the timings
-> Need to research and figure out the times by which consumers need the data
E.g. Littlepay -> 6am EST
Look at when source files arrive each morning (Elavon, etc.)
Some come from APIs, but most are file drops
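
Since the DAG start times in this issue are given in UTC, consumer deadlines noted in EST need converting before they can be compared. A small sketch, assuming "EST" is meant literally as UTC-5 (if the deadline follows US Eastern local time with daylight saving, the result shifts by an hour for part of the year); `deadline_in_utc` is a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone

# "EST" is read literally as a fixed UTC-5 offset (assumption).
EST = timezone(timedelta(hours=-5), "EST")


def deadline_in_utc(hour_est: int, minute_est: int = 0) -> str:
    """Convert a wall-clock EST deadline to an HH:MM string in UTC."""
    # The date is arbitrary; with a fixed offset only the time of day matters.
    local = datetime(2024, 1, 15, hour_est, minute_est, tzinfo=EST)
    return local.astimezone(timezone.utc).strftime("%H:%M")


print(deadline_in_utc(6))  # 11:00
```

Under that reading, a 6am EST deadline corresponds to 11:00 UTC.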
Breaking up the tasks is not a big deal
-> Create new transform models
Modify the daily DAG as a whole: turn the single run into multiple transform tasks -> then run them sequentially rather than simultaneously
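
One way to reason about the "sequential rather than simultaneous" ordering is a topological sort over the buckets. The sketch below uses Python's standard `graphlib`. The bucket identifiers are hypothetical, and the dependency edges are assumptions for illustration, apart from GTFS quality running after everything else, which comes from the notes above.

```python
from graphlib import TopologicalSorter

# Each bucket maps to the set of buckets that must finish before it starts.
# Only the gtfs_quality rule comes from the notes; the gtfs_rt -> gtfs_schedule
# edge is an assumed example of a "sequence of dependencies".
DEPENDENCIES = {
    "payments": set(),
    "gtfs_schedule": set(),
    "gtfs_rt": {"gtfs_schedule"},
    "amplitude": set(),
    "gtfs_quality": {"payments", "gtfs_schedule", "gtfs_rt", "amplitude"},
}


def sequential_order(dependencies):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


order = sequential_order(DEPENDENCIES)
print(order)  # gtfs_quality is always last under these assumed edges
```

Any order it returns respects the declared edges, so the same mapping could drive how transform tasks are chained once the real dependencies are known.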