cal-itp/data-infra

Bug: LA Metro Bus schedule data missing in May 2024

Closed this issue · 11 comments

Describe the bug
LA Metro Bus's schedule feed was valid until 5/8/24. We don't have a current feed for them.

To Reproduce

# find the base64_url for LA Metro
SELECT DISTINCT
  date,
  feed_key,
  gtfs_dataset_name,
  base64_url
 FROM `cal-itp-data-infra.mart_gtfs.fct_daily_schedule_feeds` 
 WHERE gtfs_dataset_name LIKE '%LA Metro%' AND date BETWEEN '2024-05-01' and '2024-05-20' 

# Use that base64_url to check in dim_schedule_feeds to see if there's anything current
SELECT DISTINCT
  base64_url,
  _is_current,
  _dt,
  _valid_from,
  _valid_to
 FROM `cal-itp-data-infra.mart_gtfs.dim_schedule_feeds` 
 WHERE base64_url = "aHR0cHM6Ly9naXRsYWIuY29tL0xBQ01UQS9ndGZzX2J1cy9yYXcvbWFzdGVyL2d0ZnNfYnVzLnppcA=="
 ORDER BY _valid_to DESC

Expected behavior

  • All operators have current feed in place
  • We'd know about this from the expired feeds dashboard, and we'd also know when it's resolved
  • Track this issue here because we're holding open data and other analytics pipeline for May 2024 in place until we plug LA Metro back in (too big of an operator to miss)

LA Metro was accidentally marked as deprecated and removed from the pipeline on 5/3, thus 5/4 onwards it does not show up on the fct_daily_schedule_feds.

Given this notice, on 5/21 it was added back.

I'll do a little research into what it takes to backfill the data from 5/4 to 5/20.

@vevetron: Here's the docs pages:

My notes so far.

I don't know if a full refresh will fix the issue. I'm investigating how our pipeline works. I wanted to post my notes so far before they get lost:

This runbook might be useful:
https://github.com/cal-itp/data-infra/blob/main/runbooks/data/schedule_backfill.ipynb

Lifecycle of a GTFS

Select * FROM cal-itp-data-infra.mart_gtfs.dim_schedule_feeds`
where key = '608992664173210532aa3e6cc573be2f'

The feed key is at dim_schedule_feeds
Valid from:
2024-04-11 03:00:27.705573 UTC
2098-12-31 23:59:59.999999 UTC
true

https://github.com/cal-itp/data-infra/blob/main/airflow/dags/create_external_tables/airtable/external_airtable_california_transit_gtfs_datasets.yml

-> Pulls the data from airtable including field data_quality_pipeline
Pushes it to:
california_transit__gtfs_datasets/*.jsonl.gz"
destination_project_dataset_table: "external_airtable.california_transit__gtfs_datasets"

One idea - download a date - go into these jsonl and copy it, unzip, edit the jsonl for la metro bus and make it good, then zip and upload it again.

table: ClassVar[str] = "california_transit__gtfs_datasets"

-> AirtableGTFSDataExtract(PartitionedGCSArtifact):

generate_gtfs_download_configs.py
https://github.com/cal-itp/data-infra/blob/50a7afd79755c81af90ca945152b83197dbbbd21/airflow/dags/airtable_loader_v2/generate_gtfs_download_configs.py

Will skip records that aren't in the pipeline, and where do they put these records?
configs.jsonl.gz

Saves the configs to:
gs://calitp-gtfs-download-config/gtfs_download_configs/dt=2024-05-23/ts=2024-05-23T02:00:31.036354+00:00/configs.jsonl.gz

airtable_loader_v2 ->
https://console.cloud.google.com/storage/browser/_details/calitp-gtfs-download-config/gtfs_download_configs/dt%3D2024-05-22/ts%3D2024-05-22T02:00:32.694833%2B00:00/configs.jsonl.gz;tab=live_object?project=cal-itp-data-infra

Next idea - download these configs.jsonl.gz, and add la metro back to it, and reupload it.

[2024-05-23, 03:08:46 UTC] {storage.py:263} INFO - saving 1.0 MB to gs://calitp-gtfs-schedule-raw-v2/schedule/dt=2024-05-23/ts=2024-05-23T03:03:58.124882+00:00/base64_url=aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1NQQ==/GTFSTransitData_MA.zip
Okay so download_schedule_feeds puts gtfs into this buckket.

https://github.com/cal-itp/data-infra/blob/50a7afd79755c81af90ca945152b83197dbbbd21/airflow/dags/download_gtfs_schedule_v2/download_schedule_feeds.py
https://console.cloud.google.com/storage/browser/calitp-gtfs-schedule-raw-v2/schedule/dt%3D2024-05-12/ts%3D2024-05-12T03:03:16.418007%2B00:00?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=cal-itp-data-infra&prefix=&forceOnObjectsSortingFiltering=false

# find the base64_url for LA Metro
SELECT DISTINCT
  date,
  feed_key,
  gtfs_dataset_name,
  base64_url
 FROM `cal-itp-data-infra.mart_gtfs.fct_daily_schedule_feeds` 
 WHERE gtfs_dataset_name LIKE '%LA Metro%' AND date BETWEEN '2024-05-01' and '2024-05-20' 

Okay. So it's 5/28. There are entries for everyday for LA metro Bus from 5-01 to 5-28.

The only change I made was updating airtable to it correct settings - re-entering La Metro Bus into the pipeline.

I don't know why the data populated after the Sunday full refresh.

Maybe we should try to figure out why one day.

The GTFS download configs (e.g. from 2024-05-15) are still missing "LA Metro Bus Schedule".

# find the base64_url for LA Metro
SELECT DISTINCT
  date,
  feed_key,
  gtfs_dataset_name,
  base64_url
 FROM `cal-itp-data-infra.mart_gtfs.fct_daily_schedule_feeds` 
 WHERE gtfs_dataset_name LIKE '%LA Metro%' AND date BETWEEN '2024-05-01' and '2024-05-20' 

Okay. So it's 5/28. There are entries for everyday for LA metro Bus from 5-01 to 5-28.

The only change I made was updating airtable to it correct settings - re-entering La Metro Bus into the pipeline.

I don't know why the data populated after the Sunday full refresh.

Maybe we should try to figure out why one day.

  • frequent enough full refreshes on schedule data! #2568
  • I'm digging for a note or a doc or an issue about how if a feed was marked as deprecated, but it was still valid, i think once you fix it in airtable, the dates that were missing due to the error come back in and populate it.
    • because that feed is reinstated, and the dates are unpacked from dim_ (that row would appear here in the expanded date range). ex: it was marked as _valid_from=1/1/2024 and _valid_to=5/21/2024 (incorrectly), but now you fixed it to be _valid_to=5/31/2024.
    • fct_ (pulls from all the dates available in dim_ and creates a row for every date). previously it would have populated up to 5/21/24, so if we looked at the warehouse on 5/24/24, we'd be missing 3 dates. but now that more dates are available, if, looked at it on 5/24/24, those 3 dates would get reinstated.

@vevetron: are you able to mark data_quality_pipeline as True in Airtable for LA Metro? I think the nulls are what's throwing it off in my work. Right now, it's a null, and that gets filtered out.

SELECT
  key,
  name,
  base64_url,
  data_quality_pipeline,
  _valid_from,
  _valid_to,
 FROM `cal-itp-data-infra.mart_transit_database.dim_gtfs_datasets` 
 WHERE name LIKE '%LA Metro%' AND type = 'schedule'

There is an element of is it easier to fix it in the analysis or in the raw data. Overall it probably makes more sense to fix this in the raw data in the long term, especially if it’s impacting analysis.

I need to confirm but I can probably edit the data in storage that I described in the ticket and assuming the full refresh captures those changes, this badness would go away.

I’d probably want @evansiroky to authorize that I do this.

Going Backwards from cal-itp-data-infra.mart_transit_database.dim_gtfs_datasets
Already in this table
SELECT * FROM cal-itp-data-infra.staging.int_transit_database__gtfs_datasets_dim
where name LIKE '%LA Metro Bus%'
and
data = 'GTFS Schedule'
order by _valid_from
LIMIT 10

We see LA Metro has been deprecated from 5-04 to 5-21

https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.stg_transit_database__gtfs_datasets

SELECT * FROM cal-itp-data-infra.staging.stg_transit_database__gtfs_datasets
WHERE name LIKE '%LA Metro Bus Schedule%' order by dt;
This has 712 entries

It pulls from:

once_daily_gtfs_datasets AS (
{{ get_latest_dense_rank(
external_table = source('airtable', 'california_transit__gtfs_datasets'),
order_by = 'ts DESC', partition_by = 'dt'
) }}
),

This table has 18 entries with a deprecated date
These ts, dt
2024-05-04 02:00:40.581775 UTC 2024-05-04
2024-05-05 02:00:39.550561 UTC 2024-05-05
2024-05-06 02:00:43.096643 UTC 2024-05-06
2024-05-07 02:00:32.995638 UTC 2024-05-07
2024-05-08 02:00:45.736814 UTC 2024-05-08
2024-05-09 02:00:44.648763 UTC 2024-05-09
2024-05-10 02:00:37.497296 UTC 2024-05-10
2024-05-11 02:00:37.364785 UTC 2024-05-11
2024-05-12 02:00:40.137432 UTC 2024-05-12
2024-05-13 02:00:30.571785 UTC 2024-05-13
2024-05-14 02:00:24.426160 UTC 2024-05-14
2024-05-15 02:00:39.729454 UTC 2024-05-15
2024-05-16 02:00:43.476834 UTC 2024-05-16
2024-05-17 02:00:35.705332 UTC 2024-05-17
2024-05-18 02:00:43.591283 UTC 2024-05-18
2024-05-19 02:00:43.806615 UTC 2024-05-19
2024-05-20 02:00:41.190153 UTC 2024-05-20
2024-05-21 02:00:30.009125 UTC 2024-05-21

provider operator gtfs_service_mapping are also blank

After this date it doesn't have a deprecated date.

This references:
https://dbt-docs.calitp.org/#!/source/source.calitp_warehouse.airtable.california_transit__gtfs_datasets#code
this references
from cal-itp-data-infra.external_airtable.california_transit__gtfs_datasets.

Thus if we fix the airtable archives of gtfs datasets, at least the cal-itp-data-infra.mart_transit_database.dim_gtfs_datasets will trigger and get corrected. But I don't know if it will create a discrepancy with the download gtfs jobs.

image
I ran a script and it added new fixed airtable extracts to :
calitp-airtable

Processing object: california_transit__gtfs_datasets/dt=2024-05-02/ts=2024-05-02T02:00:42.220495+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-04/ts=2024-05-04T02:00:40.581775+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-04/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-04/ts=2024-05-04T02:00:40.581775+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-04/ts=2024-06-11T03:17:36.814359+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-05/ts=2024-05-05T02:00:39.550561+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-05/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-05/ts=2024-05-05T02:00:39.550561+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-05/ts=2024-06-11T03:17:39.963752+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-06/ts=2024-05-06T02:00:43.096643+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-06/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-06/ts=2024-05-06T02:00:43.096643+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-06/ts=2024-06-11T03:17:43.250301+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-07/ts=2024-05-07T02:00:32.995638+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-07/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-07/ts=2024-05-07T02:00:32.995638+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-07/ts=2024-06-11T03:17:46.387455+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-08/ts=2024-05-08T02:00:45.736814+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-08/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-08/ts=2024-05-08T02:00:45.736814+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-08/ts=2024-06-11T03:17:49.514518+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-09/ts=2024-05-09T02:00:44.648763+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-09/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-09/ts=2024-05-09T02:00:44.648763+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-09/ts=2024-06-11T03:17:52.696261+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-10/ts=2024-05-10T02:00:37.497296+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-10/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-10/ts=2024-05-10T02:00:37.497296+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-10/ts=2024-06-11T03:17:56.104758+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-11/ts=2024-05-11T02:00:37.364785+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-11/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-11/ts=2024-05-11T02:00:37.364785+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-11/ts=2024-06-11T03:17:59.440818+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-12/ts=2024-05-12T02:00:40.137432+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-12/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-12/ts=2024-05-12T02:00:40.137432+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-12/ts=2024-06-11T03:18:02.847242+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-13/ts=2024-05-13T02:00:30.571785+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-13/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-13/ts=2024-05-13T02:00:30.571785+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-13/ts=2024-06-11T03:18:05.887538+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-14/ts=2024-05-14T02:00:24.426160+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-14/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-14/ts=2024-05-14T02:00:24.426160+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-14/ts=2024-06-11T03:18:08.969917+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-15/ts=2024-05-15T02:00:39.729454+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-15/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-15/ts=2024-05-15T02:00:39.729454+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-15/ts=2024-06-11T03:18:12.158021+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-16/ts=2024-05-16T02:00:43.476834+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-16/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-16/ts=2024-05-16T02:00:43.476834+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-16/ts=2024-06-11T03:18:15.287720+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-17/ts=2024-05-17T02:00:35.705332+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-17/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-17/ts=2024-05-17T02:00:35.705332+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-17/ts=2024-06-11T03:18:18.174707+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-18/ts=2024-05-18T02:00:43.591283+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-18/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-18/ts=2024-05-18T02:00:43.591283+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-18/ts=2024-06-11T03:18:21.259633+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-19/ts=2024-05-19T02:00:43.806615+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-19/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-19/ts=2024-05-19T02:00:43.806615+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-19/ts=2024-06-11T03:18:24.306259+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-20/ts=2024-05-20T02:00:41.190153+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-20/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-20/ts=2024-05-20T02:00:41.190153+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-20/ts=2024-06-11T03:18:27.379073+00:00/gtfs_datasets.jsonl.gz Processing object: california_transit__gtfs_datasets/dt=2024-05-21/ts=2024-05-21T02:00:30.009125+00:00/gtfs_datasets.jsonl.gz Saved to: california_transit__gtfs_datasets/dt=2024-05-21/./gtfs_datasets_processed.jsonl.gz california_transit__gtfs_datasets/dt=2024-05-21/ts=2024-05-21T02:00:30.009125+00:00/gtfs_datasets.jsonl.gz Uploaded as: california_transit__gtfs_datasets/dt=2024-05-21/ts=2024-06-11T03:18:30.699738+00:00/gtfs_datasets.jsonl.gz

Crossing fingers that it worked.

SELECT
  key,
  name,
  base64_url,
  data_quality_pipeline,
  _valid_from,
  _valid_to,
 FROM `cal-itp-data-infra.mart_transit_database.dim_gtfs_datasets` 
 WHERE name LIKE '%LA Metro Bus%' AND type = 'schedule'
 order by _valid_from;

looks good, no nulls. It's odd the schedule has an update on 5/22 but it's probably okay. LA metro repo doesn't have any changes during this period. Probably an artifact of deprecating then undeprecating.