cal-itp/data-infra

Bug: zero trips showing up in fct_scheduled_trips for select agencies


Describe the bug

There are zero trips in the fct_scheduled_trips table for:

  • B-Line after May 17
  • Long Beach after June 14

To Reproduce
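A query along these lines should show the gap (a sketch: `service_date` and `gtfs_dataset_name` are my guesses at the actual fct_scheduled_trips column names, and I'm assuming the table lives in mart_gtfs):

```sql
-- Trips per service date for B-Line; expect counts to drop to zero after May 17.
-- Swap the name filter / date window for Long Beach (June 14).
SELECT
  service_date,
  COUNT(*) AS n_trips
FROM `cal-itp-data-infra`.mart_gtfs.fct_scheduled_trips
WHERE gtfs_dataset_name LIKE '%B-Line%'
  AND service_date BETWEEN '2024-05-01' AND '2024-06-30'
GROUP BY service_date
ORDER BY service_date;
```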

Expected behavior

There should be scheduled trips appearing in this table for all active dates.

Additional context

B-Line and Long Beach publish their schedules calendar-dates-only (service defined entirely in calendar_dates.txt, with no calendar.txt). However, I'm not sure if this is the cause of the problem.
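One way to confirm the calendar-dates-only pattern, in case it matters (a sketch; `dim_calendar` and `dim_calendar_dates` as mart tables keyed by `feed_key` are assumptions about the schema):

```sql
-- A calendar_dates-only feed should have service_ids in dim_calendar_dates but
-- no matching rows in dim_calendar.
WITH cd AS (
  SELECT feed_key, COUNT(DISTINCT service_id) AS calendar_dates_services
  FROM `cal-itp-data-infra`.mart_gtfs.dim_calendar_dates
  GROUP BY feed_key
),
c AS (
  SELECT feed_key, COUNT(DISTINCT service_id) AS calendar_services
  FROM `cal-itp-data-infra`.mart_gtfs.dim_calendar
  GROUP BY feed_key
)
SELECT
  cd.feed_key,
  cd.calendar_dates_services,
  c.calendar_services  -- NULL here means the feed is calendar_dates-only
FROM cd
LEFT JOIN c
ON cd.feed_key = c.feed_key;
```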

This query seems okay:

```sql
SELECT *
FROM `cal-itp-data-infra`.staging.stg_transit_database__gtfs_datasets
WHERE name LIKE '%B-Line Schedule%'
ORDER BY dt DESC;
```

But...
B-Line appears to have a new GTFS feed whose dates start on 2024-05-19.

By this query:

```sql
SELECT
  key,
  name,
  base64_url,
  data_quality_pipeline,
  _valid_from,
  _valid_to
FROM `cal-itp-data-infra`.mart_transit_database.dim_gtfs_datasets
WHERE name LIKE '%B-Line%' AND type = 'schedule'
ORDER BY _valid_from;
```
We are using a GTFS feed that starts on 3-12.

Maybe we are still using the previous GTFS feed and haven't properly ingested the new one?
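One way to test that hypothesis, reusing the tables from the queries above (a sketch; the join on base64_url assumes both tables share the same URL encoding):

```sql
-- For each B-Line schedule dataset row, count how many processed feeds in
-- dim_stop_times share its URL; zero for the newest row would mean the new
-- feed was never ingested.
SELECT
  d.key,
  d._valid_from,
  COUNT(DISTINCT st.feed_key) AS processed_feeds
FROM `cal-itp-data-infra`.mart_transit_database.dim_gtfs_datasets d
LEFT JOIN `cal-itp-data-infra`.mart_gtfs.dim_stop_times st
  ON d.base64_url = st.base64_url
WHERE d.name LIKE '%B-Line%' AND d.type = 'schedule'
GROUP BY d.key, d._valid_from
ORDER BY d._valid_from;
```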

Sacramento Schedule is also affected; it cuts out around the same date, 5/13.

This ends up in the reports website as well.

So there are a decent number of stop_times that didn't get processed, while the stops did get processed.

If you compare the count of unique feeds used for stops vs. unique feeds used for stop_times, there is a discrepancy:

```sql
WITH s AS (
  SELECT base64_url, COUNT(DISTINCT feed_key) AS feed_key
  FROM `cal-itp-data-infra`.mart_gtfs.dim_stops
  GROUP BY base64_url
),
st AS (
  SELECT COUNT(DISTINCT st.feed_key) AS feed_key, st.base64_url
  FROM `cal-itp-data-infra`.mart_gtfs.dim_stop_times st
  GROUP BY base64_url
)
SELECT
  s.feed_key,
  st.feed_key AS st_key_count,
  s.base64_url,
  CASE
    WHEN REGEXP_CONTAINS(s.base64_url, r'^[A-Za-z0-9+/=]*$')  -- check for valid base64 characters
    THEN SAFE_CONVERT_BYTES_TO_STRING(FROM_BASE64(s.base64_url))
    ELSE NULL  -- or handle the invalid case differently (e.g., return an error message)
  END AS decoded_column
FROM s
LEFT JOIN st
ON s.base64_url = st.base64_url;
```


There is a slight pattern that makes me think the error is caused by this issue:
#3368

So I've thought of two potential fixes for this issue:

  • Wait till #3368 is fixed.
  • Find the bad Airflow jobs and have them re-run in production, either manually or via a script; I'm not sure how to do this, but it's probably doable (see the sketch after this list for enumerating the failed jobs).
    Pros: Should fix all the bad Airflow jobs at the same time.
    Pros: Airflow records will be fixed as well.
    Cons: Slow, hard.
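For the "find the bad Airflow jobs" part, something like this against the Airflow metadata database could enumerate them (a sketch; assumes Airflow 2.x's task_instance / dag_run schema and direct read access to the metadata DB):

```sql
-- List failed task instances so they can be cleared and re-run (from the
-- console, or with `airflow tasks clear`).
SELECT
  ti.dag_id,
  ti.task_id,
  dr.execution_date,
  ti.state
FROM task_instance ti
JOIN dag_run dr
  ON ti.dag_id = dr.dag_id AND ti.run_id = dr.run_id
WHERE ti.state = 'failed'
ORDER BY dr.execution_date;
```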

Another option:

  • Manually re-run the processing / zipping for the affected feeds, and do the same for the shapes that might be missing.
    Pros: Potentially easier, faster.
    Cons: We'd still have bad Airflow records, but the data would be complete.

Based on a discussion with @mjumbewu, we'll hold off on option two and wait for #3368 to get resolved. Then we will go into the Airflow console and manually re-run the failed jobs. There seem to be only about 15, so this is doable.

I re-ran all the failed jobs (slowly, painfully, manually) and got all but 2 to work (those 2 consistently fail). I ran a query to see if this fixed things, but I think I need to wait for a Sunday full refresh for that data to populate.

Not sure if it's worth it to fix those last two.

A better query:

```sql
WITH s AS (
  SELECT base64_url, COUNT(DISTINCT feed_key) AS feed_key
  FROM `cal-itp-data-infra`.mart_gtfs.dim_stops
  GROUP BY base64_url
),
st AS (
  SELECT COUNT(DISTINCT st.feed_key) AS feed_key, st.base64_url
  FROM `cal-itp-data-infra`.mart_gtfs.dim_stop_times st
  GROUP BY base64_url
)
SELECT
  s.feed_key,
  st.feed_key AS st_key_count,
  s.feed_key - st.feed_key AS difference,
  s.base64_url,
  CASE
    WHEN REGEXP_CONTAINS(s.base64_url, r'^[A-Za-z0-9+/=]*$')  -- check for valid base64 characters
    THEN SAFE_CONVERT_BYTES_TO_STRING(FROM_BASE64(s.base64_url))
    ELSE NULL  -- or handle the invalid case differently (e.g., return an error message)
  END AS decoded_column,
  IF(s.feed_key = st.feed_key, 1, 0) AS key_match  -- added statement
FROM s
LEFT JOIN st
ON s.base64_url = st.base64_url
ORDER BY difference DESC;
```

The data looks waaay better now: 18 agencies with 1 unprocessed GTFS feed each.

I spoke with @vevetron, and I think this one can be marked as closed, as we should now have all the needed data.

Both B-Line and Long Beach are resolved for this issue; Sacramento is not. Sacramento's issue is likely that they publish their GTFS ahead of schedule without merging in the previous GTFS, and we don't handle that scenario.
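A check for that scenario might look like this (a sketch; assumes a dim_calendar_dates mart table with feed_key and date columns, and you'd want to filter to a single agency's feeds first, e.g. by joining on base64_url):

```sql
-- Flag coverage gaps between consecutive feed versions: gap_days > 1 means the
-- new feed's service starts after the previous feed's service ends.
WITH feed_ranges AS (
  SELECT
    feed_key,
    MIN(date) AS first_service_date,
    MAX(date) AS last_service_date
  FROM `cal-itp-data-infra`.mart_gtfs.dim_calendar_dates
  GROUP BY feed_key
)
SELECT
  feed_key,
  first_service_date,
  last_service_date,
  DATE_DIFF(
    first_service_date,
    LAG(last_service_date) OVER (ORDER BY first_service_date),
    DAY
  ) AS gap_days
FROM feed_ranges
ORDER BY first_service_date;
```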