Bug: zero trips showing up in fct_scheduled_trips for select agencies
Closed this issue · 9 comments
Describe the bug
There are zero trips in the fct_scheduled_trips table for:
- B-Line after May 17
- Long Beach after June 14
Expected behavior
There should be scheduled trips appearing in this table for all active dates.
Additional context
B-Line and Long Beach use a calendar_dates-only style for publishing their schedules (service defined via calendar_dates.txt rather than calendar.txt). However, I'm not sure whether this is the cause of the problem.
This query seems okay:
SELECT * FROM `cal-itp-data-infra.staging.stg_transit_database__gtfs_datasets`
WHERE name LIKE '%B-Line Schedule%' ORDER BY dt DESC;
But...
B-Line appears to have a new GTFS feed whose dates start on 20240519, according to this query:
SELECT
  key,
  name,
  base64_url,
  data_quality_pipeline,
  _valid_from,
  _valid_to
FROM `cal-itp-data-infra.mart_transit_database.dim_gtfs_datasets`
WHERE name LIKE '%B-Line%' AND type = 'schedule'
ORDER BY _valid_from;
But we are still using the GTFS feed whose dates start on 3-12. Maybe we are still on the previous GTFS and haven't properly ingested the new one?
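One way to confirm which feed fct_scheduled_trips is actually drawing from would be a query like the sketch below. This is hedged: I'm assuming fct_scheduled_trips lives in mart_gtfs and has gtfs_dataset_name, service_date, and feed_key columns — adjust to the real schema.

```sql
-- Sketch: which feeds back B-Line's scheduled trips, and through what date?
-- Table path and column names are assumptions about the mart schema.
SELECT
  feed_key,
  MIN(service_date) AS first_service_date,
  MAX(service_date) AS last_service_date,
  COUNT(*) AS n_trips
FROM `cal-itp-data-infra.mart_gtfs.fct_scheduled_trips`
WHERE gtfs_dataset_name LIKE '%B-Line%'
GROUP BY feed_key
ORDER BY last_service_date DESC;
```

If the newest feed_key is missing here, or its last_service_date stops at May 17, that would point at the ingestion side rather than the mart.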
So a decent number of stop_times files didn't get processed, while the corresponding stops files did. If you compare the count of unique feeds used for stops against the count of unique feeds used for stop times, there is a discrepancy:
WITH s AS (
  SELECT base64_url, COUNT(DISTINCT feed_key) AS feed_key
  FROM `cal-itp-data-infra.mart_gtfs.dim_stops`
  GROUP BY base64_url
),
st AS (
  SELECT st.base64_url, COUNT(DISTINCT st.feed_key) AS feed_key
  FROM `cal-itp-data-infra.mart_gtfs.dim_stop_times` st
  GROUP BY base64_url
)
SELECT
  s.feed_key,
  st.feed_key AS st_key_count,
  s.base64_url,
  CASE
    WHEN REGEXP_CONTAINS(s.base64_url, r'^[A-Za-z0-9+/=]*$')  -- check for valid Base64 characters
      THEN SAFE_CONVERT_BYTES_TO_STRING(FROM_BASE64(s.base64_url))
    ELSE NULL  -- or handle the invalid case differently (e.g., return an error message)
  END AS decoded_column
FROM s
LEFT JOIN st
  ON s.base64_url = st.base64_url;
There is a slight pattern that makes me think the error is caused by this issue:
#3368
So I've thought of two potential fixes for this issue:
- Wait until #3368 is fixed.
- Find the bad Airflow jobs and re-run them in production, either manually or scripted. I'm not sure how to do this, but it's probably doable.
Pros: should fix all the bad Airflow jobs at the same time; the Airflow records will be fixed as well.
Cons: slow and hard.
Another option:
- Find where a stop_times.txt doesn't have a matching stop_times.jsonl.gz and manually re-run the processing / zipping, and do the same for any shapes that might be missing.
Pros: potentially easier and faster.
Cons: we'd still have bad Airflow records, but the data would be complete.
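At the warehouse level, an anti-join could list the feeds whose stops made it into the mart but whose stop_times didn't — those would be the candidates for re-running. A sketch, assuming the same feed_key / base64_url columns used in the queries above:

```sql
-- Sketch: feeds present in dim_stops with no matching rows in dim_stop_times.
-- These are the feeds whose stop_times processing likely needs a re-run.
SELECT DISTINCT s.base64_url, s.feed_key
FROM `cal-itp-data-infra.mart_gtfs.dim_stops` s
LEFT JOIN `cal-itp-data-infra.mart_gtfs.dim_stop_times` st
  ON s.feed_key = st.feed_key
WHERE st.feed_key IS NULL
ORDER BY s.base64_url;
```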
I reran all the failed jobs (slowly, painfully, manually) and got all of them to work except 2, which consistently fail. I ran a query to see whether this fixed things, but I think I need to wait for a Sunday full refresh for that data to populate.
Not sure if it's worth it to fix those last two.
A better query:
WITH s AS (
  SELECT base64_url, COUNT(DISTINCT feed_key) AS feed_key
  FROM `cal-itp-data-infra.mart_gtfs.dim_stops`
  GROUP BY base64_url
),
st AS (
  SELECT st.base64_url, COUNT(DISTINCT st.feed_key) AS feed_key
  FROM `cal-itp-data-infra.mart_gtfs.dim_stop_times` st
  GROUP BY base64_url
)
SELECT
  s.feed_key,
  st.feed_key AS st_key_count,
  s.feed_key - st.feed_key AS difference,
  s.base64_url,
  CASE
    WHEN REGEXP_CONTAINS(s.base64_url, r'^[A-Za-z0-9+/=]*$')  -- check for valid Base64 characters
      THEN SAFE_CONVERT_BYTES_TO_STRING(FROM_BASE64(s.base64_url))
    ELSE NULL  -- or handle the invalid case differently (e.g., return an error message)
  END AS decoded_column,
  IF(s.feed_key = st.feed_key, 1, 0) AS key_match  -- added statement
FROM s
LEFT JOIN st
  ON s.base64_url = st.base64_url
ORDER BY difference DESC;
The data looks waaay better now. 18 agencies with 1 unprocessed GTFS.
I spoke with @vevetron and I think this one can be marked as closed as we should have all needed data now.
Both B-Line and Long Beach are resolved for this issue; Sacramento is not. Sacramento's issue is likely because they publish their GTFS ahead of schedule without merging in the previous GTFS, and we don't handle that scenario.
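For the Sacramento-style scenario, a hedged gap check could make the symptom visible: compare each feed's first service date against the previous feed's last service date, and a positive gap_days would show where an ahead-of-schedule feed leaves uncovered dates. As before, the fct_scheduled_trips table path and its gtfs_dataset_name / service_date / feed_key columns are assumptions, and service_date is assumed to be a DATE.

```sql
-- Sketch: detect coverage gaps between consecutive feeds for one agency.
WITH ranges AS (
  SELECT
    feed_key,
    MIN(service_date) AS first_service_date,
    MAX(service_date) AS last_service_date
  FROM `cal-itp-data-infra.mart_gtfs.fct_scheduled_trips`
  WHERE gtfs_dataset_name LIKE '%Sacramento%'
  GROUP BY feed_key
)
SELECT
  feed_key,
  first_service_date,
  LAG(last_service_date) OVER (ORDER BY first_service_date) AS prev_last_service_date,
  -- positive gap_days = dates no feed covers (the "published ahead" hole)
  DATE_DIFF(first_service_date,
            LAG(last_service_date) OVER (ORDER BY first_service_date),
            DAY) AS gap_days
FROM ranges
ORDER BY first_service_date;
```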