cal-itp/gtfs-aggregator-checker

Refactor to use transit.land v1 API

Closed this issue · 1 comment

The current code seems to use a combination of a GraphQL API and scraping to check for the presence of feeds. In order to make sure we are responsibly querying the data, we should be using their API to check for feed presence.

It seems like the way this can be done is by making a query to get all operators in California using this URL: https://api.transit.land/api/v1/operators?apikey=API_KEY&limit=1000&sort_key=id&sort_order=asc&state=US-CA&total=true. In the response, we can iterate through each operator and collect the values in represented_in_feed_onestop_ids. Then, for each of those values, we can make a request to https://api.transit.land/api/v1/feeds/FEED_ID?apikey=API_KEY and check the value(s) in the url or urls field of that response.
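
A minimal sketch of that flow with the standard requests library, taking the endpoints and field names above at face value (the exact shape of the urls field is an assumption, so the sketch handles both a mapping and a list):

```python
import os

import requests

API_KEY = os.environ["TRANSITLAND_API_KEY"]
BASE = "https://api.transit.land/api/v1"


def get_california_feed_ids():
    """Collect every feed onestop_id referenced by a California operator."""
    params = {
        "apikey": API_KEY,
        "limit": 1000,
        "sort_key": "id",
        "sort_order": "asc",
        "state": "US-CA",
        "total": "true",
    }
    resp = requests.get(f"{BASE}/operators", params=params)
    resp.raise_for_status()
    feed_ids = set()
    for operator in resp.json().get("operators", []):
        feed_ids.update(operator.get("represented_in_feed_onestop_ids") or [])
    return feed_ids


def get_feed_urls(feed_id):
    """Return the URL(s) the aggregator advertises for a single feed."""
    resp = requests.get(f"{BASE}/feeds/{feed_id}", params={"apikey": API_KEY})
    resp.raise_for_status()
    feed = resp.json()
    urls = []
    if feed.get("url"):
        urls.append(feed["url"])
    extra = feed.get("urls") or {}
    # Assumed shape: either a mapping of url types to values, or a plain list.
    if isinstance(extra, dict):
        urls.extend(v for v in extra.values() if v)
    else:
        urls.extend(u for u in extra if u)
    return urls
```

Checking whether an agency's feed is present would then reduce to testing whether its URL appears in the union of get_feed_urls(feed_id) over all ids from get_california_feed_ids().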

TL;DR: The transit.land v1 API had a lot of problems. The v2 API works great. PR tomorrow.

I switched to the transit.land API and the preliminary findings aren't great.

  • Using the scraping method (current head of this repo's main branch) matches 37/120 urls.
  • Using the above url (state=US-CA) gives 29/120 urls. I looked into which urls are missing and they are for four domains: 206.128.158.171, max.availtec.com, presidiobus.com, and redondobeachbct.com.
  • On a whim, I extended this to all states (removing US-CA) and that number came to 34/120 urls. It is still missing max.availtec.com (Modesto Area Express). If I manually go to the Modesto feed API url, the feed exists but says "operators_in_feed":[], which is why it didn't appear in the operators endpoint.
  • Going from California to all locations caused a lot of other problems.
    • There's an undocumented throttle, so the script starts returning 429 errors and has to be stopped and restarted (a retry sketch follows this list).
    • ~100 of the feed_ids fail out as 500 errors. Simply typing in a gibberish feed gives a 404 error, so this is some deeper data integrity problem on their part.

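Not in the repo yet, but one way to keep the script running through the throttle instead of stopping and restarting it by hand would be a simple retry with backoff. This is only a sketch; the actual rate limits are undocumented, so the retry count and delays below are guesses:

```python
import time

import requests


def get_with_backoff(url, params=None, max_retries=5, base_delay=2.0):
    """GET a URL, sleeping and retrying whenever the API answers 429."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After if the server sends it; otherwise back off exponentially.
        delay = float(resp.headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(delay)
    return resp
```
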
So as it stands, the v1 API gives inferior results and is a bit more unwieldy. I'm going to try their v2 API and see if that's any better.

20 minutes later: v2 gives the same results (37/120) as scraping the site. It can also be done in 8 requests (by getting 500 feeds at a time). I gotta run, but I'll polish this up first thing tomorrow and have a PR before the start of day PST.
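
For the record, a rough sketch of what the v2 pass looks like, assuming the v2 REST feeds endpoint at https://transit.land/api/v2/rest/feeds with apikey/limit parameters and an after cursor in meta; those field names are from memory rather than the v2 docs, so treat them as assumptions:

```python
import os

import requests

API_KEY = os.environ["TRANSITLAND_API_KEY"]
FEEDS_URL = "https://transit.land/api/v2/rest/feeds"


def get_all_feed_urls():
    """Page through the assumed v2 REST feeds endpoint, 500 feeds per request."""
    urls = set()
    after = None
    while True:
        params = {"apikey": API_KEY, "limit": 500}
        if after:
            params["after"] = after
        resp = requests.get(FEEDS_URL, params=params)
        resp.raise_for_status()
        data = resp.json()
        feeds = data.get("feeds", [])
        for feed in feeds:
            # Assumed: feed["urls"] maps url types to a single URL or a list of URLs.
            for value in (feed.get("urls") or {}).values():
                if isinstance(value, list):
                    urls.update(u for u in value if u)
                elif value:
                    urls.add(value)
        # Assumed: meta.after is the cursor for the next page, absent on the last one.
        after = (data.get("meta") or {}).get("after")
        if not feeds or not after:
            break
    return urls
```

At 500 feeds per page, that works out to the eight or so requests mentioned above.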