google/transit

Best Practice Suggestion: Permalink to GTFS feeds should be on same domain as the transit agency's main website

evansiroky opened this issue · 7 comments

User Story

As a consumer of hundreds of different GTFS feeds,
I would like to avoid having to update our database of which URL to download a transit agency's feeds from,
So that I can consistently download each transit agency's most up-to-date data even if they change their internal GTFS publishing process.

Description of Problem

We have had a significant number of problems downloading current data from transit agencies because they change the URL(s) from which their GTFS data is downloaded. In a number of cases, we failed to discover out-of-date feeds for months. Transit agencies can and frequently do change the vendors that publish their GTFS data, and when that happens, the new vendors often come with new URLs. Sometimes the transit agencies update their "GTFS Page" showing a list of URLs, but consumers should not have to monitor such pages for changes, and automating such a check may be difficult anyway.
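As a rough illustration of how a consumer might notice an out-of-date feed sooner, here is a minimal sketch that reads the latest `end_date` from `calendar.txt` in a downloaded GTFS archive and compares it with today's date. The function names `last_service_date` and `looks_stale` are hypothetical, and the check is only a heuristic: it assumes the feed defines service in `calendar.txt`, so feeds that rely solely on `calendar_dates.txt` are reported as unknown rather than stale.

```python
import csv
import io
import zipfile
from datetime import date, datetime


def last_service_date(gtfs_zip_bytes):
    """Return the latest end_date found in calendar.txt, or None if unavailable."""
    with zipfile.ZipFile(io.BytesIO(gtfs_zip_bytes)) as archive:
        if "calendar.txt" not in archive.namelist():
            return None
        with archive.open("calendar.txt") as raw:
            # utf-8-sig tolerates a leading byte-order mark in the file.
            rows = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8-sig"))
            # GTFS dates are written as YYYYMMDD.
            dates = [
                datetime.strptime(row["end_date"].strip(), "%Y%m%d").date()
                for row in rows
                if row.get("end_date")
            ]
    return max(dates) if dates else None


def looks_stale(gtfs_zip_bytes, today):
    """Heuristic: the feed is stale if all calendar.txt service ended before today."""
    end = last_service_date(gtfs_zip_bytes)
    return end is not None and end < today
```

A consumer could run `looks_stale(downloaded_bytes, date.today())` after each fetch and flag the feed for manual review when it returns `True`, which may indicate that the agency has moved its current data to a new URL.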

Proposal

I propose adding language to the best practices/specification saying that a public transit operator should publish permalinks to its GTFS data on the same domain as its own website, rather than exclusively on a vendor's domain. Following this practice, transit agencies would update their permalinks to redirect to whichever URL currently serves the most up-to-date version of their GTFS data.
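To make the proposal more concrete, here is a minimal sketch of what such a permalink could look like on the agency side, using only the Python standard library. Everything specific in it is an assumption for illustration: the path `/gtfs/schedule.zip`, the vendor URL on `vendor.example.com`, and the names `CURRENT_FEED_URLS`, `resolve_permalink`, `PermalinkHandler`, and `serve`. In practice the same effect could likely be achieved with a redirect rule in whatever web server or CMS the agency already uses.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from a stable path on the agency's own domain to the
# place where the feed is currently hosted. Only this table needs to change
# when the agency switches vendors.
CURRENT_FEED_URLS = {
    "/gtfs/schedule.zip": "https://vendor.example.com/feeds/agency/latest.zip",
}


def resolve_permalink(path, targets=CURRENT_FEED_URLS):
    """Return (status, location) for a requested permalink path."""
    target = targets.get(path)
    if target is None:
        return 404, None
    # 302 is a temporary redirect, so clients should keep requesting the
    # permalink instead of remembering the vendor URL.
    return 302, target


class PermalinkHandler(BaseHTTPRequestHandler):
    """Answers GET and HEAD requests with a redirect to the current feed URL."""

    def do_GET(self):
        status, location = resolve_permalink(self.path)
        self.send_response(status)
        if location is not None:
            self.send_header("Location", location)
        self.end_headers()

    do_HEAD = do_GET


def serve(port=8000):
    """Run the redirect service locally (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), PermalinkHandler).serve_forever()
```

With something like this in place, a consumer would store only the agency-domain URL, and any HTTP client that follows redirects would end up at the current vendor URL without a database update.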

This does not make sense to me, since the agency does not have to be the initiator of the GTFS publication in the first place. What you want is called a "National Access Point", where data providers are required to register their datasets along with the available metadata. It works well in Europe and would make sense in the rest of the world.

+1 This will be very useful!

Since the agency does not have to be the initiator of the GTFS publication in the first place.

That's why it would be a best practice, not a requirement. I agree that this would be useful not only for URL stability but also for many of the issues I've heard about with making sure you are using the "official" schedule that the agency wants you to use, since quite a few agencies have more than one floating around.

I would like to avoid having to update our database of which URL to download a transit agency's feeds from,
So that I can consistently download each transit agency's most up-to-date data even if they change their internal GTFS publishing process.

My reasoning is that your user-story should be resolved in a better way, not by scraping agency websites.

Hello @skinkie. Thanks for your feedback. In this case my organization (the State of California) maintains something similar to what you describe as a national access point. We maintain our own list of GTFS datasets, which we publish here: https://data.ca.gov/dataset/cal-itp-gtfs-ingest-pipeline-dataset/resource/e4ca5bd4-e9ce-40aa-a58a-3a6d78b042bd

We manually maintain those URLs as best we can because this is our only option at this time. We frequently run into issues with outdated data because transit agencies are not required to notify us when they update their data. While there is now a mandate in our country for most transit agencies to provide their URLs, they are only mandated to do so for GTFS Schedule data and not realtime. They also only report this once a year at most, to a federal agency. Furthermore, we are uncertain whether we will have access to these URLs from the federal agency.

At this time, the creation of a mandate to have URLs reported is outside of our control. And even if there were a mandate, there may be transit agencies that forget to provide their most up-to-date URL when they change vendors. Or they may only be required to report it once a year, creating a potentially large gap between when they change their URLs and when they report them. On top of that, there may be an additional gap between when the data is reported and when the agency receiving the report makes the URLs available to other organizations such as ours.

Given all of this, I still recommend creating this best practice to aid feed aggregators and entities producing a national access point, as well as direct data consumers.

In Europe, the NAP model has solved this issue in a handful of countries so far. The political effort that went into clarifying the source of truth in each of these is commendable and a huge achievement. It might be helpful to provide guidance on how that has been achieved in places like Austria, the Netherlands, and Norway.

Meanwhile, many NAPs, including Germany's, host numerous overlapping datasets. Other NAPs are offline entirely or lack transit data altogether. In addition to what @evansiroky shared, this all seems to indicate that we shouldn't depend only on centralized management being achieved universally, from either a regulatory or a resourcing standpoint.

I don’t know about calling this a best practice, as there are factors that may not make this the recommended course of action, and I’m skeptical that advocating for the use of an agency’s official URL would do much to guarantee more stability. Whether for organizational or funding reasons, sometimes an agency’s URL does not match its name (e.g., “ECCOG” vs “Outback Express”), and agencies rebrand and change their name or URL, procure new websites, or merge with others. An agency may choose not to publish GTFS on its own domain for a variety of reasons. Trillium publishes feeds at data.trilliumtransit.com, oregon-gtfs.com, etc., many for small agencies or cities that don’t have the capability to publish data at their own domain or whose website content management system poses barriers (e.g., an unavoidable automation that changes the suffix every time a new file is uploaded). Establishing the use of agency domains as a best practice seems a bit restrictive given the breadth of circumstances that might steer an agency toward a different approach. I can understand advocating for as much agency control as possible over how and where their data is published, but that doesn’t really seem to be the topic of this discussion.

Looking at the user story…

…avoid having to update our database of which URL to download a transit agency’s feeds from,
So that I can consistently download each transit agency’s most up-to-date data even if they change their internal GTFS publishing process

The direct factor in avoiding having to constantly update a database’s fetch URLs is simply that those URLs don’t change, regardless of what the domain might be. But this is already a best practice; those agencies (and vendors) with constantly changing URLs are just not following it. Apart from these cases, though, URLs still change for a variety of legitimate reasons. So is there an alternative to mitigate this pain point other than by creating an additional best practice? Perhaps this is where the establishment of something like https://database.mobilitydata.org/ as a single source of truth could come into play…?

I would also be interested to hear from other consumers on this pain point.