google/transit

[GTFS-Flex] Replace areas.txt/stop_areas.txt with locations.geojson MultiPoint feature to describe collections of stops?

westontrillium opened this issue · 12 comments

The use case of completely on-demand stops, also known as "point deviation" (routes with a collection of stops as the service area, at which a rider can be picked up or dropped off in any order within a timeframe), is currently covered by a stop_times.stop_id referencing an area_id containing multiple stops.

Transit, which consumes Flex data, has expressed concern over the use of areas.txt/stop_areas.txt for Flex services, because deep loops between stops and stop_areas.txt references could add unnecessary complexity. They have proposed the alternative of allowing MultiPoint as a possible locations.geojson feature geometry to describe collections of stops.

Are there any concerns with such a change? Questions I have are:

  1. Is a MultiPoint feature containing information duplicative of a stop(s) in stops.txt problematic (i.e., a stop may be described in two different places, but with a different primary key)?
  2. If a MultiPoint feature referred to geolocations already referenced in stops.txt, would that cause complications with consumers parsing which trips to return?
  3. If question 1 or 2 is the case, is there some needed id relationship to stops.txt, akin to what @e-lo has brought up in past conversations?
  4. Is cataloguing points/stops in a file format other than .csv opening Pandora's box even more?

The two other use cases for including areas.txt/stop_areas.txt in Flex data are discussed in this issue; there I posit that these can be covered without areas.txt/stop_areas.txt.

I'm not that concerned with trip planning aspects, but more so with other aspects of passenger information, such as departure boards, online/printed timetables etc., which rely on a single stop id being used consistently. I think that each physical stop should be represented with a unique identifier inside a feed, and its usage should be mandated everywhere.

I do not understand the concern about deep loops: location groups, as they were previously called, could only reference either stops or locations, not other location groups. That way there could never be any nesting. I'm not sure if this was overlooked when going from using separate groups to the areas.txt from fares-v2.
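To make the "no nesting" rule concrete, here is a minimal sketch of how a validator might flag a group whose member is itself a group. The dictionary layout and the function name find_nested_groups are hypothetical, chosen only for illustration; nothing here is taken from the spec or from an existing validator.

```python
# Hypothetical in-memory representation: group id -> list of member ids.
# Member ids are expected to be stop ids or location ids, never group ids.
def find_nested_groups(group_members):
    """Return (group_id, member_id) pairs where a member is itself a group id."""
    group_ids = set(group_members)
    return [
        (group_id, member)
        for group_id, members in group_members.items()
        for member in members
        if member in group_ids
    ]


groups = {
    "groupA": ["stop1", "stop2"],
    "groupB": ["stop3", "groupA"],  # nesting: groupB contains groupA
}
nested = find_nested_groups(groups)
print(nested)
```

If members are restricted to stops and locations, this list is always empty and no cycle can form.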

An example stop that would be affected is https://www.ostgotatrafiken.se/hallplats/bergslagstorget which is served by regular lines 181 and 182, as well as flexible services FI01-FI08, which all cover an area and up to five stops.

Someone from Transit should confirm. I believe part of their concern was situations where "stopB" is doing all of this:
Screenshot 2023-08-16 at 12 25 27 PM

@hannesj Do you think it could be worth discussing the possibility of reverting back to something (exactly?) like location_groups.txt to describe stop collections, minus polygons? The original justification for switching to areas.txt/stop_areas.txt to describe groups of flexible zones/stops was that those files were already part of Fares v2 and offered the same functionality as location_groups.txt.

But doesn't that diagram just show a well-connected graph? Sure, you can do silly things that will be hard/impossible to compute, but isn't that the responsibility of the producer?

If you have stop areas that may make computing fares ambiguous, should you not create a separate area just for the flex service?

Edit: Now that I said it "shifting responsibility to the producer" is probably asking for trouble.

From a producer point of view, the geojson feature alternative gives me pause just because we'd also need to deal with points that exist in stops.txt and geojson features. I'm also not sure we want to start representing stops data in a file other than stops.txt.

It would be possible to use GeometryCollections instead of MultiPoint to allow for each point of a collection of stops to refer to a stop_id to capture metadata like stop_name/code, but then you're still having to refer to several files (stop_times>locations.geojson>stops versus stop_times>areas>stop_areas*>stops).

Instead of a foreign key relationship, you could just add stop_name/stop_code fields to each feature in the GeometryCollection, but it just seems strange to me to reconstruct data that already exists elsewhere instead of just referencing it.

Either of these solutions is more burdensome for a producer than what is already in the spec.

*A location_groups equivalent would be one less step, for what it's worth.

npaun commented

Let me try to draw @westontrillium's diagram in separate stages, to illustrate our concern with the current implementation of location groups and polygonal stops.

Existing GTFS features

Screenshot 2023-08-22 at 11 47 43 AM

stop_areas.stop_id and stop_times.stop_id are foreign keys referencing stops.txt, and one would have the expectation that all fields named stop_id relate to stops.txt in some way.

Current state of GTFS Flex proposal

Screenshot 2023-08-22 at 12 13 09 PM

Things have gotten a bit complicated:

  • stop_times.stop_id is now a special type of key referring to either stops.stop_id or stop_areas.area_id or the id of a Feature in locations.geojson.
  • stop_areas.stop_id now refers to either stops.stop_id or a Feature's id. (Maybe it could also refer to stop_areas.area_id for consistency with stop_times.stop_id - but now we have a cyclic data structure.... errrh...)
  • Some other fields like, say, stops.parent_station or transfers.from_stop_id continue to refer only to stops.txt? Not sure.
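A rough sketch of what this overloaded key means for a consumer: each stop_times.stop_id has to be looked up in three id namespaces, and nothing in the key itself says which one applies. The function name resolve_stop_time_ref and the sample ids are hypothetical.

```python
def resolve_stop_time_ref(ref, stop_ids, area_ids, feature_ids):
    """Return the names of every source in which `ref` is defined as an id."""
    namespaces = {
        "stops.txt": stop_ids,
        "stop_areas.txt": area_ids,
        "locations.geojson": feature_ids,
    }
    return [name for name, ids in namespaces.items() if ref in ids]


# Illustrative ids only; "stopB" deliberately exists in two namespaces.
stop_ids = {"stopA", "stopB"}
area_ids = {"area1", "stopB"}
feature_ids = {"zone1"}

for ref in ("stopA", "stopB", "zone1", "missing"):
    print(ref, resolve_stop_time_ref(ref, stop_ids, area_ids, feature_ids))
```

A result with more than one entry is an ambiguous reference, and an empty result is a dangling one; a consumer has to handle both cases for every stop_times row.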

Also, we've duplicated certain fields like stops.stop_name and a Feature's stop_name.

Transit's proposal

Screenshot 2023-08-22 at 11 55 07 AM
  • We propose that the foreign key relation to stops.txt be preserved.
  • A new location_type=5 is introduced for Flex areas (final name TBD).
    • If location_type=5 then stops.location_id is conditionally required, and refers to a Feature's id.
    • stops.stop_lat and stops.stop_lon are conditionally forbidden. These fields are already optional for some location_types so it isn't a breaking change.
  • Metadata is removed from locations.geojson, so stops.stop_name is the only place to name a stop, for example.
  • Features could have MultiPolygon geometry (for service areas), or MultiPoint geometry (to replace location groups).
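As an illustration of the conditional rules in this proposal, the following sketch checks stops.txt rows with location_type=5. The column stops.location_id and the value 5 come from the proposal above and are not part of published GTFS; the sample rows and the function name check_flex_rows are made up for the example.

```python
import csv
import io

# Sample stops.txt content; location_id is the proposed (not yet adopted) column.
STOPS_TXT = """\
stop_id,stop_name,location_type,stop_lat,stop_lon,location_id
stopA,Main St,0,58.41,15.62,
zone1,Downtown Flex Zone,5,,,feature_1
zone2,Broken Zone,5,58.40,15.60,
"""


def check_flex_rows(stops_csv):
    """Return (stop_id, message) pairs for location_type=5 rows that break the rules."""
    errors = []
    for row in csv.DictReader(io.StringIO(stops_csv)):
        if row["location_type"] != "5":
            continue
        if not row["location_id"]:
            errors.append((row["stop_id"], "location_id is required"))
        if row["stop_lat"] or row["stop_lon"]:
            errors.append((row["stop_id"], "stop_lat/stop_lon are forbidden"))
    return errors


errors = check_flex_rows(STOPS_TXT)
print(errors)
```

Here zone1 passes, while zone2 fails both checks: it has coordinates and no location_id.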

Outstanding issues

  • For MultiPolygon, we believe our proposal neatly solves a lot of difficult-to-implement parts of the existing spec.
  • Unfortunately, for MultiPoint (location groups), it introduces problems of its own: the members of location groups aren't stops and don't have their own metadata anymore.
    • We can allow a limited amount of metadata by treating location groups as GeometryCollections.
  • At this point we turn to y'all for input. Together, can we brainstorm a way to handle location groups with minimal complexity?

I understand the desire to simplify referencing, but I really do not like the idea of needing to maintain identical data for single stops in two different places (locations.geojson and stops.txt). Thinking of some alternatives to weigh this against...

Just triple-checking an assumption I've had: is there really no precedent for changing a column in the spec from "Required" to "Conditionally Required", or is that truly not considered a backwards-compatible change? Flex already changes the Conditional Requirement of arrival_time: "- Required for the first and last stop in a trip (defined by stop_times.stop_sequence)"...

Because if we could do that to stop_times.stop_id, that could solve the issue of it referencing stop_id, location id, or area_id. Instead, we could have new columns in stop_times for directly referencing a location id or area_id (location_group_id, or whatever), and that record could exclude the now conditionally required stop_times.stop_id. This is what I believe @flocsy touched on in reviewing the Flex PR.

I hesitate to include this, as it's thinking waaay outside the box (I'm trying everything here!), but if we can't get around the required stop_times.stop_id, would it be possible to add an "array" type column to stops.txt to have a stops.txt record reference multiple stop_ids as a "stop group?" The individual column could have its "arrayed" values separated by a space, a pipe, or even be in a JSON-like bracketed array format. So you would have in stops.txt:

stop_id  location_type  stop_group_array
group1   5 [or 6]       stopA stopB stopD stopG

Then in stop_times.txt:

trip_id  stop_id  stop_sequence
weekday  group1   1

Fully acknowledging this is highly unorthodox and likely an impossibility. At the very least, it was a good thought exercise for me :)
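For what it's worth, the stop_group_array idea above could be consumed roughly as follows. This is only a sketch assuming the space-separated variant; the column name stop_group_array is the one floated in this comment, and the helper names are hypothetical.

```python
def parse_stop_group(value):
    """Split a space-separated stop_group_array value into member stop ids."""
    return value.split()


def member_stops(stop_time, stops):
    """Return the stops a stop_times row refers to: group members, or the stop itself."""
    record = stops[stop_time["stop_id"]]
    members = parse_stop_group(record.get("stop_group_array", ""))
    return members or [stop_time["stop_id"]]


# Rows mirroring the example above, keyed by stop_id.
stops = {
    "group1": {"location_type": "5", "stop_group_array": "stopA stopB stopD stopG"},
    "stopA": {"location_type": "0"},
}
stop_time = {"trip_id": "weekday", "stop_id": "group1", "stop_sequence": 1}

print(member_stops(stop_time, stops))
```

A pipe-separated or JSON-style array would only change parse_stop_group; the lookup through stops.txt stays the same.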

npaun commented

Those are both interesting ideas, @westontrillium.

stop_times.area_id

Off the top of my head, I think this would be a valid approach. I'd need to think about this more with my team though.

location_type=6

We can mechanically transform stop_group_array into a single-valued column (see the diagram below), if we want to avoid introducing new data types. How do you feel about the result: is it something worth thinking about further?

Screenshot 2023-08-22 at 4 11 24 PM

@westontrillium to your point, I think there is plenty of precedent for making changes to the spec that generally break backwards compatibility for feed consumers (as opposed to feed producers), though of course, we try to avoid it if we can. But GTFS-Flex is going to be one giant breaking change for feed consumers no matter how you slice it :)

To your specific point, there is precedent for changing an existing Required field to "Conditionally Required" if we can define a reasonable condition. And having an area_id or location_id specified could reasonably be that condition.

I like the direction that this discussion is taking. I've long felt that the foreign key relationships (or lack thereof) were not as "tight" as they could be, but couldn't actually put my finger on it. Keeping all data other than the actual geometry in stops.txt is a good move to me.

I think new location_types are a good idea and thought about this previously. I don't think it's technically required but it will probably be very useful for consumers who have never heard of flex and gives them something to Google.

However, I think that using MultiPoints or any other collection types in locations.geojson is a bad idea. So is introducing a collection column.

If you don't want to lump stop_areas and location groups together I would prefer going back to an explicit location_groups.txt.

At this point, it looks like we're discussing two distinct options, yes?

  1. Add stop_times.location_id and stop_times.location*_group_id columns, make stop_times.stop_id conditionally required. stop_times.txt still directly references locations.geojson and location_groups.txt, but each has its own foreign key column in stop_times.txt.
  2. Add a stops.location_type=5 and stops.location_type=6 for GeoJSON Polygons/MultiPolygons and location groups, respectively, and add stops.location_id and stops.location_group_id columns which reference an associated locations.geojson or location_groups.txt value. stop_times.stop_id can reference a stop_id that in turn references a location_id or location_group_id.

As a producer, I prefer Option 1, as it is much simpler to implement. Option 2 has more reference steps, requires creating more data (since you'd need to generate a stops.txt entry for each location/location group), and is more burdensome to maintain long term due to the requirement of sustaining parity between a record's metadata in stops.txt and its locations.geojson/location_group data. These issues would be compounded for smaller producers.

*If this is amended to only refer to stops, should the name change to something else, or do we leave the potential to be able to include other location types later...?
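To illustrate Option 1 above, here is a minimal sketch of the check a consumer might run once stop_times.stop_id becomes conditionally required. It assumes the condition is "exactly one of the three reference columns is filled", which is an interpretation rather than agreed wording; the column names are the ones proposed in this thread, and referenced_column is a hypothetical helper.

```python
# Reference columns under Option 1; the last two are proposed, not published.
REF_COLUMNS = ("stop_id", "location_id", "location_group_id")


def referenced_column(stop_time_row):
    """Return the single reference column that is filled, or raise ValueError."""
    filled = [column for column in REF_COLUMNS if stop_time_row.get(column)]
    if len(filled) != 1:
        raise ValueError(f"expected exactly one of {REF_COLUMNS}, got {filled}")
    return filled[0]


row = {"trip_id": "weekday", "stop_id": "", "location_id": "zone1", "location_group_id": ""}
print(referenced_column(row))
```

Because each column points at exactly one file, the consumer never has to guess which namespace an id belongs to.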

npaun commented

We prefer Option 2, because it ensures that the linkage to locations.geojson and location_groups.txt only exists in one place, stops.txt. With Option 1, we'd need to add these linkages to at least stop_times.txt and stop_areas.txt, with more cases theoretically possible in the future.

Option 2 also keeps metadata, such as the stop name, in a single place for all location kinds: regular stops, entrances, stations, generic nodes, boarding areas, polygon stops and location groups.

That said, we think both Option 1 and Option 2 represent a significant improvement over the status quo.

After a discussion with @tzujenchanmbd, I am closing this issue; it has been included in #433.