Use static GTFS data and Python to find "true" end stations (at the end of a line and with no connections) for subway (U-Bahn) and suburban rail (S-Bahn) lines in Berlin.
Even though it would be easier to just look at a map.
My friend Lou has a project. They want to visit all the end stations on the Berlin S-Bahn and U-Bahn system. However, to count as a truly satisfying end station, the station must not only be at the end of its own line, but must also lack an interchange to another line. Lou wants only the end-iest of end stations.
I want to find the end stations and present them to Lou in a format they can use for their project.
Berlin provides public transport data in the now-standard GTFS format, which stands for General (formerly Google) Transit Feed Specification. It is an open standard used by transit agencies to present their data in a uniform und useable way. It encompasses two types of data: realtime and static.
- The realtime data is what drives (pun intended) our on-the-go transit apps and notifies us of delays and changes.
- The static data (also known as schedule data) tells us and our apps about how all the points on the network are linked together. This is the GTFS data that will allow us identify the end stations.
GTFS is a data format in its own right, meant for "GTFS-consuming applications". However, conveniently, it is also easily readable in other software environements because it is merely a ZIP comprised of CSV (as TXT) files. The exact tables provided vary by agency. The following are the minimum required table files for static GTFS:
stops.txt
showing stops' name and locationstop_times.txt
showing every stop on every individual triptrips.txt
showing trips' route, direction, destination* (headsign), and moreroutes.txt
showing routes' transportation type, number, line colour, and moreagency.txt
My GTFS download for Berlin also contains these conditionally required and optional tables:
frequencies.txt
(note: empty),calendar.txt
andcalendar_dates.txt
showing service frequency, day and date availabilitypathways.txt
andlevels.txt
showing platform positions inside stationsshapes.txt
showing a trip's path, expressed as a sequence of co-ordinatestransfers.txt
showing the connections between stations and details about these transfers
GTFS does not contain a simple list of stops
on route
. Instead, stop_ID
s appear as sequences of stop_times
in individual trips
, which in turn belong to various routes
. We need to join the IDs of stops, trips and routes to the (giant) stop_times.txt
.
I would therefore need to combine the following tables and fields to find the end stations if I were to make a database:
stop_times.txt
for the sequence of stops
- "trip_id" (primary key 1) (foreign key of trips.trip_id)
- "stop_sequence" (primary key 2)
- "arrival_time"
- "departure_time"
- "stop_id" (foreign key of stops.stop_id)
- "pickup_type"
- "drop_off_type"
transfers.txt
for the transfers between different routes (or in our case, the lack thereof)
- "from_stop_id" (primary key 1) (foreign key of stops.stop_id)
- "to_stop_id" (primary key 2) (foreign key 2 of stops.stop_id)
- "from_route_id" (primary key 3) - might not be needed
- "to_route_id" (primary key 4) - might not be needed
- "transfer_type"` see docs
- "min_transfer_time"
trips.txt
for matching the trip IDs to routes
- "trip_id" (primary key)
- "route_id" (foreign key of routes.route_id)
- "trip_headsign" *
- "trip_short_name" (can be route number but appears not to be used for train numbers here)
- "direction_id"
stops.txt
for matching the stop IDs to stops
- "stop_id" (primary key)
- "stop_name"
- "stop_lon"
- "stop_lat"
* the "headsign" destination is not to be confused with the end of a line; some trips do not travel the whole route. In our data this is a text field rather than an ID.
There are many packages designed to handle GTFS data in Python, created by various developers. These are mainly used for geographical and frequency data, but I found three that I may be able to use to find the end stations in Python, more easily than by manually creating a database:
Package | Data Structure | Potential Functions | |
---|---|---|---|
1. | gtfs_kit |
Pandas | gtfs_kit.stops.get_stops and gtfs_kit.validators.check_transfers |
2. | transit_service_analyst |
Pandas | get_line_stops_gdf() |
3. | gtfs_utils |
Pandas | route_stats and trip_stats |
I also noted down a few tools for SQL, in case the above didn't do what I needed:
node-gtfs
SQLitegtfs-via-postgres
PostgreSQLgtfs-schema
Schema buildergtfsdb
Works with various SQL
After testing the various packages, gtfs_kit
was the one that met my data requirements the closest. This module uses Pandas and Shapely to manipulate the GTFS data. read_feed
loads our GTFS data into an instance of the module's Feed
class, ignoring non-GTFS files and stripping whitespaces from column names as it does so.
Knowing that the Berlin data is using Extended GTFS journey types, I extract the following route_type
s directly by number, from the result of get_routes
:
- 109: Suburban Railway. Examples: S-Bahn (DE)
- 400: Urban Railway Service
- Note: In Berlin, not "402: Underground Service", despite the Extended specification giving U-Bahn as an example of this type