planarnetwork/dtd2mysql

Missing stops when train reverses movement direction / cif location suffixes

Closed this issue · 2 comments

mk-fg commented

When checking the resulting GTFS feed for train_uid Q21052 against the southeasternrailway.co.uk API, I found this curious mismatch:

Diff details:
  Matching journey trip [ gtfs -vs- api ]:
  ...
  TWI 18:22:00          TWI 18:22:00
  STW 18:25:00          STW 18:25:00
  TED 18:29:00          TED 18:29:00
  HMW 18:32:00          HMW 18:32:00
  KNG 18:42:00          KNG 18:42:00
                      > HMW 18:44:00
                      > TED 18:46:00
  FLW 18:51:00          FLW 18:51:00
  HMP 18:54:00          HMP 18:54:00
  ...

Note how the train apparently reverses direction at KNG, passes the two previous stops again (in reverse order), and then heads off elsewhere.

This is also indicated by the "RN" activity flag for the KNG stop in the CIF data (see #14 for more details on these flags). Both the MCA file and the data imported into MySQL (via --timetable) have these reverse-direction stops in them, but with a "suffix":

+----------+----------+----------+
| location | ts_arr   | ts_dep   |
+----------+----------+----------+
...
| SHCKLGJ  | NULL     | NULL     |
| TEDNGTN  | 18:28:00 | 18:29:00 |
| HAMWICK  | 18:31:00 | 18:32:00 |
| KGSTON   | 18:34:00 | 18:42:00 |
| HAMWICK2 | 18:44:00 | 18:44:00 |
| TEDNGTN2 | 18:46:00 | 18:46:00 |
| SHCKLGJ2 | NULL     | NULL     |
| FULWELL  | 18:51:00 | 18:51:00 |
...

Such a numeric suffix seems to be part of the CIF specification, as per page 21 of "CIF USER SPEC v29 FINAL.pdf" (see #14 for the URL) or page 14 of RSPS5046.

Both basically say that the CIF field should be "tiploc + suffix", but it seems to be parsed without normalization (i.e. without splitting the suffix into its own value), and the resulting field is then used as the tiploc in a JOIN when building GTFS, which it technically isn't.
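To illustrate the failure mode, here is a minimal sketch that emulates an inner join on the unmodified location value. The `crsByTiploc` map and `joinStops` helper are hypothetical, not taken from dtd2mysql, and the TIPLOC-to-CRS pairs are inferred from the two listings above by matching times:

```typescript
// Hypothetical TIPLOC -> CRS lookup, inferred from the listings above.
// SHCKLGJ has no public times in the listing, so it has no entry here.
const crsByTiploc = new Map<string, string>([
  ["TEDNGTN", "TED"],
  ["HAMWICK", "HMW"],
  ["KGSTON", "KNG"],
  ["FULWELL", "FLW"],
]);

// Location values as they appear in the imported schedule rows.
const locations: string[] = [
  "SHCKLGJ", "TEDNGTN", "HAMWICK", "KGSTON",
  "HAMWICK2", "TEDNGTN2", "SHCKLGJ2", "FULWELL",
];

// Emulates an inner join keyed on the raw location value:
// rows without an exact key match are silently dropped.
function joinStops(locs: string[]): string[] {
  const stops: string[] = [];
  for (const loc of locs) {
    const crs = crsByTiploc.get(loc);
    if (crs !== undefined) {
      stops.push(crs);
    }
  }
  return stops;
}

const joined = joinStops(locations);
console.log(joined.join(" ")); // TED HMW KNG FLW
```

"HAMWICK2" and "TEDNGTN2" match no key, so the second passes through HMW and TED disappear, which is the same gap as in the diff above.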

I'm not sure this "suffix" is ever useful: since the stops are sequential anyway, it can easily be derived if necessary. So I'd suggest dropping it entirely in the parser by using the first 7 chars of that field and ignoring the 8th one.
Or, for completeness, maybe it could be stored in a separate db field by the --timetable operation.

Doing either of these should fix the produced GTFS data in such "multiple passes through the same stop(s)" cases.
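A minimal sketch of that normalization, assuming the location field is left-aligned with the tiploc in characters 1-7 and the suffix in character 8 (the `CifLocation` type and `splitLocation` function are hypothetical names, not part of dtd2mysql):

```typescript
// Hypothetical result type: the tiploc and the suffix as separate values.
interface CifLocation {
  tiploc: string;        // up to 7 characters, padding removed
  suffix: string | null; // single character, null when blank or absent
}

// Splits a CIF location field by position rather than by content,
// so tiplocs that themselves contain digits are left intact.
// Assumes leading characters were not trimmed from the field.
function splitLocation(field: string): CifLocation {
  const tiploc = field.substring(0, 7).trim();
  const suffix = field.substring(7, 8).trim();
  return { tiploc, suffix: suffix === "" ? null : suffix };
}

console.log(splitLocation("HAMWICK2")); // tiploc "HAMWICK", suffix "2"
console.log(splitLocation("KGSTON"));   // tiploc "KGSTON", suffix null
```

Only `tiploc` would then be used for the join, while `suffix` could either be discarded or written to its own column. Because the split is positional, it gives the same result for padded ("KGSTON  ") and right-trimmed ("KGSTON") values.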

So I guess my join fails because it's including the suffix in the tiploc_code. I'm not quite sure how to sort that out at the moment, as the tiploc is variable length. I might check whether there are any tiplocs with numeric values and see if I can strip those out.

Ah, there are tiplocs with numbers in them, but the last character is always reserved for that suffix so I can just take that.