transitland/transitland-datastore

Duplicated stops

slvlirnoff opened this issue · 10 comments

Hi all,

I've run into an issue where, for some feeds, the stops seem to get duplicated a lot over time, i.e. every time the feed gets a new version. It seems that all stops are created again.

You can see it here (NZ feed), where this small bounding box contains ~107 stops that represent around two in reality:

https://transit.land//api/v1/stops?bbox=174.87233519554138%2C-36.93862462178051%2C174.88262951374054%2C-36.936056263854915&departure_time=&isochrone_mode=&isochrones_mode=&onestop_id=&pin=-36.93735974847771%2C174.87571125850081&served_by=&per_page=false

After a while this becomes a problem when the data is fetched into valhalla, because a tile reaches max_transit_stop or a similar constraint. Is there a way to delete all 'old entities' that aren't used in any active feed version? I guess valhalla keeps adding these stops because most of them are served by the current route from the latest feed.

Best,
Cyprien

It doesn't seem to be duplicating the stops, but creating platforms with the original stop as parent_stop, for some reason, e.g.:

onestop_id | "s-rckmg7u8cz-businesspde~highbrookdr<2090"
parent_stop_onestop_id | "s-rckmg7u8cz-businesspde~highbrookdr"
onestop_id | "s-rckmg7u8cz-businesspde~highbrookdr<2073"
parent_stop_onestop_id | "s-rckmg7u8cz-businesspde~highbrookdr"

Maybe the stop_id or lat/lon fields in the feed are changing with every feed update?

I think it all comes down to the feed versions after this one

https://api.transit.land/api/v1/feed_version_imports?feed_onestop_id=f-rck-gowest%7Esealinkgroup%7Eatairporter%7Eatmetro%7Ethepartybuscompany&feed_version_sha1=0200ec23a1ed2435dc5a8d76283b008af46d37ce

For some reason, the stop_ids in that feed version, and in all the feed versions that followed it, have dates attached to them. Because the attached date makes the stop_id different in each feed version, every import creates a new stop/platform. At least it seems to be keeping the parent_stop the same.
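To make the pattern concrete, here is a minimal Python sketch that splits such a stop_id into a stable base and a per-version suffix. The regular expression is an assumption inferred from IDs of the form `31781-20171201125012_v60.25` found in this feed, and `split_versioned_stop_id` is a hypothetical helper, not part of Transitland.

```python
import re

# Assumed pattern: <base>-<14-digit timestamp>_v<version>,
# e.g. "31781-20171201125012_v60.25". Other feeds may differ.
VERSIONED_STOP_ID = re.compile(
    r"^(?P<base>.+)-(?P<stamp>\d{14})_v(?P<version>[\d.]+)$"
)


def split_versioned_stop_id(stop_id):
    """Return (base, stamp, version); stamp and version are None without a suffix."""
    match = VERSIONED_STOP_ID.match(stop_id)
    if match is None:
        return stop_id, None, None
    return match.group("base"), match.group("stamp"), match.group("version")
```

If the base part is the same across feed versions while the suffix changes, that would explain why each import produces a new stop/platform.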

So, I would say it's a bug in the feed and not in the transitland datastore software, which seems to be working as intended, IMO. Probably the only problem in transitland was importing those feeds when it shouldn't have, because they have a problem (dates attached to stop_ids, though maybe there's a reason for those dates being there, who knows?).

Thanks for investigating, @Rui-Santos. Indeed, you can see the stop ID scheme here (in the archive details): https://transit.land/dispatcher/feed-versions/7e1609712ced7f7ecded0fe7deefdc6bddb68497 or here https://transit.land/dispatcher/feed-versions/270399f70f59764a945b6a063a3e7a4dae53c05d (not sure why your link opens a Caltrain version).

What may be a problem for me is that I would expect providers, stops, routes and schedules that are not referenced in any active version of a feed not to be returned by the API. In this case, even though the feed doesn't follow the best practice of keeping stop_id consistent from one version to the next, old stops generated from previous feed versions that aren't the active feed version should be filtered out or deleted. Why does the api keep returning these?

Why does the api keep returning these?

Probably has something to do with this, which is at the link below

The FeedMaintenance service within Transitland Datastore automatically decides when to import a newly fetched feed version. If no new feed version is available when existing ScheduleStopPairs are about to expire, the FeedMaintenance service will extend them into the future.

https://transit.land/documentation/datastore/feeds.html#active-feed-version

I don't think that it is related.

What I meant is this: take feed X with fetched feed versions 1, 2 and 3, all of them properly imported. Stops and routes imported through an earlier feed version (i.e. not from the active feed version) shouldn't be returned by the API.

In this case, if versions 1, 2 and 3 had the stops stopA.date-1, stopA.date-2 and stopA.date-3, I would expect only stopA.date-3 to be returned by the API. Otherwise it's a bit complex for the client to filter out old stops and routes.
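As an illustration of what that client-side filtering would involve, here is a rough Python sketch. It assumes the onestop IDs carry a `~<14-digit timestamp>v<digits>` suffix, as the ones returned for this feed do, and it only compares names; it does not check which feed version is actually active. `latest_per_base` is a hypothetical helper.

```python
import re

# Assumed suffix: "~<14-digit timestamp>v<digits>", as seen on this feed's IDs.
SUFFIX = re.compile(r"^(?P<base>.+)~(?P<stamp>\d{14})v\d+$")


def latest_per_base(onestop_ids):
    """Keep, for each base ID, only the entry with the newest timestamp suffix."""
    latest = {}
    for onestop_id in onestop_ids:
        match = SUFFIX.match(onestop_id)
        base = match.group("base") if match else onestop_id
        # IDs without a suffix get an empty stamp, so they sort as oldest.
        stamp = match.group("stamp") if match else ""
        if base not in latest or stamp > latest[base][0]:
            latest[base] = (stamp, onestop_id)
    return sorted(onestop_id for _, onestop_id in latest.values())
```

Having to reverse-engineer a feed-specific naming scheme like this is exactly the kind of work I would rather not push onto every API client.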

Looking at the link you posted originally, all 44 stops in there (didn't you say there were 107?) have the same updated_at date, which is the date of the active feed version. Is the datastore updating the entities even when they are not present in the feed? That would be a bug. Or are those stops still referenced in the feed, and is that the reason they are being updated instead of just dropped? I can confirm that they are not in stops.txt, and in the API response some of them have empty operators_serving_stop and routes_serving_stop.

Indeed 44, sorry! 107 was on my local setup of the datastore (I have more of the feed versions imported in total).

This bounding box is easier to analyse:

http://transit.land/api/v1/stops?bbox=174.89569187164307%2C-36.95153799103548%2C174.90599155426023%2C-36.946779334738444&departure_time=&isochrone_mode=&isochrones_mode=&onestop_id=&pin=-38.187466178077905%2C175.14507830142975&served_by=&per_page=false

Here's the list of stop IDs:

 s-rckmgctrn1-ladyrubydr~barmacpl
 s-rckmgctrn1-ladyrubydr~barmacpl<2399
 s-rckmgctrn1-ladyrubydr~barmacpl<2082
 s-rckmgctrn1-ladyrubydr~barmacpl<2399~20170925130042v5824
 s-rckmgctrn1-ladyrubydr~barmacpl<2082~20170925130042v5824
 s-rckmgctrn1-ladyrubydr~barmacpl<2399~20170918164808v5816
 s-rckmgctrn1-ladyrubydr~barmacpl<2082~20170918164808v5816
 s-rckmgctrn1-ladyrubydr~barmacpl<2399~20170928152758v593
 s-rckmgctrn1-ladyrubydr~barmacpl<2082~20170928152758v593
 s-rckmgctrn1-ladyrubydr~barmacpl<2082~20171003151834v597
 s-rckmgctrn1-ladyrubydr~barmacpl<2399~20171003151834v597
 s-rckmgctrn1-ladyrubydr~barmacpl<2082~20171013114012v5918
 s-rckmgctrn1-ladyrubydr~barmacpl<2399~20171013114012v5918
 s-rckmgctrn1-ladyrubydr~barmacpl<2082~20171027130346v5940
 s-rckmgctrn1-ladyrubydr~barmacpl<2399~20171027130346v5940
 s-rckmgctrn1-ladyrubydr~barmacpl<2399~20171114122843v6013
 s-rckmgctrn1-ladyrubydr~barmacpl<2082~20171114122843v6013
 s-rckmgctrn1-ladyrubydr~barmacpl<2399~20171113160906v6012
 s-rckmgctrn1-ladyrubydr~barmacpl<2082~20171113160906v6012
 s-rckmgctrn1-ladyrubydr~barmacpl<2399~20171201125012v6025
 s-rckmgctrn1-ladyrubydr~barmacpl<2082~20171201125012v6025

And the list of parent stop IDs:

null
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
s-rckmgctrn1-ladyrubydr~barmacpl
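
A quick way to see this grouping is to count stops per parent. This is only a sketch: it assumes the stops are available as a list of dicts with the `onestop_id` and `parent_stop_onestop_id` fields quoted earlier in this thread, and `platforms_per_parent` is a hypothetical helper.

```python
from collections import Counter


def platforms_per_parent(stops):
    """Count stops by parent_stop_onestop_id (None for top-level stops)."""
    return Counter(stop.get("parent_stop_onestop_id") for stop in stops)
```

For the two lists above this would give one entry under None (the parent itself) and 20 under s-rckmgctrn1-ladyrubydr~barmacpl.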

Here's what I found for this stop in the zip of one of the latest feed versions (https://transitland-gtfs.s3.amazonaws.com/datastore-uploads/feed_version/7e1609712ced7f7ecded0fe7deefdc6bddb68497.zip):

stop_lat  ,zone_id   ,stop_lon  ,stop_id                     ,parent_station ,stop_desc ,stop_name              ,location_type,stop_code
-36.94856 ,merged_27 ,174.89883 ,31781-20171201125012_v60.25 ,               ,          ,Lady Ruby Dr/Barmac Pl ,1            ,31781    
-36.94856 ,merged_26 ,174.89883 ,31781-20171113160906_v60.12 ,               ,          ,Lady Ruby Dr/Barmac Pl ,1            ,31781    

These two stops could correspond to the last 4 stops returned by the API. I fail to see why they translate to 4 stops and two parent stops (based on stops.txt), but at least the other ~16 stops shouldn't be returned anymore. Maybe since they refer to the same parent stop they don't get disabled/deleted.
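
For anyone who wants to check other stops in the archive, here is a small Python sketch that groups stops.txt rows by stop_code to surface stops that appear under several versioned stop_ids. It assumes stop_code stays stable across versions, which holds for the two rows above but is not guaranteed by GTFS; `group_by_stop_code` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# The two rows quoted above, without the column padding.
SAMPLE = """stop_lat,zone_id,stop_lon,stop_id,parent_station,stop_desc,stop_name,location_type,stop_code
-36.94856,merged_27,174.89883,31781-20171201125012_v60.25,,,Lady Ruby Dr/Barmac Pl,1,31781
-36.94856,merged_26,174.89883,31781-20171113160906_v60.12,,,Lady Ruby Dr/Barmac Pl,1,31781
"""


def group_by_stop_code(stops_txt):
    """Map each stop_code to the list of stop_id values that share it."""
    reader = csv.DictReader(io.StringIO(stops_txt), skipinitialspace=True)
    groups = defaultdict(list)
    for row in reader:
        # Strip padding so column-aligned input still parses.
        clean = {key.strip(): (value or "").strip() for key, value in row.items()}
        groups[clean["stop_code"]].append(clean["stop_id"])
    return dict(groups)


duplicates = {code: ids for code, ids in group_by_stop_code(SAMPLE).items() if len(ids) > 1}
```

Any stop_code left in `duplicates` has more than one stop_id in the same feed version, as is the case for 31781 here.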

And indeed they have all been updated on the same date, 2018-02-27, but this is not the feed import date (I guess it might relate more to the last osm_way_id match).

Maybe since they refer to the same parent stop they don't get disabled/deleted.

I think you may be on to something here.

And indeed it seems the updated_at date is the osm_way conflation date, because it's one day after the feed version import date, in the ones I verified.

@irees Do you have any ideas here? I'm not sure where to look further or how to address it.
I've tried destroying the feed locally, but all stops/routes are still present in the responses.

The route going through these stops also seems to reference every single one of them (in stop_served_by).