Unnecessary merge in _stops_agencyid()?
kuanb opened this issue · 2 comments
In _stops_agencyid
, a merge is made with one dataframe and another that is only one column wide. Since that column already exists in the left merged dataframe output, this step should not happen (or, it is missing another 2nd column, that should be in there!).
Happy to submit a PR to fix but wanted to check first as to what the intent was with this merge.
Location: https://github.com/UDST/urbanaccess/blob/73fcab917a3c044a5360836237910fcdd81da05e/urbanaccess/gtfs/utils_format.py#L300
This happens in a few of these helper functions, as well, such as:
https://github.com/UDST/urbanaccess/blob/73fcab917a3c044a5360836237910fcdd81da05e/urbanaccess/gtfs/utils_format.py#L241
in _calendar_agencyid
Which makes me wonder if the intent is to account for/capture column values where, after the left join, there are nulls? If so, I assume that is the purpose of the null check and update here:
https://github.com/UDST/urbanaccess/blob/73fcab917a3c044a5360836237910fcdd81da05e/urbanaccess/gtfs/utils_format.py#L478
Thank you Kuan, good question. You are correct on your second assumption. The extra merge is intended to account for stop_ids that after the previous merges drop off due to edge cases in GTFS feeds where some stops are not uniformly utilized through the entire series of feed files. This way we account for all stops regardless of their utilization across GTFS files. Nonetheless it is useful for a user to keep those stops for accounting purposes.
I will close this issue, but happy to reopen if there is an issue.
Thanks for the clarification :)