Unnecessary merge in _stops_agencyid()?

Question

Unnecessary merge in _stops_agencyid()?

kuanb opened this issue 7 years ago · 2 comments

In _stops_agencyid, a merge is made with one dataframe and another that is only one column wide. Since that column already exists in the left merged dataframe output, this step should not happen (or, it is missing another 2nd column, that should be in there!).

Happy to submit a PR to fix but wanted to check first as to what the intent was with this merge.

Location: https://github.com/UDST/urbanaccess/blob/73fcab917a3c044a5360836237910fcdd81da05e/urbanaccess/gtfs/utils_format.py#L300

This happens in a few of these helper functions, as well, such as:
https://github.com/UDST/urbanaccess/blob/73fcab917a3c044a5360836237910fcdd81da05e/urbanaccess/gtfs/utils_format.py#L241
in _calendar_agencyid

Which makes me wonder if the intent is to account for/capture column values where, after the left join, there are nulls? If so, I assume that is the purpose of the null check and update here:
https://github.com/UDST/urbanaccess/blob/73fcab917a3c044a5360836237910fcdd81da05e/urbanaccess/gtfs/utils_format.py#L478

Answer 1 · 2017-05-08T22:54:40.000Z

Thank you Kuan, good question. You are correct on your second assumption. The extra merge is intended to account for stop_ids that after the previous merges drop off due to edge cases in GTFS feeds where some stops are not uniformly utilized through the entire series of feed files. This way we account for all stops regardless of their utilization across GTFS files. Nonetheless it is useful for a user to keep those stops for accounting purposes.

I will close this issue, but happy to reopen if there is an issue.

Answer 2 · 2017-05-08T23:00:52.000Z

Thanks for the clarification :)