OSM query improvements
Closed this issue · 30 comments
Hi,
I think the current bounding box approach is not ideal. Transit networks can overlap and do not end necessarily at squared boxes.
I'd suggest to run a query to obtain relations with
- type=route
- public_transport:version=2
- network=XY
(and probably maintain the bounding box optional to just make the overpass api query more performant).
Does this make sense? I'd happy to provide a pull request for this. But I also think it's best to get feedback before. Thanks!
Thanks for creating an issue and discussing first before starting to code! :)
You definitively have a point here. However, in my area, the network
tag is not properly maintained and would need to be added to all routes first. Do you have an issue with the current bounding box approach in your area?
Like you say, we should still keep a bounding box for performance reasons. How about allowing to (optionally) also add a network
tag to the queries?
The public_transport:version
should also be optional since it is not always well maintained.
Generally, I think we should engage people towards using properly tagging and sticking to standards.
I'm fine with using the network
tag optionally on queries. This works for the problem just as good as doing it the other way around.
Although the public_transport:version=2
I suggest to include. Historically, In OSM there are two data structures for public transport (http://wiki.openstreetmap.org/wiki/Key:public_transport:version) and a lot (!) of inconsistencies. The schema version two is an necessary and collaborative effort to create a standard on how to map public transport om OSM. I don't think it's a problem to include it to existing relations. Especially because JOSM needs it to create and verify public transport data, it should be on the relations, if not they are likely to have a non-standard structure and would likely cause such scripts to fail.
Alright, do you want to work on adding the optional network
tag to the queries?
I agree that we should aim at making public_transport:version=2
mandatory, but I doing this now would make this script stop working for me. So unless there's a way to easily batch-edit all my routes, I think it should be at least possible to exclude this from the query.
Also when people start playing with the script, they might have a similar data quality situation and want to assess its feasibility for their use-case, without editing all their routes manually before they can do that. So being able to at least turn it off is probably a good idea.
Yes, I'm happy to provide a patch. But probably this not going to happen within the next days, because of the OSM conference marathon: State of the Map and HOT Summit next week. If anybody else works on it before, that's also good.
Don't worry, with JOSM, it's very easy to add a tag to a bunch of relations at once.
We could also do a little writeup on how people can assure to have good public transit data in OSM. That would help people to improve data quality easily and we could rely on a common ground of data structure for the script.
I observed a hard-coded backlist in code (see #11) and this seems to me to be a good example of why we want to improve the queries, not only by selecting by bounding box.
I think we can be even more flexible and introduce in the query
part of the configuration file (see #3) a dictionary of tags to respect in the query. This then could be network=SOMETHING
or even some custom tagging like blacklisted=yes
or whatever people use to select the routes they need.
So, I'd look like this, with the bbox being required and all tags to be optional.
{
"query": {
"bbox": {
"e": "-48.2711",
"n": "-27.2155",
"s": "-27.9410",
"w": "-49.0155"
}
"tags": {
"public_transport:version": "2",
"network": "BR-Floripa"
}
}
}
I like the idea with the query tag dictionary.
Just a note: As we already have a case with blacklisted routes (#11), which have a fixme=*
tag, we should also consider on including tags to exclude. Or - to keep it simple - just exclude by default all routes that have such a tag.
Let's keep it simple and just hardcode the fixme tag. What we could do though is to not exclude by fixme
from the query, but include those routes in the query and print warnings for them and say that they were not included, so users of the script know which routes to fix.
In my case I'm querying train route (that's why route_type is a parameter in some methods).
Just to be sure, adding the "route_type" key/value to the config[query][tags] should work?
Yes, the idea is to allow all kinds of tags to use for the query and not being limited to any special use case. Once it's implemented it will allow you to query for train routes as well.
Maybe it would be a good idea that you leave a message in an issue that you start working on to prevent that two people work on the same issue at the same time.
I haven't started yet. It looks to be easy to do. I can do it later or tomorrow. If anybody else wants to start before. Just tell here. Thanks!
I'm going to start with this in a branch called "issue-2" on my fork.
I haven't finish with this but can't continue right now, I committed to https://github.com/jamescr/osm2gtfs/tree/issue-2 which contains:
- update osm query on the methods: get_routes and refresh_route. I try to keep the original param. (still haven't update the methods get_route_masters and get_stops_of_route)
- fenix.json.example file was update with the query tags.
note: the value for the query tag "route_type" is passed as param in the querying methods. I haven't changed that by now just to "have" some key/value that reduce the returned relations in case that the tags param are empty. What about if we use the "public_transport:version" : "2" key/value for this reducing results purpose?
I have updated the methods get_route_masters and get_stops_of_route. I test it with fenix.json config file with no elements on the tags dictionary because that is how it was working before.
@xamanu has added tags to all routes, so you should get the same result with these query tags:
"tags": {
"public_transport:version": "2",
"network": "Sim",
"route": "bus"
}
Oh really, That's right, I'll test it with those tags.
Great, it's working nice 😄
I'll fix/improve some minor things and create the Pull Request.
I had a look on the work of @jamescr and he basically did two (very good) things. which I hope get into the the code soon:
- Moving to Overpass' QL instead from currently used, deprectaded XML queries
- Introducing tags, as discussed above to the queries.
XML queries are deprecated? Too bad. I found them much easier to read and understand :(
I expected them to be deprecated, but I actually see there is no formal message anywhere about this. So no, they are not deprecated! Sorry for the confusion. Probably it's just a personal preference then. For me QL is much easier to understand.
As a next step, I'd like to think about doing better queries. Therefore I made a list of queries that are in code currently and then we can think about being more efficient on them:
- Get all routes (variants) from bbox with some tags
- Get all route masters inside bbox
- Get all members of found route masters
- Get stops of specific route by relation id
- Get one route for refresh
Query number 4 runs repeatedly and is responsibly for the "too many requests" exception I was facing so many times. I think we can start thinking about maybe having one query for 1-4 all together and even the 5th would probably only be a variant with another parameter. Let me get my head around it and come up with a proposal.
And then there is another query, that hits the Overpass API very often. But this is very Fenix specific and I have no real idea on how to optimize this:
- Get significant names around a bus stop without a name
If you could really do 1-4 in one single query that would be awesome! Otherwise you also could do:
- Get all routes (variants and masters) within bbox with tags
- Get all stops in bbox with tags
It would probably be best to do this right within the existing fenix creators with minimal changes again.
There should probably be some sort of switch for the stop name finder, so it can be turned off easily. The stops will just have the default no-name then.
How does this look like?
You always have to click on the Run
tab on top, ... then wait ... the data can be inspected through the Data
and Map
tab
- Get everything in one: Route variants, their masters, their stops and their geometry (ways and way-nodes)
- Get all routes without geometry: Variant and master relations data only.
- Get all routes with geometry: Variant and master relations, ways and way-nodes.
- Get all stops that are part of selected route relations.
We could also do nice things like this:
Unbelievable this query magic! You are clearly a magician! 😸
Can't wait to see those implemented! When this is done and you use this language, it would be nice to have a comment that explains what each line does in case somebody else needs to maintain the query at some point.
Looks like this returns everything we need, so can even throw the OsmApi dependency away.
I don't know what's the best way to implement this. Maybe you prefer to implement the default creators with that and I'll adapt Fenix to that?
Good question.
The more I think about it the more it appears to me to be the root of everything. We probably wouldn't be able to implement the query improvements like this without implementing an overhaul of the data structure #30 and the refactoring the OsmHelper #31. On the other side doing refactoring of data structure and OsmHelper without optimization of the query would be tired as I'm not able to test it thoroughly with the current queries. Simply using cached stuff is not a solution when working on these parts.
Doing all together would become a bigger pull request, which we generally don't want, but doing it the Fenix way and not generic would cost us a lot of time and nerves (sorry, bloody German saying here). So I'm kind of opting to do the three things in one rush, test it properly with Fenix and then merge it.
What do you think? Any other idea? Suggestions?
ok. I talked with @jamescr today and we think we found a way to keep on doing it step by step. So we propose to do one first pull request with the following:
- Allow tags from config file for querying (@jamescr already implemented this)
- Incorporate the queries that download all data (either in one or two separate queries, which can be taken from the comment above)
- The function
get_stops_of_route
should used already downloaded data (instead of querying again and again) and match info based onosm_id
from relation data to stop data - The function
get_route_masters
can be merged withget_routes
Sounds good. If you can at all split up those 4 points further into separate PRs this would be even better.