OSM query improvements

Question

Closed this issue 8 years ago · 30 comments

Hi,

I think the current bounding box approach is not ideal. Transit networks can overlap and do not necessarily end at the edges of a rectangular box.

I'd suggest running a query to obtain relations with

type=route
public_transport:version=2
network=XY

(and probably keep the bounding box as an option, just to make the Overpass API query more performant).
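A minimal sketch of what such a query could look like, built as Overpass QL from Python. The function name `build_route_query` and its parameters are hypothetical and not part of the script; the bounding box is passed in Overpass order (south, west, north, east) and stays optional.

```python
def build_route_query(network, bbox=None, pt_version="2"):
    """Return an Overpass QL query for the route relations of one network.

    bbox is an optional (south, west, north, east) sequence. It is only
    used to narrow the search area and is not needed for correctness.
    """
    filters = '["type"="route"]'
    if pt_version is not None:
        filters += '["public_transport:version"="{}"]'.format(pt_version)
    filters += '["network"="{}"]'.format(network)
    if bbox is not None:
        filters += "({},{},{},{})".format(*bbox)
    return "[out:json];relation{};out body;".format(filters)


print(build_route_query("XY"))
```

Passing `pt_version=None` drops the version filter, which keeps it optional for areas where the tag is not maintained yet.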

Does this make sense? I'd be happy to provide a pull request for this, but I think it's best to get feedback first. Thanks!

Answer 1 · 2016-09-17T14:00:55.000Z

Thanks for creating an issue and discussing first before starting to code! :)

You definitely have a point here. However, in my area, the network tag is not properly maintained and would need to be added to all routes first. Do you have an issue with the current bounding box approach in your area?

As you say, we should still keep a bounding box for performance reasons. How about (optionally) allowing a network tag to be added to the queries?

The public_transport:version should also be optional since it is not always well maintained.

Answer 2 · 2016-09-17T14:26:23.000Z

Generally, I think we should encourage people to use proper tagging and stick to standards.

I'm fine with using the network tag optionally in queries. This solves the problem just as well as doing it the other way around.

However, I suggest including public_transport:version=2. Historically, there have been two data structures for public transport in OSM (http://wiki.openstreetmap.org/wiki/Key:public_transport:version) and a lot (!) of inconsistencies. Schema version 2 is a necessary and collaborative effort to create a standard for mapping public transport in OSM. I don't think it's a problem to add it to existing relations. Especially because JOSM needs it to create and verify public transport data, it should be on the relations; if it is missing, they are likely to have a non-standard structure and would likely cause such scripts to fail.

Answer 3 · 2016-09-17T14:44:07.000Z

Alright, do you want to work on adding the optional network tag to the queries?

I agree that we should aim at making public_transport:version=2 mandatory, but doing this now would make this script stop working for me. So unless there's a way to easily batch-edit all my routes, I think it should at least be possible to exclude this from the query.

Also when people start playing with the script, they might have a similar data quality situation and want to assess its feasibility for their use-case, without editing all their routes manually before they can do that. So being able to at least turn it off is probably a good idea.

Answer 4 · 2016-09-17T14:48:26.000Z

Yes, I'm happy to provide a patch. But this is probably not going to happen within the next few days because of the OSM conference marathon: State of the Map and HOT Summit next week. If anybody else works on it before then, that's also good.

Don't worry, with JOSM, it's very easy to add a tag to a bunch of relations at once.

We could also do a little writeup on how people can ensure they have good public transit data in OSM. That would help people improve data quality easily, and we could rely on a common data structure for the script.

Answer 5 · 2016-09-18T09:16:13.000Z

I noticed a hard-coded blacklist in the code (see #11), and this seems to me to be a good example of why we want to improve the queries rather than only selecting by bounding box.

I think we can be even more flexible and introduce, in the query part of the configuration file (see #3), a dictionary of tags to respect in the query. This could then be network=SOMETHING or even some custom tagging like blacklisted=yes or whatever people use to select the routes they need.

So it'd look like this, with the bbox being required and all tags optional.

{
    "query": {
        "bbox": {
            "e": "-48.2711",
            "n": "-27.2155",
            "s": "-27.9410",
            "w": "-49.0155"
        },
        "tags": {
            "public_transport:version": "2",
            "network": "BR-Floripa"
        }
    }
}
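To illustrate how a configuration of this shape could be turned into query filters, here is a small sketch. The helper name `query_filters` is hypothetical and not taken from the script; it assumes the `query` section has exactly the layout proposed above, treats `bbox` as required and `tags` as optional.

```python
import json

EXAMPLE_CONFIG = """
{
    "query": {
        "bbox": {"e": "-48.2711", "n": "-27.2155", "s": "-27.9410", "w": "-49.0155"},
        "tags": {"public_transport:version": "2", "network": "BR-Floripa"}
    }
}
"""


def query_filters(config):
    """Build the Overpass QL filter string from the "query" config section."""
    query = config["query"]
    bbox = query.get("bbox")
    if bbox is None:
        raise ValueError("query.bbox is required")
    missing = [key for key in ("s", "w", "n", "e") if key not in bbox]
    if missing:
        raise ValueError("query.bbox is missing: " + ", ".join(missing))

    # Tags are optional; an absent or empty dictionary adds no tag filter.
    tags = query.get("tags", {})
    tag_filter = "".join(
        '["{}"="{}"]'.format(key, value) for key, value in sorted(tags.items())
    )
    # Overpass expects the bounding box as (south, west, north, east).
    bbox_filter = "({s},{w},{n},{e})".format(**bbox)
    return tag_filter + bbox_filter


print("relation" + query_filters(json.loads(EXAMPLE_CONFIG)) + ";")
```

Because the tags are copied into the filter as they are, any key works here, including custom ones such as a route type.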

Answer 6 · 2016-09-18T16:55:35.000Z

I like the idea with the query tag dictionary.

Answer 7 · 2016-09-18T17:33:15.000Z

Just a note: As we already have a case with blacklisted routes (#11), which have a fixme=* tag, we should also consider including tags to exclude. Or - to keep it simple - just exclude by default all routes that have such a tag.

Answer 8 · 2016-09-18T18:21:08.000Z

Let's keep it simple and just hardcode the fixme tag. What we could do, though, is not filter by fixme in the query itself, but still fetch those routes, skip them, and print a warning for each one saying that it was not included, so users of the script know which routes to fix.
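A sketch of that warn-and-skip idea. The function name `split_fixme_routes` and the input shape (a dictionary mapping relation id to its tags) are assumptions for illustration, not the script's actual data structure.

```python
import logging


def split_fixme_routes(routes):
    """Separate usable routes from routes carrying a fixme tag.

    routes maps a relation id to a dictionary of its OSM tags.
    Returns (usable, skipped) and logs one warning per skipped route.
    """
    usable, skipped = {}, {}
    for relation_id, tags in routes.items():
        if "fixme" in tags:
            logging.warning(
                "Route %s was not included (fixme=%s)", relation_id, tags["fixme"]
            )
            skipped[relation_id] = tags
        else:
            usable[relation_id] = tags
    return usable, skipped
```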

Answer 9 · 2016-09-19T16:26:35.000Z

In my case I'm querying train routes (that's why route_type is a parameter in some methods).

Just to be sure: adding the "route_type" key/value to config[query][tags] should work?

Answer 10 · 2016-09-19T16:29:26.000Z

Yes, the idea is to allow all kinds of tags in the query and not be limited to any special use case. Once it's implemented, it will allow you to query for train routes as well.

Answer 11 · 2016-09-19T16:55:15.000Z

Maybe it would be a good idea to leave a message in an issue when you start working on it, to prevent two people from working on the same issue at the same time.

Answer 12 · 2016-09-19T17:44:34.000Z

I haven't started yet. It looks easy to do. I can do it later or tomorrow. If anybody else wants to start before then, just say so here. Thanks!

Answer 13 · 2016-09-19T18:59:31.000Z

I'm going to start with this in a branch called "issue-2" on my fork.

Answer 14 · 2016-09-19T21:43:27.000Z

I haven't finished this but can't continue right now. I committed to https://github.com/jamescr/osm2gtfs/tree/issue-2, which contains:

Updated the OSM query in the methods get_routes and refresh_route. I tried to keep the original params. (I still haven't updated the methods get_route_masters and get_stops_of_route.)
The fenix.json.example file was updated with the query tags.

Note: the value for the query tag "route_type" is passed as a param to the querying methods. I haven't changed that for now, just to have some key/value that reduces the returned relations in case the tags param is empty. What if we used the "public_transport:version": "2" key/value for this purpose of reducing results?

Answer 15 · 2016-09-21T14:56:59.000Z

I have updated the methods get_route_masters and get_stops_of_route. I tested it with the fenix.json config file with no elements in the tags dictionary, because that is how it was working before.

Answer 16 · 2016-09-21T22:29:04.000Z

@xamanu has added tags to all routes, so you should get the same result with these query tags:

        "tags": {
            "public_transport:version": "2",
            "network": "Sim",
            "route": "bus"
        }

Answer 17 · 2016-09-22T14:16:32.000Z

Oh really? That's right, I'll test it with those tags.

Answer 18 · 2016-09-22T20:17:17.000Z

Great, it's working nicely 😄

I'll fix/improve some minor things and create the Pull Request.

Answer 19 · 2016-10-02T12:40:08.000Z

This is related to the refactoring of OsmHelper #31

Answer 20 · 2016-10-02T14:15:10.000Z

I had a look at the work of @jamescr and he basically did two (very good) things, which I hope get into the code soon:

Moving to Overpass QL instead of the currently used, deprecated XML queries
Introducing tags to the queries, as discussed above.

Answer 21 · 2016-10-02T14:20:59.000Z

XML queries are deprecated? Too bad. I found them much easier to read and understand :(

Answer 22 · 2016-10-02T14:30:29.000Z

I expected them to be deprecated, but I actually see there is no formal message anywhere about this. So no, they are not deprecated! Sorry for the confusion. Probably it's just a personal preference then. For me QL is much easier to understand.

Answer 23 · 2016-10-02T15:29:02.000Z

As a next step, I'd like to think about writing better queries. To that end, I made a list of the queries that are currently in the code, so we can think about making them more efficient:

1. Get all routes (variants) from bbox with some tags
2. Get all route masters inside bbox
3. Get all members of found route masters
4. Get stops of specific route by relation id
5. Get one route for refresh

Query number 4 runs repeatedly and is responsible for the "too many requests" exception I was facing so many times. I think we can start thinking about maybe having one query for 1-4 all together, and even the 5th would probably only be a variant with another parameter. Let me get my head around it and come up with a proposal.

And then there is another query that hits the Overpass API very often. But it is very Fenix-specific and I have no real idea how to optimize it:

Get significant names around a bus stop without a name

Answer 24 · 2016-10-02T15:40:05.000Z

If you could really do 1-4 in one single query, that would be awesome! Otherwise you could also do:

Get all routes (variants and masters) within bbox with tags
Get all stops in bbox with tags

It would probably be best to do this right within the existing fenix creators with minimal changes again.

Answer 25 · 2016-10-02T15:41:01.000Z

There should probably be some sort of switch for the stop name finder, so it can be turned off easily. The stops will just have the default no-name then.

Answer 26 · 2016-10-02T21:11:17.000Z

How does this look?

You always have to click on the Run tab on top, ... then wait ... the data can be inspected through the Data and Map tab

Get everything in one: Route variants, their masters, their stops and their geometry (ways and way-nodes)
Get all routes without geometry: Variant and master relations data only.
Get all routes with geometry: Variant and master relations, ways and way-nodes.
Get all stops that are part of selected route relations.

We could also do nice things like this:

All route relations that are not part of any master relation
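The linked queries themselves are not reproduced in this thread, so the following is only a sketch of what an "everything in one" query might look like, not the queries proposed above. The function name `build_combined_query` and the set names (`.routes`, `.masters` and so on) are made up for illustration; the bounding box is a dictionary with `s`, `w`, `n` and `e` keys as in the config example earlier in the discussion.

```python
COMBINED_TEMPLATE = """[out:json][timeout:60];
// route variants inside the bounding box, narrowed by optional tag filters
relation["type"="route"]{tags}({s},{w},{n},{e})->.routes;
// route masters that have at least one of those variants as a member
relation["type"="route_master"](br.routes)->.masters;
// nodes that are direct members of the variants (stops and platforms)
node(r.routes)->.stops;
// ways that are members of the variants, plus their nodes (geometry)
way(r.routes)->.ways;
node(w.ways)->.waynodes;
// return all five sets in a single response
(.routes; .masters; .stops; .ways; .waynodes;);
out body;"""


def build_combined_query(bbox, tag_filter=""):
    """Fill the template with a bbox dict and an optional tag filter string."""
    return COMBINED_TEMPLATE.format(
        tags=tag_filter, s=bbox["s"], w=bbox["w"], n=bbox["n"], e=bbox["e"]
    )


bbox = {"e": "-48.2711", "n": "-27.2155", "s": "-27.9410", "w": "-49.0155"}
print(build_combined_query(bbox, '["route"="bus"]'))
```

Dropping the two geometry lines (and their set names from the final union) would give the lighter "routes without geometry" variant.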

Answer 27 · 2016-10-02T22:32:01.000Z

Unbelievable this query magic! You are clearly a magician! 😸

Can't wait to see those implemented! When this is done and you use this language, it would be nice to have a comment that explains what each line does in case somebody else needs to maintain the query at some point.

Looks like this returns everything we need, so we can even throw the OsmApi dependency away.

I don't know what the best way to implement this is. Maybe you prefer to implement the default creators with that and I'll adapt Fenix to that?

Answer 28 · 2016-10-04T15:06:17.000Z

Good question.

The more I think about it, the more it appears to me to be the root of everything. We probably wouldn't be able to implement the query improvements like this without implementing an overhaul of the data structure #30 and the refactoring of OsmHelper #31. On the other hand, refactoring the data structure and OsmHelper without optimizing the queries would be tedious, as I'm not able to test it thoroughly with the current queries. Simply using cached stuff is not a solution when working on these parts.

Doing it all together would become a bigger pull request, which we generally don't want, but doing it the Fenix way and not generically would cost us a lot of time and nerves (sorry, bloody German saying here). So I'm kind of opting to do the three things in one go, test it properly with Fenix and then merge it.

What do you think? Any other idea? Suggestions?

Answer 29 · 2016-10-05T15:18:29.000Z

OK. I talked with @jamescr today and we think we found a way to keep doing it step by step. So we propose to do a first pull request with the following:

Allow tags from config file for querying (@jamescr already implemented this)
Incorporate the queries that download all data (either in one or two separate queries, which can be taken from the comment above)
The function get_stops_of_route should use already downloaded data (instead of querying again and again) and match info based on osm_id from relation data to stop data
The function get_route_masters can be merged with get_routes
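The third point could be sketched roughly as follows, assuming the data was downloaded once as Overpass JSON (a list of elements, where relations carry a `members` list with `type` and `ref`). The helper names `index_nodes` and `stops_of_route` are hypothetical and do not describe how get_stops_of_route is actually written.

```python
def index_nodes(elements):
    """Map node id to node element for everything downloaded in one go."""
    return {el["id"]: el for el in elements if el["type"] == "node"}


def stops_of_route(route, nodes_by_id):
    """Resolve the node members of a route relation from local data.

    Members that were not downloaded are skipped, and member order is
    kept, so no further Overpass request is needed per route.
    """
    stops = []
    for member in route.get("members", []):
        if member["type"] != "node":
            continue
        node = nodes_by_id.get(member["ref"])
        if node is not None:
            stops.append(node)
    return stops
```

Building the index once and reusing it for every route replaces the repeated per-route request that triggered the "too many requests" errors mentioned earlier.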

Answer 30 · 2016-10-05T15:21:41.000Z

Sounds good. If you can at all split up those 4 points further into separate PRs this would be even better.