Improve the way `linear_tags` and `area_tags` work
dagnelies opened this issue · 11 comments
First of all, thanks for maintaining this great tool.
The default behavior is to export "closed ways" as both a `LineString` and a `MultiPolygon`. This creates a lot of duplication by default. For instance, every little building is a "closed way" and will be present in the export twice, as a `LineString` and as a `MultiPolygon`, which inflates the data considerably. Although `building` is the most obvious tag and leads to the most duplication, other tags suffer from the same issue: highways, barriers...
After digging in the docs, you may think: "Ok, I can use the `linear_tags` and `area_tags` options for that." But it doesn't work out. The way they work now is to act as a "filter".
If you specify:
{
"linear_tags": ["highway"],
"area_tags": ["building"]
}
You will lose all other ways/areas that do not have one of the two tags.
IMHO it would be better to leave the filtering to the `include_tags`/`exclude_tags` options only.
This would imply altering the behavior of `linear_tags`/`area_tags`:
- if it's in `linear_tags`, output it as `LineString`
- if it's in `area_tags`, output it as `MultiPolygon`
- if it's in neither, output it as both (instead of none) <= the big change
The meaning of `true`/`false`/`null` would remain unaffected.
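To make the proposal concrete, here is a minimal Python sketch of the decision for a closed way. It is only a model of the behavior described in this issue, not Osmium's code: the function names and the `unlisted_as_both` switch are hypothetical, the expression matching (`key`, `key=value`, `key!=value`) is a simplification, and the `area=yes`/`area=no` check and the `true`/`false`/`null` settings are left out.

```python
from typing import Dict, List, Set


def matches(tags: Dict[str, str], expressions: List[str]) -> bool:
    """Return True if any expression matches the tags (simplified matching)."""
    for expr in expressions:
        if "!=" in expr:
            # "key!=value": the key is present with a different value
            key, value = expr.split("!=", 1)
            if key in tags and tags[key] != value:
                return True
        elif "=" in expr:
            # "key=value": the key is present with exactly this value
            key, value = expr.split("=", 1)
            if tags.get(key) == value:
                return True
        elif expr in tags:
            # "key": the key is present with any value
            return True
    return False


def geometries_for_closed_way(
    tags: Dict[str, str],
    linear_tags: List[str],
    area_tags: List[str],
    unlisted_as_both: bool,
) -> Set[str]:
    """Return the geometry types a closed way would be exported as."""
    out: Set[str] = set()
    if matches(tags, linear_tags):
        out.add("LineString")
    if matches(tags, area_tags):
        out.add("MultiPolygon")
    if not out and unlisted_as_both:
        # Proposed change: unlisted ways are exported as both instead of dropped.
        out = {"LineString", "MultiPolygon"}
    return out


linear = ["highway"]
area = ["building"]
park = {"leisure": "park"}

# Behavior as described in this issue: the unlisted way disappears.
print(geometries_for_closed_way(park, linear, area, unlisted_as_both=False))
# Proposed behavior: the unlisted way is kept as both geometry types.
print(geometries_for_closed_way(park, linear, area, unlisted_as_both=True))
```

With `unlisted_as_both=False` the park yields an empty set (it is filtered out); with `True` it yields both geometry types, while buildings and highways are exported once in either case.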
Just for reference, here is the current docs regarding area handling:
For a closed way (with the last node location the same as the first node location) the tags are checked: If the way has an area=yes tag, an area is created. If the way has an area=no tag, a linestring is created. An area tag with a value other than yes or no is ignored. The configuration settings area_tags and linear_tags can be used to augment the area check. If any of the tags matches the area_tags, an area is created. If any of the tags matches the linear_tags, a linestring is created. If both match, an area and a linestring is created. This is important because some objects have tags that make them both, an area and a linestring.
Also, perhaps adding the `building` tag to `area_tags` would be a sensible default, since it makes up a very substantial part of the duplication.
The whole linear vs area thing is complex and can't really be solved with simple lists of tags. So Osmium can only solve some rather simple use cases here, if you need something more I suggest using something like osm2pgsql which has a complete configuration language built in which allows you much more freedom.
Because we need to keep backwards compatibility, any kind of change also has to be considered carefully. So if we want to change this at all, it has to be done in a way that lets old configs keep doing the same thing.
Indeed, my suggestion would affect backwards compatibility, so I understand the reluctance, even though I still find the change really meaningful: both because it's slightly unintuitive that these options act as filters too, and because they are impractical to use. You cannot meaningfully decide which tag should be what without losing all other unlisted tags. This is quite harsh and makes these options impractical to use... I honestly wonder if people use them.
There is currently no way to avoid large data duplication. We are talking about a lot of duplicated data here: for example, ~60% of ways are buildings and are duplicated. But we cannot de-duplicate them without losing the other tags as a side effect. :/
For full backwards compatibility, indeed another option would be required.
{
"linear_tags": ["highway"],
"area_tags": ["building"],
"include_unlisted_tags_as_both": true
}
...but that would make usage slightly awkward IMHO.
By the way, in the worst case the side effect of the "breaking change" would be that less data is filtered than before, and the fix would be to simply add the list to the `include_tags` option. (Edit: just tested it, would work as expected)
The result would also be more intuitive IMHO, since `area_tags`/`linear_tags` would strictly be responsible for how geometry is handled, while `include_tags`/`exclude_tags` would strictly be for filtering output, instead of the current situation where `area_tags`/`linear_tags` implicitly also act as `include_tags`.
It's your call. I just wanted to state my point of view as a user.
The `include_tags`/`exclude_tags` options do a different thing. They do not filter objects, only those specific tags. They are used for getting rid of tags such as `source`, which most people don't need and which would otherwise clutter up the output. But they don't prevent the object with those tags from being written out.
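The distinction can be illustrated with a small Python sketch. It only models the behavior described in the comment above, it is not how Osmium implements it, and the helper name `strip_tags` and the sample way are made up for illustration.

```python
from typing import Dict, Set


def strip_tags(tags: Dict[str, str], exclude_keys: Set[str]) -> Dict[str, str]:
    """Return a copy of the tags without the excluded keys."""
    return {key: value for key, value in tags.items() if key not in exclude_keys}


# A hypothetical way carrying a tag that most people don't need.
way = {"id": 1, "tags": {"building": "yes", "source": "survey"}}

# Excluding "source" removes that tag, but the way itself is still written out.
exported = {"id": way["id"], "tags": strip_tags(way["tags"], {"source"})}
print(exported)
```

The object survives with its remaining tags, which is why these options cannot stand in for the filtering that `linear_tags`/`area_tags` do.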
What most people probably want is to set `area_tags` to some list of tags and then set `linear_tags` to `null`. This way you get all data as either area or linear with no duplication and nothing filtered out.
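As a rough sketch, such a config could be generated from Python. The tag list here is purely illustrative (choosing the right list is up to you), and only the two options discussed in this thread are shown; Python's `None` is serialized as JSON `null`.

```python
import json

# Illustrative values only: the area_tags list is an assumption, not a recommendation.
config = {
    "area_tags": ["building", "landuse"],
    "linear_tags": None,  # becomes null in the JSON output
}

text = json.dumps(config, indent=2)
print(text)
```

The printed JSON can then be merged into an existing export config file.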
> What most people probably want is to set `area_tags` to some list of tags and then set `linear_tags` to `null`. This way you get all data as either area or linear with no duplication and nothing filtered out.
That would be a possibility... but it's kind of tricky to find out what the list should be.
For the sake of completeness, here is the most common configuration found in various repositories:
"linear_tags": ["highway", "barrier", "natural=coastline"],
"area_tags": ["aeroway", "amenity", "building!=no", "landuse", "leisure", "man_made", "natural!=coastline"],
I do not claim it is ideal or perfect. It is merely the result of a GitHub search across repositories for such configurations, showing what people use in practice right now. It is indeed a very rough approximation.
That said, even if it is only an approximation to avoid duplicated data, I see the "misinterpretation" of areas/ways as the smaller issue. The bigger issue IMHO is that, the way people currently use these options, all ways where none of the keys/tags are in the lists are simply removed.
Thanks for the ideas @dagnelies, but we are keeping the current behaviour. Closing here.
Ok. Just for the sake of completeness, here is what I used in the end for my project to distinguish between ways and polygons:
"linear_tags": ["highway", "natural=coastline", "waterway", "barrier", "wall", "footway", "bridge", "tunnel", "railway", "power", "crossing","area=no"],
"area_tags": ["building", "surface", "landuse", "natural!=coastline", "amenity", "leisure", "water", "parking", "sport", "crossing", "golf","area!=no", "boundary", "wetland"],
This is likely not 100% perfect either, but it should roughly keep most features while cutting down a sizeable number of duplicates.