pelias/pbf2json

OSM Multipolygons not included in exports

Closed this issue · 19 comments

First of all, thanks for putting this tool together! I've found there are not many great tools out there for denormalizing and exporting OSM polygon data and this is one of the better ones I've found.

Recently I was using pbf2json to export some large OSM files to json and I found that it seems to omit multipolygon relations from the denormalized output.

Here's an example PBF file which I have been testing with: https://www.dropbox.com/s/9gbqm922jsbzxnl/dc_sample.pbf?dl=0

This was exported from the following OSM api query:

wget -O ~/Downloads/dc_sample.osm "http://api.openstreetmap.org/api/0.6/map?bbox=-77.04935073852539,38.89196809844948,-77.03342914581299,38.90265604620856"

And here is an example of an OSM relation which would hopefully get included when exporting this file against the building tag: http://www.openstreetmap.org/relation/1029285

(For example here's the command I am using to export):

pbf2json -tags="building" dc_sample.pbf > dc_sample.json

I'd love to discover that I'm doing something incorrectly here and it is actually possible to retrieve these.

Unfortunately I suspect the root issue here may have to do with the somewhat complicated way in which OSM relations have to be processed. I've been researching this problem a bit myself and while I'm still not sure I totally have a handle on it I found this repo to have some helpful info in understanding the issue: https://github.com/osmlab/fixing-polygons-in-osm.

Thanks again for the work you have put into this project. Let me know if I can provide any more useful information!

hey @worace, thanks for the kind comments :)

the issue with relations is their complexity, the OSM model is a 'normalized model' and so it's great for storing in a relational database but tricky to handle in a stream.

a single pass over the file is unfortunately impossible, considering the model is:

  • node: is a single point
  • way: has two or more node references
  • relation: has one or more member relations / ways or nodes

and the order of elements in the file is: nodes, ways, relations

so what I'm doing here is running over the file once and storing all the nodes in a temporary database (the leveldb) and then when getting to the ways, I can just look up those coords from the leveldb and do a little work to assemble the way.
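
To make the node-caching step concrete, here is a minimal Python sketch. It is not the pbf2json code: the element dicts, the `denormalize` name and the in-memory dict standing in for leveldb are all assumptions made for illustration.

```python
# Toy OSM elements in file order: nodes first, then ways.
elements = [
    {"type": "node", "id": 1, "lat": 38.90, "lon": -77.04},
    {"type": "node", "id": 2, "lat": 38.90, "lon": -77.03},
    {"type": "node", "id": 3, "lat": 38.89, "lon": -77.03},
    {"type": "way", "id": 10, "refs": [1, 2, 3, 1], "tags": {"building": "yes"}},
]


def denormalize(elements):
    """Stream the elements, caching node coordinates and resolving way refs."""
    node_store = {}  # hypothetical stand-in for leveldb: node id -> (lat, lon)
    for el in elements:
        if el["type"] == "node":
            node_store[el["id"]] = (el["lat"], el["lon"])
        elif el["type"] == "way":
            # Nodes precede ways in the file, so each ref should already be cached.
            coords = [node_store[ref] for ref in el["refs"] if ref in node_store]
            yield {"id": el["id"], "type": "way", "tags": el["tags"], "nodes": coords}


ways = list(denormalize(elements))
```

The same trick does not carry over to relations without also caching the ways, which is where the storage cost starts to grow.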

a way can be either open (like a road) or closed (like a building) and so there are two centre point algorithms in this repo to handle computing those two cases.
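
As a rough illustration of why the two cases need different handling, here is a Python sketch. These are not necessarily the algorithms used in this repo: it assumes planar `(lon, lat)` tuples, an area-weighted centroid for closed ways and the halfway point along the line for open ways, and all function names are hypothetical.

```python
import math


def is_closed(coords):
    """A way is treated as closed when it has 4+ points and ends where it starts."""
    return len(coords) >= 4 and coords[0] == coords[-1]


def line_midpoint(coords):
    """Point halfway along an open polyline, measured by planar length."""
    segments = list(zip(coords, coords[1:]))
    lengths = [math.dist(a, b) for a, b in segments]
    half = sum(lengths) / 2
    for (a, b), length in zip(segments, lengths):
        if length > 0 and half <= length:
            t = half / length
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        half -= length
    return coords[-1]


def polygon_centroid(coords):
    """Area-weighted centroid of a closed ring (shoelace formula)."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        # Degenerate ring with no area: fall back to the line midpoint.
        return line_midpoint(coords)
    return (cx / (3 * area2), cy / (3 * area2))


def centre_point(coords):
    """Pick the algorithm based on whether the way is closed."""
    return polygon_centroid(coords) if is_closed(coords) else line_midpoint(coords)
```

Treating lon/lat as planar is only reasonable for small features like buildings; a real implementation may well do something more careful.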

so... relations... are a different beast, in order to assemble them in a memory efficient way, we would need to also now load all the way references in to the leveldb and the database is going to get pretty big pretty fast.

there are also numerous edge cases to deal with such as super-relations and 'parent' roles.

as it is, the nodes-only database is ~100GB on disk, because it contains all the nodes (when it could contain only the nodes we need for the ways we need).

so... that's the tl;dr, but the good news is I have been working on a solution for a while now and I have published it as https://github.com/missinglink/pbf

included in that library is a command called json-flat which is basically the same thing as pbf2json, it is much faster and only stores the nodes which we need in the DB, so the leveldb for the whole planet is now ~10GB instead of ~100GB.

Using this I can extract the whole planet to JSON in ~35 mins, plus it has other good things like improved floating point accuracy.

The command itself is not that well documented yet but I have created a branch of this repo which uses the new lib.

In that branch you will find compiled binaries for your system. The json-flat command currently doesn't try to assemble relations, although I plan to add that functionality. There is another command called boundaries which used to be able to extract boundaries, but I think I broke it and it might need some more work to get it working again.

How's your Go?

A while back I published osm-boundaries which is a repo of all the osm boundaries assembled and encoded as geojson, you might be able to just download your geometry from there?

Sorry for the mind-dump, let me know if that's helpful and maybe we can work together on a solution?

@missinglink thanks for sharing that background. The workflow you are describing squares with my understanding based on everything I've been able to read about this issue so far.

I have done a little bit of work experimenting with trying to do this processing using pyosmium (which is osm's python binding for libosmium).

I don't know that much about the internals of it but libosmium seems to include the logic for the node-filtering you described (i.e. they have a function for collecting only the nodes that are included in ways you are trying to reconstruct).

However as you described, if you want Relations as well it does still require a second pass to pre-build the list of which ways you need to collect to construct your relations.

The stuff I had so far is pretty rough still but for what it's worth I just pushed it up here in case it is of any use: https://github.com/worace/osm_denorm.

In my particular use case I'm mostly interested in building polygons, so I'm able to discard most of the nodes and lots of other geometries, which makes the size constraints a little bit simpler. But I still think it's great to have a general-purpose tool like the one you are trying to build.

Unfortunately I've had very little experience working with go, or else I would have tried to dig in more here in depth. But I'm always curious to learn so perhaps I can find some time to read through these projects more closely.

Also, it seems like a fair amount of work has gone into implementing all of the go tooling around this problem (e.g. https://github.com/qedus/osmpbf) -- I'm curious if there were issues you found with osm's existing implementations like libosmium? Or you simply needed the golang interoperability but couldn't get that with the existing tools?

I'm pretty sure I can enable support for multipolygon relations without too much work.

I'll try to have another look tomorrow but it might take a week or so to find time.

Regarding the existing libraries, when I originally wrote the code years ago I wasn't able to find anything that did what I wanted in a satisfactory way. My new lib supports indexing the PBF file, which is nice because you can "rewind" the file and reread parts without having to start at the beginning.

I think an optimal solution is simply to do a first pass to create the index (a list of byte offsets for each block) and a bitset mask (which stores the IDs of all the target nodes/ways/relations), followed by a partial second pass over the ways only, to ensure that all the node references belonging to ways which are members of a relation are included.

Then a full second pass of the file allows caching data in leveldb and writing the json.
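
A rough Python sketch of the mask-building idea, under some loud assumptions: elements are plain dicts held in a list (so they can be re-read), the block index of byte offsets is not modelled at all, and `Bitmask`, `build_masks` and `wanted` are hypothetical names rather than the API of missinglink/pbf.

```python
class Bitmask:
    """Set of non-negative integer IDs stored as one bit per ID."""

    def __init__(self):
        self.bits = bytearray()

    def add(self, i):
        byte, bit = divmod(i, 8)
        if byte >= len(self.bits):
            # Grow with zero bytes so that index `byte` becomes valid.
            self.bits.extend(bytes(byte + 1 - len(self.bits)))
        self.bits[byte] |= 1 << bit

    def __contains__(self, i):
        byte, bit = divmod(i, 8)
        return byte < len(self.bits) and bool(self.bits[byte] & (1 << bit))


def build_masks(elements, wanted):
    """Mark the IDs needed to denormalize every element accepted by `wanted`."""
    nodes, ways, relations = Bitmask(), Bitmask(), Bitmask()

    # First pass: mark matching elements and the direct members of matching relations.
    for el in elements:
        if not wanted(el):
            continue
        if el["type"] == "node":
            nodes.add(el["id"])
        elif el["type"] == "way":
            ways.add(el["id"])
        elif el["type"] == "relation":
            relations.add(el["id"])
            for member in el["members"]:
                if member["type"] == "way":
                    ways.add(member["ref"])
                elif member["type"] == "node":
                    nodes.add(member["ref"])

    # Partial second pass (ways only): relations come last in the file, so the
    # member ways are only known now; mark the nodes of every marked way.
    for el in elements:
        if el["type"] == "way" and el["id"] in ways:
            for ref in el["refs"]:
                nodes.add(ref)

    return nodes, ways, relations
```

The final extraction pass would then skip any element whose ID is not in the relevant mask, so only the marked nodes ever need to be cached.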

With this approach the extraction time for the planet could most likely be 45-60 mins and the RAM and disk usage would be minimised. If you wanted only the relations it could be even faster.

Regarding the assembly code, I tried a few different things but I found the cleanest to be using a function in spatialite which is designed to take an arbitrary collection of polylines and assemble polygons.
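
For a feel of what that assembly step involves, here is a deliberately simplified Python sketch that stitches member ways into closed rings by matching endpoints. It is a stand-in for illustration only, not the spatialite function: it ignores inner/outer roles, self-intersections and touching rings, and `assemble_rings` is a hypothetical name.

```python
def assemble_rings(ways):
    """Greedily join polylines that share endpoints into closed rings.

    Each way is a list of node IDs (or coordinate tuples). Raises ValueError
    when a ring cannot be closed, e.g. because a member way is missing.
    """
    remaining = [list(w) for w in ways if len(w) >= 2]
    rings = []
    while remaining:
        ring = remaining.pop()
        while ring[0] != ring[-1]:
            for i, way in enumerate(remaining):
                if way[0] == ring[-1]:
                    # Way continues the ring in its stored direction.
                    ring.extend(way[1:])
                elif way[-1] == ring[-1]:
                    # Way continues the ring when walked backwards.
                    ring.extend(reversed(way[:-1]))
                else:
                    continue
                remaining.pop(i)
                break
            else:
                raise ValueError("unclosed ring: nothing continues from %r" % (ring[-1],))
        rings.append(ring)
    return rings
```

Even this toy version shows why broken geometries in the source data turn directly into assembly failures.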

Have a look over this thread for a similar discussion: pelias/openstreetmap#81

Also worth mentioning that some boundary relations have a member which is a node. That node marks where the label should be drawn, so it's an easy way of getting a nice center point without doing any polygon assembly at all!
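
A small Python sketch of that shortcut, assuming relation members are dicts with `type`, `ref` and `role` keys, that the label node carries the OSM `label` role, and that node coordinates are available in a lookup table. The `label_centroid` name and the dict shapes are hypothetical.

```python
def label_centroid(relation, node_store):
    """Return the label node's position as a centroid, or None if there isn't one."""
    for member in relation["members"]:
        if member["type"] == "node" and member.get("role") == "label":
            coords = node_store.get(member["ref"])
            if coords is not None:
                lat, lon = coords
                return {"lat": lat, "lon": lon}
    return None


# Hypothetical example data: one outer way and one label node.
relation = {
    "id": 1,
    "members": [
        {"type": "way", "ref": 10, "role": "outer"},
        {"type": "node", "ref": 42, "role": "label"},
    ],
}
node_store = {42: (32.9481789, -96.7297206)}
centroid = label_centroid(relation, node_store)
```

Relations without such a member would still need full polygon assembly to get a centroid.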

I've got a branch going here missinglink/pbf#9

@worace the code is all there now and I tested it, it works great.

I'm still having one issue with cross-compilation which I'm not sure how to solve, due to the dynamically linked C library for spatialite.

If you're on linux I can give you a binary to test?

@missinglink that's awesome! Linux binary would be great. I can give it a try tomorrow (or whenever you're ready -- no rush)

ok, must be late for you, I'll test a little more and link you a copy, I'm thinking for other OS's that I'll just make a docker until I can resolve the xcompile issues

so.. unfortunately there is an issue which affects your DC sample but not my test data and it might take some time for me to think up a nice fix for it.

I tried extracting all the administrative boundaries from the planet file using this config:

$ cat boundary.config 
{
  "relation": [
    [ "name", "boundary=administrative" ]
  ]
}

it took around 20 minutes, which is pretty much how long it takes to parse the 36GB file anyway.

it seems like the success rate was only ~50% so there's probably some more work to be done there, I'm not sure yet why some of them assemble correctly and some don't, it's possibly just because they're messed up geometries that need to be fixed.

$ wc -l boundaries.out boundaries.err 
  192852 boundaries.out
  216509 boundaries.err
  409361 total
$ tail -n2 boundaries.out

{"id":6577259,"type":"relation","tags":{"admin_level":"6","alt_name":"Shirin Tagab District;Koh-i-Saiyād;Shirintagab","boundary":"administrative","int_name":"Shirin Tagab","name":"شیرین‌تگاب","name:en":"Shirin Tagab","name:fa":"ولسوالی شیرین‌تگاب","name:ps":"شيرين تگاب ولسوالۍ","name:tr":"Şirin Tagab","type":"boundary","wikidata":"Q3694468","wikipedia":"en:Shirin Tagab District"},"centroid":{"lat":36.268353,"lon":64.851728}}

{"id":6577487,"type":"relation","tags":{"admin_level":"8","boundary":"administrative","name":"Richardson","type":"boundary"},"centroid":{"lat":32.9481789,"lon":-96.7297206}}

on the plus side, it seems like simpler relations like buildings, etc have a much higher success rate

I probably won't look at this again till Monday but I've compiled a binary for you which runs on linux http://missinglink.geo.s3.amazonaws.com/pbf

This is just a build of this branch so you can compile your own if you feel like it, or if you want to make changes / read over the code.

The binary has a --help but I haven't done a great job of documenting the core concepts yet, it's a little different from how pbf2json worked, at a minimum you'll have to:

  • create a config file which defines the OSM features you want to extract
  • run genmask with the -i flag to generate a mask file and simultaneously index the PBF file
  • run json-flat to extract the data

something like this:

echo 'generate mask';
time ./pbf genmask -c "${CONFIG_FILE}" -i "${PBF_FILE}" "${MASK_FILE}";

echo 'index stats';
./pbf index-info "${PBF_FILE}.idx";

echo 'mask stats';
./pbf bitmask-stats "${MASK_FILE}";

echo 'extract json';
time ./pbf json-flat -m "${MASK_FILE}" -l "${LEVELDB_DIR}" "${PBF_FILE}" 1> "${JSON_FILE}" 2> "${ERROR_FILE}";

you can name MASK_FILE whatever you like, something like dc_buildings.mask, you'll need to manually create the LEVELDB_DIR directory.

your config file is JSON and looks like this:

{
  "node": [
    [ "building" ]
  ],
  "way": [
    [ "building" ]
  ],
  "relation": [
    [ "building" ]
  ]
}

it's similar to how pbf2json worked with some minor differences:

  • you can distinguish between nodes/ways/relations
  • you can add AND conditions by adding new elements horizontally, eg: [ "building", "name" ]
  • you can add OR conditions by adding new elements vertically, eg: [ [ "building" ], [ "shop" ] ]
  • matching key/value pairs used to be cuisine~vegan, now it's cuisine=vegan
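
To spell out how those rules combine, here is a small Python sketch of the matching semantics as described in the list above. It is not the tool's actual matcher: `matches` and `matches_group` are hypothetical names, and treating `key=value` as an exact string comparison is an assumption.

```python
def matches_group(tags, group):
    """AND: every condition in one inner array must hold."""
    for condition in group:
        key, sep, value = condition.partition("=")
        if key not in tags:
            return False
        if sep and tags[key] != value:  # "key=value" form: assume exact match
            return False
    return True


def matches(config, element):
    """OR: an element matches if any inner array for its type matches."""
    groups = config.get(element["type"], [])
    tags = element.get("tags", {})
    return any(matches_group(tags, group) for group in groups)


config = {
    "way": [["building"], ["shop"]],                      # building OR shop
    "relation": [["name", "boundary=administrative"]],    # name AND boundary=administrative
}

shop_way = {"type": "way", "tags": {"shop": "bakery"}}
named_boundary = {
    "type": "relation",
    "tags": {"name": "Richardson", "boundary": "administrative"},
}
unnamed_boundary = {"type": "relation", "tags": {"boundary": "administrative"}}
```

With this config, `shop_way` and `named_boundary` match while `unnamed_boundary` does not, because it has no `name` tag.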

the rest is pretty self explanatory, have a look over the --help and the readme for more hints.

also remember that your DC extract currently isn't supported and it'll probably throw a nasty panic at you, you could try another extract from mapzen metro extracts instead.

I actually think the spatialite C binding is broken. When I run it in docker I get a load of "shaxbee/go-spatialite: spatialite extension not found." errors, because the dynamically linked spatialite C lib is not found on the system.

Not sure how to fix this yet, it would be ideal if I could somehow compile the C code in with the Go code

hey @worace I don't have the time to fix the C binding and cross compilation issues right now, although I'd be happy to accept a PR.

in the mean time, I simply dockerized the build and execution of the pbf executable, I included the C dependency in that docker so it should all run fine.

I'm going to let this sit for a week or so, I'd be grateful if you could test it and offer feedback.

@missinglink thanks for the update; no worries about the native deps, i'll see if they compile for me or try it in docker if not. Going to take some time to try this out today and tomorrow; will let you know how it goes.

@missinglink finally had a chance to try this out on ubuntu. the new binary you provided seems to work and with the right flags I am able to get JSON geometries for ways.

However so far I don't think the relations are working as I seem to be getting this error for all of them:

2017/07/18 18:31:57 spatialite: failed to assemble relation: 7393969

For what it's worth here is the script I've been using to run tests on it:

#!/usr/bin/env bash
cd /tmp
rm -rf pbf_test/

mkdir pbf_test
mkdir pbf_test/leveldb
cd pbf_test

#SOURCE="http://download.geofabrik.de/north-america/us/rhode-island-latest.osm.pbf"
#SOURCE="http://download.geofabrik.de/north-america/us/district-of-columbia-latest.osm.pbf"
SOURCE="http://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf"

wget $SOURCE
wget "http://missinglink.geo.s3.amazonaws.com/pbf"
chmod +x pbf

echo '{"node":[["building"]], "way":[["building"]], "relation":[["building"]]}' > config.json

PBF=$(echo $SOURCE | ruby -e "puts STDIN.read.split('/').last")
MASK="mask"
CONF="config.json"
LEVELDB="leveldb"
OUT="out.json"

./pbf genmask -c $CONF -i $PBF $MASK
./pbf index-info "$PBF".idx
./pbf bitmask-stats $MASK
./pbf json-flat -g -v -m $MASK -l $LEVELDB  $PBF > $OUT

Based on some quick reading it seems that the line I'm hitting is here:

https://github.com/missinglink/pbf/blob/support_relations/handler/denormalized_json.go#L194-L195

Although interestingly I'm not seeing a second line of output that would indicate what the specific error there is, just a bunch of spatialite lines together like this:

2017/07/18 18:39:17 spatialite: failed to assemble relation: 7154816
2017/07/18 18:39:17 spatialite: failed to assemble relation: 7218943
2017/07/18 18:39:17 spatialite: failed to assemble relation: 7236876
2017/07/18 18:39:17 spatialite: failed to assemble relation: 7237053

I haven't had time to look much deeper than that but I'll try to get back to it soon and see if I can read through more of the related code.

I also briefly tried building it on my mac. I didn't (yet) get any errors about building the spatialite dependencies, but was getting some errors there around function signatures for the CLI app when I tried to run it:

go build
# github.com/mattn/go-sqlite3
clang: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-gno-record-gcc-switches' [-Wunused-command-line-argument]
# github.com/mattn/go-sqlite3
clang: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-gno-record-gcc-switches' [-Wunused-command-line-argument]

./pbf genmask -c config.json -i ~/data/osm/osm_rhode_island.pbf mask

ERROR invalid Action type. Must be `func(*Context`)` or `func(*Context) error).  This is an error in the application.  Please contact the distributor of this application if this is not you.See https://github.com/urfave/cli/blob/master/CHANGELOG.md#deprecated-cli-app-action-signature

I haven't spent a ton of time on this yet and it's very possible my golang dev setup isn't quite right since I basically installed it to try this build.

Thank you for publishing this great tool. Any update about the issue? It would help me a lot :-)

I would recommend using the excellent https://docs.osmcode.org/osmium/latest/osmium-export.html tool.

The OSM relation model is convoluted and very difficult to work with, I have some draft code which assembles polygons but for now there are only a couple of tools capable of converting it to geojson.

For reference the tools are:

  • osm2pgsql
  • osmium
  • osm2geojson (code is very hard to follow)

@worace @mdorda, this library now supports relations.

Wow, great work and thanks for the follow up. I have to admit we have been using osmium for a while to get around these issues, but hopefully I will find an occasion to try out pbf2json again soon.

Cool, I'm also a big fan of osmium-tool but the tag filtering isn't as flexible as I'd like it to be.

Let me know how they compare from a performance point-of-view, I'd be interested to hear if it's comparable to the C implementation.