Investigate cost of adding metadata
Possibly add one or all of:
- Changeset ID
- User ID
- Object version number
- Timestamp
We would ignore metadata for nodes that have no tags.
Hi @bdon. I'm a big fan of OSMExpress and the Protomaps extract service. At my company we have some internal tooling that relies on the object version number for caching. We don't need the changeset/user/timestamp. Would you consider adding version numbers to Protomaps extracts? Thanks.
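As an illustration of that kind of version-keyed caching (a hypothetical sketch, not the commenter's actual internal tooling), a cache entry can be reused as long as the object's version is unchanged, with no timestamp needed:

```python
# Hypothetical version-keyed cache: derived data (e.g. built geometry) is
# keyed by (element type, id, version), so an edit -- which bumps the
# version -- naturally invalidates the entry.

class VersionedCache:
    def __init__(self):
        self._store = {}

    def get_or_compute(self, elem_type, elem_id, version, compute):
        key = (elem_type, elem_id, version)
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]

cache = VersionedCache()

# First access computes; a later access with the same version is a hit.
geom1 = cache.get_or_compute("way", 12345, 7, lambda: "expensive geometry v7")
geom2 = cache.get_or_compute("way", 12345, 7, lambda: "never evaluated")
# A version bump (i.e. an edit) forces recomputation.
geom3 = cache.get_or_compute("way", 12345, 8, lambda: "expensive geometry v8")
```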
Are you working with a .osmx locally or just a .pbf extract? If an .osmx, is it a region or the whole planet? I'm wary of implementing this because it will probably double the total db size.
Ideally, metadata is optional, and you won't pay the storage cost for it if you don't use it. But I think this depends on migrating from Cap'n Proto to FlatBuffers (#1) because of how empty fields are stored.
We are just working with .pbf extracts for now.
- added version, timestamp, changeset, uid, username to database
- currently working on a new planet import to confirm the expansion in size is reasonable
- untagged nodes are ignored
- still using capnproto
On an AWS i3.xlarge instance, `osmx expand planet.osm.pbf planet.osmx` took exactly 7 hours and resulted in a 643G planet.osmx file. The expansion in size when adding all metadata (ignoring untagged nodes) should be less than 10% total, so I'd prefer to always include metadata.
download server at http://protomaps.com/extracts now includes version and timestamp information
@invisiblefunnel let me know if this is working for you; I'm working on the ecosystem around these tools so I'm interested in what people are building!
Thanks @bdon! This is great news. I'll take a look this week and reply back.
I just grabbed an extract from protomaps, loaded it into JOSM, fixed a road's name, and uploaded the change. This demonstrates that the extract had the required metadata (`version`). I also manually verified that elements had `edited at` and `edited by` attributes.
However... I cannot use this as a source to change the shape of a road, since most of the way's nodes are tag-less, and you don't provide them with a `version`.
Please reconsider including meta-data (or at least `version`) on tag-less points. That would allow the extract to be used for any type of edit.
> Please reconsider including meta-data (or at least version) on tag-less points.
FWIW this is also a blocker for my use cases, which rely on the ID and version to uniquely identify objects in time and detect changes.
Just to confirm: to make this work for your use cases, only `version` is needed and no other metadata?
Yes, just the version is needed. We don't use timestamps at all.
Confirmed, `version` would make exports usable for editing projects. I can't think of a reason I'd want other meta-data on tag-less nodes, and I'm sure any reason I eventually think of won't justify the cost.
Changed location values from a 64-bit integer to a 96-bit struct that includes the version.

On an AWS i3.xlarge, `osmx expand planet.osm.pbf planet.osmx` took 7.38 hours and resulted in a 666G planet.osmx file, so another 3-5% bump in expand time and planet size. I need to verify now that this is correct and benchmark some extracts, because the page fault rate when accessing locations should be higher now.
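The shape of that 96-bit record can be sketched as follows (an illustrative encoding, assuming fixed-precision 1e-7-degree coordinates; not osmx's actual on-disk layout):

```python
import struct

# Hypothetical 96-bit location record: lon and lat as int32 fixed-precision
# values (units of 1e-7 degrees) plus a uint32 version -- 12 bytes per node
# instead of the previous 8.
LOCATION = struct.Struct("<iiI")  # lon, lat, version

def encode(lon_deg, lat_deg, version):
    return LOCATION.pack(round(lon_deg * 1e7), round(lat_deg * 1e7), version)

def decode(buf):
    lon, lat, version = LOCATION.unpack(buf)
    return lon / 1e7, lat / 1e7, version

rec = encode(-71.0589, 42.3601, 3)
assert len(rec) == 12  # 96 bits
```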
> so another 3-5% bump in expand time and planet size
That seems pretty reasonable. For the augmented diff use case #17, version information is useful for the same reason @invisiblefunnel mentioned above: it allows for unique identification of a particular node in order to match it to its metadata.
> the page fault rate when accessing locations should be higher now
Can you describe this a bit more?
> the page fault rate when accessing locations should be higher now
Locations were previously stored as 64-bit integers. The records for the "Locations" table in the osmx file occupy contiguous pages of storage on disk, ordered by node ID. Adding a 32-bit version number increases the record size by 50%, so fewer records fit on a single disk page.
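The density change is simple arithmetic (assuming a typical 4096-byte page and ignoring LMDB's per-page and per-record overhead, which lowers the real densities):

```python
PAGE_SIZE = 4096  # typical OS/LMDB page size (assumed)

old_record = 8    # 64-bit location only
new_record = 12   # location plus 32-bit version

# Raw records per page, ignoring page headers and key overhead.
old_per_page = PAGE_SIZE // old_record  # 512
new_per_page = PAGE_SIZE // new_record  # 341, i.e. about a third fewer
```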
When `osmx extract` is run, a way's member `nd` references are resolved into lat/lng by seeking over the `locations` table, in order of increasing way ID. This has very poor locality: extracting Boston might include ways 12345 and 12346, but those ways might reference nodes anywhere from 1 to 1000000; the node ID is essentially random (unless it's a set of ways and nodes that were all created around the same time and not edited heavily).
The osmx design (by using LMDB) implements no application-level caching; it relies on the kernel to cache pages as they are retrieved from disk, automatically managing a pool of cached disk pages in RAM. Since the locations table is now less dense, a fetch from Locations is more likely to need a page that has not been fetched yet or has been evicted from cache.
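The density argument can be demonstrated with a toy model (assumed sizes, random node IDs, and no B-tree overhead; purely illustrative): for the same random set of node lookups, the 12-byte records spread the data over more pages, so more distinct pages must be touched.

```python
import random

# Toy model: count distinct "pages" touched when resolving a random set of
# node IDs, for 8-byte vs 12-byte location records. Assumes records are
# packed contiguously by node ID, which is a simplification of LMDB.
PAGE_SIZE = 4096
random.seed(0)
node_ids = random.sample(range(10_000_000), 50_000)

def pages_touched(record_size):
    return len({(nid * record_size) // PAGE_SIZE for nid in node_ids})

# Larger records -> lower density -> more distinct pages per extract.
assert pages_touched(12) > pages_touched(8)
```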
This is just my performance hypothesis, I need to run some benchmarks to determine whether or not it makes any significant difference.
Here's my test region:
```
osmx extract planet.osmx benchmark.osm.pbf --bbox 38.462,-77.519,41.0130,-73.333
```
- first run on versionless planet: 943 seconds
- second run: 919 seconds
- version planet: 873 seconds
- version planet, 2nd time: 773 seconds
`echo 3 > /proc/sys/vm/drop_caches` can be used to clear the page cache, but the extract is probably big enough that it doesn't make a difference. This isn't a very controlled experiment, because the versionless planet has been receiving updates for a few weeks and might be more fragmented. In any case, it doesn't look like adding versions to locations slows things down by much.
@blackboxlogic @invisiblefunnel new planet with versions is now online - can you try on https://protomaps.com/extracts ?
Works perfectly for me. Many thanks @bdon.
Every element has a version number, so the extracts are usable for editing.
Tag-less nodes don't have `edited at`, which is expected. However, I'm noticing that `edited by` and `changeset` are both 0 for all objects. Is that intentional?
Yes, the data is stored but I am intentionally excluding it; that seems to be the convention for GDPR compliance. Is it needed for any of your applications?
I definitely don't need it, but it could be plausibly useful*, and if you're storing it already then there isn't much to gain by withholding it. Other services handle GDPR by offering the PII only to OSM users who have signed in with OAuth, since they have agreed to the terms of service. That would, of course, complicate your service by involving OAuth.
*Possible use-case: a vandal changes all buildings into parks, and I want to remove all `leisure=park` where [vandal's name] was the last editor. I've had to do this sort of thing a few times.
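That kind of cleanup could be scripted roughly like this (a hypothetical data shape and helper, not a real osmx API):

```python
# Hypothetical element records carrying "edited by" metadata; in practice
# these would come from a metadata-bearing extract.
elements = [
    {"id": 1, "tags": {"leisure": "park"}, "user": "vandal"},
    {"id": 2, "tags": {"leisure": "park"}, "user": "trusted_mapper"},
    {"id": 3, "tags": {"building": "yes"}, "user": "vandal"},
]

def suspect_parks(elements, username):
    """IDs of leisure=park objects whose last editor was `username`."""
    return [e["id"] for e in elements
            if e["tags"].get("leisure") == "park" and e["user"] == username]

assert suspect_parks(elements, "vandal") == [1]
```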
I have an auth system built which is separate from OSM OAuth. I could make PII available only to logged-in users.
Can you describe your editing workflow in more detail? I'd like to include it in my SOTM talk, and I can mention your username if that's ok.
Re: "Describe your workflow"
Short version: "I've built up a collection of scripts which can be chained together," but I think that's just called "programming"?
Here's a recent example of my work, but I plan to do more and there are two parts of my pipeline I'm rewriting (pulling data from OSM, and schema translation). One of the more cumbersome parts was retrieving up-to-date large regions of data from OSM. It was awkward for multiple reasons and my future projects will benefit from your work.
If you want a longer description shoot me an email [blackboxlogic at gmail dot com] with your phone number and best time to call, I'd love to chat.
Yes, "ok" to mention my username.
Great, we can discuss over email.