Investigate cost of adding metadata

Question

Closed this issue 5 years ago · 24 comments

Possibly add one or all of:

Changeset ID
User ID
Object version number
Timestamp

We would ignore metadata for nodes that have no tags.

Answer 1 · 2020-02-15T21:53:54.000Z

Hi @bdon. I'm a big fan of OSMExpress and the Protomaps extract service. At my company we have some internal tooling that relies on the object version number for caching. We don't need the changeset/user/timestamp. Would you consider adding version numbers to Protomaps extracts? Thanks.

Answer 2 · 2020-02-15T22:39:52.000Z

Are you working with a .osmx locally or just a .pbf extract? If it's an .osmx, is it a region or the whole planet? I'm wary of implementing this because it will probably double the total db size.

Ideally: metadata is optional, and you won't pay the storage cost for it if you don't use it. but I think this depends on migrating from capnproto to flatbuffers (#1) because of how empty fields are stored.

Answer 3 · 2020-02-15T23:18:14.000Z

We are just working with .pbf extracts for now.

Answer 4 · 2020-06-04T03:38:45.000Z

added version, timestamp, changeset, uid, username to database
currently working on a new planet import to confirm the expansion in size is reasonable
- untagged nodes are ignored
- still using capnproto

Answer 5 · 2020-06-04T11:21:57.000Z

on an AWS i3.xlarge instance, osmx expand planet.osm.pbf planet.osmx took exactly 7 hours and resulted in a 643G planet.osmx file. The expansion in size when adding all metadata (ignoring untagged nodes) should be less than 10% total, so I'd prefer to always include metadata.

Answer 6 · 2020-06-05T03:13:53.000Z

download server at http://protomaps.com/extracts now includes version and timestamp information

@invisiblefunnel let me know if this is working for you; I'm working on the ecosystem around these tools so I'm interested in what people are building!

Answer 7 · 2020-06-09T01:41:46.000Z

Thanks @bdon! This is great news. I'll take a look this week and reply back.

Answer 8 · 2020-06-09T14:32:47.000Z

I just grabbed an extract from protomaps, loaded it into josm, fixed a road's name, and uploaded the change. This demonstrates that the extract had the required meta-data (version). I also manually verified that elements had edited at and edited by attributes.

However... I cannot use this as a source to change the shape of a road, since most of the way's nodes are tag-less, and you don't provide them with a version.

Please reconsider including meta-data (or at least version) on tag-less points. That would allow the extract to be used for any type of edit.

Answer 9 · 2020-06-09T14:46:26.000Z

Re: "Please reconsider including meta-data (or at least version) on tag-less points."

FWIW this is also a blocker for my use cases which rely on the ID and version to ~~uniquely identify objects in time~~ detect changes.
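To illustrate the change-detection use case, here is a minimal sketch that compares two snapshots keyed by element type and ID. The snapshot shape and the `diff_snapshots` name are assumptions made for illustration, not part of osmx or any Protomaps tooling.

```python
from typing import Dict, Set, Tuple

# Hypothetical snapshot shape: (element type, id) -> version number.
Snapshot = Dict[Tuple[str, int], int]


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, Set[Tuple[str, int]]]:
    """Classify elements as created, deleted, or modified by comparing versions."""
    old_keys, new_keys = set(old), set(new)
    return {
        "created": new_keys - old_keys,
        "deleted": old_keys - new_keys,
        # Present in both snapshots, but the version number differs.
        "modified": {k for k in old_keys & new_keys if old[k] != new[k]},
    }


old = {("node", 1): 1, ("node", 2): 3, ("way", 10): 2}
new = {("node", 1): 1, ("node", 2): 4, ("way", 11): 1}
changes = diff_snapshots(old, new)
```

If tag-less nodes carry no version, a moved node looks identical in both snapshots under this scheme, which is why the missing versions are a blocker.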

Answer 10 · 2020-06-09T14:46:57.000Z

just to confirm - to make this work for your use cases only version is needed and no other metadata?

Answer 11 · 2020-06-09T14:57:29.000Z

Yes, just the version is needed. We don't use timestamps at all.

Answer 12 · 2020-06-09T15:16:31.000Z

Confirmed, version would make exports usable for editing projects. I can't think of a reason I'd want other meta-data on tag-less nodes, and I'm sure any reason I eventually think of won't justify the cost.

Answer 13 · 2020-06-10T12:02:17.000Z

changed location values from a 64-bit integer to a 96-bit struct that includes the version

AWS i3.xlarge: osmx expand planet.osm.pbf planet.osmx took 7.38 hours and resulted in a 666G planet.osmx file. so another 3-5% bump in expand time and planet size. need to verify now that this is correct and benchmark some extracts, because the page fault rate when accessing locations should be higher now.
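As a quick check on the quoted bump, the arithmetic can be written out using the two planet imports reported in this thread (7 hours and 643G before, 7.38 hours and 666G after). The helper name is just for illustration.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures reported in this thread for the two planet imports.
time_growth = percent_increase(7.0, 7.38)   # expand time in hours
size_growth = percent_increase(643, 666)    # planet.osmx size in G

print(f"expand time: +{time_growth:.1f}%")
print(f"planet size: +{size_growth:.1f}%")
```

This works out to about 5.4% for expand time and 3.6% for file size, roughly in line with the 3-5% estimate.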

Answer 14 · 2020-06-10T13:36:58.000Z

Re: "so another 3-5% bump in expand time and planet size"

That seems pretty reasonable. For the augmented diff use case #17, version information is useful for the same reason @invisiblefunnel mentioned above: it allows for unique identification of a particular node in order to match it to its metadata.

Re: "the page fault rate when accessing locations should be higher now"

Can you describe this a bit more?

Answer 15 · 2020-06-10T14:00:43.000Z

Re: "the page fault rate when accessing locations should be higher now"

Locations were previously stored as 64-bit integers. The records for the "Locations" table in the osmx file occupy contiguous pages of storage on disk, ordered by node ID. Adding a 32-bit version number increases the record size by 50%, so fewer records fit on a single disk page.
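A rough sketch of the density change, under stated assumptions: a 4096-byte page, a location modeled as two 32-bit fixed-point coordinates, and no allowance for keys or page headers. The field layout is hypothetical and is not taken from the osmx source.

```python
import struct

PAGE_SIZE = 4096  # assumed page size in bytes; the real value depends on the system

# Hypothetical layouts, little-endian with no padding.
OLD_RECORD = struct.Struct("<ii")    # two int32 coordinates = 64 bits
NEW_RECORD = struct.Struct("<iiI")   # coordinates + uint32 version = 96 bits


def records_per_page(record_size: int, page_size: int = PAGE_SIZE) -> int:
    """How many fixed-size records fit on one page, ignoring keys and headers."""
    return page_size // record_size


old_per_page = records_per_page(OLD_RECORD.size)
new_per_page = records_per_page(NEW_RECORD.size)

# Example record: coordinates as 1e-7 degree integers (an assumption), version 3.
packed = NEW_RECORD.pack(-710589000, 423601000, 3)
```

Under these assumptions a page holds 512 of the old records but only 341 of the new ones, so the same range of node IDs spans about 1.5 times as many pages.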

When osmx extract is run, a way's member nd references are resolved into lat/lng by seeking over the locations table, in order of increasing way ID. This has very poor locality: extracting Boston might include ways 12345 and 12346, but those ways might reference nodes anywhere from 1 to 1000000. The node ID is essentially random (unless it's a set of ways and nodes that were all created around the same time and not edited heavily).

The osmx design (by using LMDB) implements no application-level caching. It relies on the kernel to cache pages as they are retrieved from disk; the kernel automatically manages a pool of cached disk pages in RAM. Since the locations table is now less dense, it's more likely when fetching locations that you will need a page that has not been fetched yet or has been evicted from cache.

This is just my performance hypothesis, I need to run some benchmarks to determine whether or not it makes any significant difference.
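The hypothesis can be illustrated with a toy simulation. It assumes uniformly random node lookups and a plain LRU cache of fixed size, which is a simplification of how the kernel page cache actually behaves; the records-per-page values assume 4096-byte pages with 8-byte and 12-byte records.

```python
import random
from collections import OrderedDict


def miss_rate(node_ids, records_per_page, cache_pages):
    """Fraction of lookups that touch a page not currently in an LRU cache."""
    cache = OrderedDict()
    misses = 0
    for node_id in node_ids:
        page = node_id // records_per_page
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return misses / len(node_ids)


rng = random.Random(0)
lookups = [rng.randrange(1_000_000) for _ in range(50_000)]

# Same lookups and same cache size; only the record density differs.
dense = miss_rate(lookups, records_per_page=512, cache_pages=500)
sparse = miss_rate(lookups, records_per_page=341, cache_pages=500)
```

With the same cache size, the less dense layout misses more often in this model. Whether that shows up in practice is exactly what the benchmarks need to answer.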

Answer 16 · 2020-06-11T03:34:51.000Z

Here's my test region:

osmx extract planet.osmx benchmark.osm.pbf --bbox 38.462,-77.519,41.0130,-73.333

first run on versionless planet: 943 seconds
second run: 919 seconds

version planet: 873 seconds
version planet 2nd time: 773 seconds

echo 3 > /proc/sys/vm/drop_caches can be used to clear the page cache, but the extract is probably big enough that it doesn't make a difference. This isn't a very controlled experiment because the versionless planet has been updated continuously for a few weeks and might be more fragmented. In any case, it doesn't look like adding versions to locations negatively affects the speed by that much.
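For a somewhat more controlled comparison, a small timing harness along these lines could run the same command several times, optionally dropping the page cache before each run. This is only a sketch: the function names are made up for illustration, and dropping caches requires root on Linux.

```python
import subprocess
import sys
import time


def drop_page_cache():
    """Flush dirty pages, then ask the kernel to drop caches (Linux, needs root)."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def time_command(cmd, runs=2, before_each=None):
    """Run `cmd` `runs` times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(runs):
        if before_each is not None:
            before_each()
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


# Harmless demo command. For the real benchmark the list would be:
# ["osmx", "extract", "planet.osmx", "benchmark.osm.pbf",
#  "--bbox", "38.462,-77.519,41.0130,-73.333"]
demo = time_command([sys.executable, "-c", "pass"], runs=2)
```

Passing `before_each=drop_page_cache` would give cold-cache timings for every run instead of one cold run followed by warm ones.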

Answer 17 · 2020-06-11T04:00:11.000Z

@blackboxlogic @invisiblefunnel new planet with versions is now online - can you try on https://protomaps.com/extracts ?

Answer 18 · 2020-06-13T22:35:56.000Z

Works perfectly for me. Many thanks @bdon.

Answer 19 · 2020-06-14T12:24:53.000Z

Every element has a version number, so the extracts are usable for editing.
Tag-less nodes don't have edited at, which is expected. However, I'm noticing that edited by and changeset are both 0 for all objects. Is that intentional?

Answer 20 · 2020-06-14T12:33:49.000Z

Yes, the data is stored but I am intentionally excluding it. That seems to be the convention for GDPR compliance. Is that needed for any of your applications?

Answer 21 · 2020-06-14T12:56:59.000Z

I definitely don't need it but it could be plausibly useful* and if you're storing it already then there isn't much to gain by withholding it. Other services handle GDPR by offering the "pii" only to OSM users who have signed in with OAuth, since they have agreed to terms of service. That would, of course, complicate your service by involving OAuth.

*Possible use-case: A vandal changes all buildings into parks, I want to remove all leisure=park where [vandal's name] was the last editor. I've had to do this sort of thing a few times.
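A minimal sketch of that filter, assuming elements have already been parsed into dictionaries with `tags` and `user` keys. The element shape and the function name are hypothetical; a real pipeline would read them from a .pbf or .osm file.

```python
def parks_last_edited_by(elements, username):
    """Return elements tagged leisure=park whose last editor is `username`."""
    return [
        e for e in elements
        if e.get("tags", {}).get("leisure") == "park" and e.get("user") == username
    ]


# Made-up sample data for illustration only.
elements = [
    {"type": "way", "id": 1, "user": "vandal", "tags": {"leisure": "park"}},
    {"type": "way", "id": 2, "user": "mapper", "tags": {"leisure": "park"}},
    {"type": "way", "id": 3, "user": "vandal", "tags": {"building": "yes"}},
    {"type": "node", "id": 4, "user": "vandal"},
]
suspects = parks_last_edited_by(elements, "vandal")
```

This only works when the extract carries the last editor's name, which is the metadata currently being withheld.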

Answer 22 · 2020-06-14T15:11:50.000Z

I have an auth system built which is separate from OSM OAuth. I could make PII only available to logged-in users.
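One way this gating could look, as a sketch only: anonymous requests get the user and changeset fields zeroed, as the extracts currently do, while version and timestamp pass through. The field names and the `redact` function are assumptions for illustration, not the actual Protomaps implementation.

```python
# Fields treated as PII in this sketch (an assumption based on this thread).
PII_FIELDS = ("uid", "user", "changeset")


def redact(element: dict, authenticated: bool) -> dict:
    """Return a copy of `element`, blanking PII fields for anonymous requests."""
    out = dict(element)
    if authenticated:
        return out
    for field in PII_FIELDS:
        if field in out:
            # Strings become empty, numeric fields become 0.
            out[field] = "" if isinstance(out[field], str) else 0
    return out


element = {"id": 42, "version": 3, "timestamp": 1591800000,
           "uid": 1234, "user": "someone", "changeset": 5678}
anonymous_view = redact(element, authenticated=False)
logged_in_view = redact(element, authenticated=True)
```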

Can you describe your editing workflow in more detail? I'd like to include it in my SOTM talk and I can mention your username if that's ok.

Answer 23 · 2020-06-14T16:01:07.000Z

Re: "Describe your workflow"
Short version: "I've built up a collection of scripts which can be chained together" but I think that's just called "programming"?
Here's a recent example of my work, but I plan to do more and there are two parts of my pipeline I'm rewriting (pulling data from OSM, and schema translation). One of the more cumbersome parts was retrieving up-to-date large regions of data from OSM. It was awkward for multiple reasons and my future projects will benefit from your work.
If you want a longer description shoot me an email [blackboxlogic at gmail dot com] with your phone number and best time to call, I'd love to chat.

Yes, "ok" to mention my username.

Answer 24 · 2020-06-15T02:26:46.000Z

Great, we can discuss over email.