Investigate string pools
Closed this issue · 5 comments
Curious, have you considered using a string pool for storing frequent tags? Currently, OSMExpress stores all tags as `:List(Text)`, but looking at taginfo I wonder if it might be worth representing the 64K most frequent tags as 16-bit integers. The numeric tag IDs might get assigned when an OSMExpress database is initially getting built from a planet, and never change during the database lifetime. If anyone happens to give this a try, I’d be curious about how much space this would save in practice. Of course it would make the codebase more complicated, also for clients who just want to decode an OSMExpress database. So, as always, there’d be a tradeoff.
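To make the proposal concrete, here is a minimal sketch of how such a pool could be assigned at build time. It is not OSMExpress code: `build_tag_pool`, `MAX_POOL_SIZE`, and the sample data are hypothetical, and it assumes one pass over all (key, value) pairs of the planet file before any elements are written.

```python
from collections import Counter

# Hypothetical limit: 65,536 IDs fit in an unsigned 16-bit integer.
MAX_POOL_SIZE = 1 << 16


def build_tag_pool(tags, max_size=MAX_POOL_SIZE):
    """Map the most frequent (key, value) pairs to IDs 0..N-1.

    The most frequent pair gets ID 0. Pairs that do not make it into
    the pool would still be stored as plain text.
    """
    counts = Counter(tags)
    ranked = counts.most_common(max_size)
    return {tag: tag_id for tag_id, (tag, _count) in enumerate(ranked)}


# Tiny stand-in for the tags seen while importing a planet file.
sample = [
    ("highway", "residential"),
    ("building", "yes"),
    ("building", "yes"),
    ("highway", "residential"),
    ("building", "yes"),
    ("name", "Main Street"),
]
pool = build_tag_pool(sample, max_size=2)
```

With `max_size=2`, only `building=yes` and `highway=residential` receive IDs, and `name=Main Street` stays as text. Because the IDs depend on the frequency ranking at build time, the pool itself would have to be stored in the database so that clients can decode it.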
Yes, I attempted it at the beginning. The storage space savings were 5-10%, which is quite a bit, but OSMExpress eats lots of disk anyways, so I decided it wasn't worth making the code more complicated :)
I followed the same scheme imposm3 uses, which is to encode common tags in the Unicode Private Use Area: https://github.com/omniscale/imposm3/blob/master/cache/binary/tags.go#L140
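As a rough illustration of the general idea (not a port of the imposm3 code linked above), the sketch below replaces each common (key, value) pair with a single Private Use Area character inside an otherwise ordinary list of strings. The `COMMON_TAGS` table, the function names, and the flat key/value list layout are assumptions made for this example.

```python
PUA_START = 0xE000  # first code point of the Basic Multilingual Plane Private Use Area

# Hypothetical table of common tags; a real table would come from tag statistics.
COMMON_TAGS = [("building", "yes"), ("highway", "residential"), ("oneway", "yes")]
TAG_TO_CHAR = {tag: chr(PUA_START + i) for i, tag in enumerate(COMMON_TAGS)}
CHAR_TO_TAG = {char: tag for tag, char in TAG_TO_CHAR.items()}


def encode_tags(tags):
    """Flatten (key, value) pairs into strings, one PUA character per common tag."""
    out = []
    for key, value in tags:
        char = TAG_TO_CHAR.get((key, value))
        if char is not None:
            out.append(char)           # common tag: a single escape character
        else:
            out.extend([key, value])   # uncommon tag: key and value as plain text
    return out


def decode_tags(strings):
    """Invert encode_tags, preserving the original tag order."""
    tags = []
    it = iter(strings)
    for s in it:
        if s in CHAR_TO_TAG:
            tags.append(CHAR_TO_TAG[s])
        else:
            tags.append((s, next(it)))  # the following string is the value
    return tags


original = [("building", "yes"), ("name", "Main Street"), ("oneway", "yes")]
encoded = encode_tags(original)
decoded = decode_tags(encoded)
```

The list stays a list of strings, so the schema does not change and tag order survives the round trip. The caveat is that a real key consisting of exactly one of the reserved characters would be misread on decode.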
Have you tried `:List(UInt16)` as a sibling to `:List(Text)`? The Unicode PUA character hack needs 4 bytes per tag, whereas an explicit integer only 2. However, one would lose the original tag ordering. Interestingly, imposm3 uses only 166 common tags, so a `:List(UInt8)` would do. Intuitively, I’d have used a larger set of common tags (64K), but of course one would have to measure the size impact.
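A small sketch of the sibling-list idea and of the ordering loss mentioned above. The pool contents and the names `split_tags` and `join_tags` are hypothetical, and the two Python lists stand in for the integer and text fields of the serialized record.

```python
# Hypothetical pool of common tags and their numeric IDs.
COMMON_TAG_IDS = {("building", "yes"): 0, ("highway", "residential"): 1}
ID_TO_TAG = {tag_id: tag for tag, tag_id in COMMON_TAG_IDS.items()}


def split_tags(tags):
    """Split tags into a list of pool IDs and a flat key/value text list."""
    common_ids, other = [], []
    for tag in tags:
        tag_id = COMMON_TAG_IDS.get(tag)
        if tag_id is None:
            other.extend(tag)          # stored as text: key, value, key, value, ...
        else:
            common_ids.append(tag_id)  # stored as a small integer
    return common_ids, other


def join_tags(common_ids, other):
    """Rebuild the tag list; common tags come first, then the text tags."""
    tags = [ID_TO_TAG[i] for i in common_ids]
    tags += list(zip(other[0::2], other[1::2]))
    return tags


original = [("name", "Main Street"), ("highway", "residential")]
common_ids, other = split_tags(original)
restored = join_tags(common_ids, other)
```

Here `restored` contains the same tags as `original`, but `highway=residential` now comes before `name=Main Street`, because the two lists no longer record how their entries were interleaved.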
I didn't investigate it further. Maybe I will when I get a chance to look at this again (and likely migrate to FlatBuffers instead of Cap'n Proto). Compact storage isn't really a design goal of this system; the priority is keeping the codebase small and maintainable.