mapbox/geobuf

Geobuf format and streaming writes

Opened this issue · 20 comments

joto commented

/cc @mourner @springmeyer @artemp

The problem

The geobuf format as it is currently defined can't be written as a stream. Instead, the whole data has to be assembled in memory first and then written out. This follows directly from the way Protobuf encodes its messages. A good description of the problem can be found in the header comments of the protobuf writer of the UPB library. In short, the problem is that Protobuf uses nested messages to encode the data, and each message has a length header encoded as a varint. But we can't write out the length (or even know how long the length field will be, since a varint has variable length) before the whole message has been assembled.
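
To make the length-prefix problem concrete, here is a rough sketch in JavaScript (illustrative only, not taken from the geobuf codebase) of how a length-delimited Protobuf field is laid out. Both the field payload and its varint-encoded length must be fully known before the first byte of the field can be emitted:

    // Why nested messages resist streaming: the length prefix is a varint,
    // so even its own size depends on the payload, and the payload must be
    // fully assembled before the field can be written out.
    function writeVarint(n) {
      const bytes = [];
      do {
        bytes.push((n & 0x7f) | (n > 0x7f ? 0x80 : 0));
        n >>>= 7;
      } while (n > 0);
      return Buffer.from(bytes);
    }

    function writeLengthDelimited(fieldNumber, payload /* Buffer */) {
      const key = writeVarint((fieldNumber << 3) | 2); // wire type 2 = length-delimited
      const length = writeVarint(payload.length);      // unknown until payload is complete
      return Buffer.concat([key, length, payload]);
    }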

This is, of course, a major problem for a format that is intended for huge files.

A possible solution: Remove outermost message

Because this is an inherent limitation of the Protobuf format, we have to look outside the format for a solution. Of course we could throw away the whole Protobuf format, but that's not needed. What's needed is a wrapper around it so that the Protobuf encoder/decoder only sees part of the data. In the simplest case we encode the data in pieces:

We remove the outermost message Data and then write each of the other data pieces as their own Protobuf messages. We might want to move the keys, dimensions, and precision into a message Header or so. The oneof data_type construct doesn't work any more; instead we just have to keep reading messages until EOF. But we don't know which kinds of messages will be in there (Feature, FeatureCollection, etc.), so we have to add this information to the newly introduced message Header in some way and then parse accordingly. This should certainly be doable and doesn't require a lot of change. But I think there is...

A better solution: Chunking the data

This looks slightly more complicated at first but has many advantages, so bear with me. Let's encode the data in chunks, where each chunk gets a length field and the data:

CHUNK
    LENGTH (4 bytes)
    DATA (LENGTH bytes)
CHUNK
    LENGTH (4 bytes)
    DATA (LENGTH bytes)
...

Each DATA block is a complete Protobuf message which can be parsed on its own. This idea is, of course, not new. It is what the OpenStreetMap OSM PBF is doing.

A typical DATA field will contain maybe a few thousand geometries or features. Note that this does not mean that the contents of the different chunks are somehow logically distinct. Logically this is still one data stream. The chunking is purely an encoding matter, and files with the same data split up into different-sized chunks would still represent the same data.
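
A minimal sketch of what writing and reading such chunks could look like in Node.js (lengths assumed to be 4-byte big-endian, as in the layout above; this is an illustration, not a spec):

    // write one chunk: 4-byte length followed by a self-contained Protobuf message
    function writeChunk(outStream, messageBuffer) {
      const length = Buffer.alloc(4);
      length.writeUInt32BE(messageBuffer.length, 0);
      outStream.write(length);
      outStream.write(messageBuffer);
    }

    // read chunks back, yielding one complete message at a time
    function* readChunks(fileBuffer) {
      let offset = 0;
      while (offset + 4 <= fileBuffer.length) {
        const length = fileBuffer.readUInt32BE(offset);
        yield fileBuffer.subarray(offset + 4, offset + 4 + length);
        offset += 4 + length;
      }
    }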

This format has some added advantages:

  • the LENGTH field can tell us how much memory to allocate for buffering the DATA part
  • reading and writing can be done in parallel, because several threads can work on encoding/decoding different chunks at the same time. In fact, that's what Libosmium does when parsing OSM PBF.
  • concatenating two files is trivial: deal with the headers, then just copy data chunks

Now it gets a little bit more complicated than that. (This is again based on experience with the OSM PBF format.) The first chunk should probably contain some kind of header. This could include metadata such as the dimensions setting and the keys. All other chunks contain the data itself. So chunks (OSM PBF calls them Blobs) need to contain some kind of type identifier to differentiate between a header chunk and a data chunk. OSM PBF does this by adding another level of Protobuf encoding (see the BlobHeader and Blob messages), which seems like overkill to me. It makes the implementation rather confusing and probably slower. Instead we can just add a type field:

CHUNK
    HEADER (fixed size)
        TYPE (1 byte, first chunk always =META)
        LENGTH (4 bytes)
    DATA (LENGTH bytes)
CHUNK
    HEADER (fixed size)
        TYPE (1 byte, following chunks always =GEOMDATA)
        LENGTH (4 bytes)
    DATA (LENGTH bytes)
...

Strictly speaking we could live without that TYPE field, because the header always has to be the first chunk and all following chunks have to be data, but it seems cleaner and gives us more flexibility to have this type. And maybe we want to have different types of data? This is something which has to be explored.

OSM PBF adds another useful feature: encoding chunks (or Blobs) with zlib or other compression formats. Each chunk can be optionally compressed and the type of compression is noted in the chunk header. This can squeeze the last bytes out of the resulting files, while it is still possible to encode and decode the file in chunks and in parallel. To add this we need, again, some type field. And we should also store the size of the uncompressed data, because it allows us to give the decompressor a buffer of the correct size.

Note that I have used 4-byte length fields in my description. This is probably enough for each chunk; in fact chunks should not get too big, because each one has to fit into memory after all (several, if we encode/decode in parallel). OSM PBF has some extra constraints on the sizes of different structures which can help implementations, because fixed-size buffers can be used.

Note also that there is no overall length field for the whole file. That's important, because it allows us to stream-write the data without knowing beforehand how many features the file will contain or how big it will be. (We might want to end the file with some END chunk that marks the end of the file to guard against truncation. Optionally it could contain a checksum. This is something that OSM PBF is missing, but it could be useful to detect data corruption.)

And while we are at it, I suggest adding a fixed 4-byte (or so) magic header that is always the same and can be used by tools such as file to determine the file type easily, and a fixed-size version field for future-proofing the format.

This brings us to something like this:

MAGIC (fixed size)
VERSION (=1, fixed size)
CHUNK
    HEADER (fixed size)
        TYPE (1 byte, first chunk always =META)
        COMPRESSION_TYPE (1 byte)
        RAW_LENGTH (4 bytes)
        ENCODED_LENGTH (4 bytes)
    DATA (ENCODED_LENGTH bytes)
CHUNK
    HEADER (fixed size)
        TYPE (1 byte, following chunks always =GEOMDATA)
        COMPRESSION_TYPE (1 byte)
        RAW_LENGTH (4 bytes)
        ENCODED_LENGTH (4 bytes)
    DATA (ENCODED_LENGTH bytes)
...
CHUNK
    HEADER (fixed size)
        TYPE (1 byte, last chunk always =END)
        COMPRESSION_TYPE (1 byte)
        RAW_LENGTH (4 bytes)
        ENCODED_LENGTH (4 bytes)
    DATA (ENCODED_LENGTH bytes)
        CHECKSUM

Some padding might be necessary to have length fields on 4-byte boundaries etc. And all length fields should probably be encoded in network byte order. Those details can be worked out.
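
As an illustration only (the magic bytes, type codes, field sizes, and the choice of zlib below are assumptions, not a settled spec), reading such a file could look roughly like this:

    const zlib = require('zlib');

    const MAGIC = Buffer.from('GEOB');              // hypothetical 4-byte magic
    const TYPE = { META: 0, GEOMDATA: 1, END: 2 };  // hypothetical type codes
    const COMPRESSION = { NONE: 0, ZLIB: 1 };

    function* readChunks(buf) {
      if (!buf.subarray(0, 4).equals(MAGIC)) throw new Error('not a geobuf stream');
      if (buf.readUInt32BE(4) !== 1) throw new Error('unsupported version');
      let offset = 8;
      while (offset < buf.length) {
        const type = buf.readUInt8(offset);
        const compressionType = buf.readUInt8(offset + 1);
        const rawLength = buf.readUInt32BE(offset + 2);     // size when uncompressed
        const encodedLength = buf.readUInt32BE(offset + 6); // size as stored in the file
        let data = buf.subarray(offset + 10, offset + 10 + encodedLength);
        if (compressionType === COMPRESSION.ZLIB) data = zlib.inflateSync(data);
        if (data.length !== rawLength) throw new Error('length mismatch');
        yield { type, data };                       // data is one complete Protobuf message
        offset += 10 + encodedLength;
        if (type === TYPE.END) break;
      }
    }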

Inside the DATA we'd still use the Protobuf-encoded data (nearly) as before. No big change there. For some things, such as the keys, we have to discuss whether they fit better in the META header or the DATA part. In the META and END blocks we can use Protobuf, too, or any other encoding. Because that's not a lot of data, it isn't that important to pack it tightly, and using a simpler format might allow simpler access to the metadata. On the other hand, Protobuf is tried and true and allows for easy extensibility.

tmcw commented

๐Ÿ‘ awesome, thanks for writing this out

Wow, that's a lot of information to process. Thanks for the detailed write-up, I'll follow up with comments.

@joto Chunking, yes... I've been thinking about this in the context of GeoJSON: https://github.com/geojson/geojson-feature-sequences. Might be useful for geobuf, yes.

Thought about this for a while... Before we decide whether to chunk or not to chunk, there's another problem to solve.

One of the Geobuf requirements is lossless GeoJSON compression, meaning that Geobuf should retain the full structure of the original file to be able to recreate it. So we can't limit Geobuf to a particular sequence of data types.

GeoJSON is arbitrarily nested: a FeatureCollection can contain a Feature which contains a GeometryCollection which contains other geometries, some of which might be GeometryCollections as well, etc. That's a big problem for streamable writing if we encode nesting by means of embedded messages. So there's an obvious alternative: completely flatten the structure, encoding things sequentially, and add additional fields that we can use to recreate the tree. There are several ways we could do that:

  1. Autogenerate internal ids for each "parent" object and then reference the ids from children.
  2. Reference parents by the index at which each parent is encoded in the message (e.g. obj1, obj2, obj2_child would give the child a parent field of 1, referring to obj2); see the sketch after this list.
  3. For each parent object, indicate how many children follow in the encoded message.
  4. Your idea.
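
To illustrate option 2 (a sketch only; the field names are made up), a flat, depth-first encoding with parent indices can be turned back into a tree in a single pass:

    // flat objects in encoding order; `parent` is the index of the parent object
    const flat = [
      { type: 'FeatureCollection' },                     // index 0
      { type: 'Feature', parent: 0 },                    // index 1
      { type: 'Point', parent: 1, coordinates: [0, 0] }, // index 2
      { type: 'Feature', parent: 0 }                     // index 3
    ];

    function rebuildTree(objects) {
      const roots = [];
      for (const obj of objects) {
        if (obj.parent === undefined) {
          roots.push(obj);                               // no parent reference: a root
        } else {
          const parent = objects[obj.parent];
          (parent.children = parent.children || []).push(obj);
        }
      }
      return roots;
    }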

Thoughts?

@mourner first off, my bias: I was hoping to use geobuf as a transport and possibly storage format for smallish (maybe a few hundred MB or less) simple features (no arbitrary nesting). These would all be coming from other GIS formats with much flatter data structures. I wasn't thinking of the chunked streaming case so well described by @joto, since I was thinking we'd subdivide data across geobufs as needed and deal with associations between them out-of-band. The solution proposed by @joto looks like it would allow geobuf to go MUCH bigger, without too much overhead for the case of small features (depending on chunk size), so I'm not at all opposed if it can be done without losing too much of the existing compression benefits of geobuf or adding significantly to the implementation complexity.

Some things to consider:

  1. lossless does not have to mean support for all possible variants of GeoJSON; it should mean that for whatever variants are supported, no information / precision should be lost along the way, and unsupported variants rejected outright. At present, I don't think it would be unreasonable to limit geobuf to a well-stated level of collections within collections (e.g., 1-2 levels deep), and stop at that. But I'm biased.

  2. How would you allow child collections to be split across chunks? If you don't allow them to split, this may make the chunks arbitrarily big; if you do split them, you have to retain additional information about prior chunks as you stream-process each new chunk (to look back and find the parents). I think your options 1-2 could work in this case, but they may require building a lookup index to the parents or keeping them entirely in memory. This problem becomes much easier if each chunk is self-contained and doesn't need to traverse cross-references to other chunks.

  3. there may be some ideas lurking in Google's Dremel approach (summarized here) that could be useful with respect to the nested data structure issues.

  4. with respect to chunking, the terminating chunk can possibly be simplified down to an empty message, similar to HTTP chunked encoding

@sgillies how are you planning to handle arbitrary nesting of collections in GeoJSON feature sequences?

joto commented

I don't think this arbitrary nesting is that useful in practice. Can you cite any examples where this is used in some real-world application? Because I can't see how this rather complex format will actually be used productively, I can't really see how best to implement it (and whether it is actually worth implementing).

Here is another way of seeing this: GeoJSON could have been written as two standards: One that describes how to encode Simple Feature Geometries in JSON, the other how to assemble those geometries plus extra information into higher level structures. The first part is useful without the second and is the base for the second at the same time. GeoJSON didn't do it that way, but maybe we could.

We could have one standard that describes how to efficiently encode Simple Feature Geometries and then (possibly multiple) further "standards" that describe how this can be used to describe larger structures. This way we can standardize now what we actually need for our own uses and later add further standards if the need arises. That family of standards would allow some code reuse while not burdening us now with a complexity we are not sure we'd really need.

@brendan-ward the only thing I am sure about with respect to arbitrary nesting of geometry collections is that serializing nested geometry collections and serving them to users shows an indifference to other developers verging on malice. Multi-linestrings and multi-polygons are useful in our models so that we can do antimeridian cutting of geometries in geographic coordinates while continuing to attest that there is a single geometric object. Generalized geometry collections exist in GeoJSON because they existed in OGC simple features, and they exist in the OGC standard because it was written by people who crave complexity and can't say "no." I think geobuf can say no to data like

{ "geometry": 
  { "type": "GeometryCollection", 
    "geometries": [
       { "type": "GeometryCollection", 
         "geometries": [ 
           { "type": "Point", "coordinates": [0, 0] } 
         ] } ] } }

if we want.

Revisiting the issue again. Some points:

  • Agreed with @brendan-ward and @sgillies that we don't have to support unlimited levels of nesting; we can say no to anything beyond 1-2 levels. However, my feeling is that if we do implement 1-2 levels of nesting with a flat structure and reference fields, the implementation will be inductive in the sense that if 1-2 levels work, more levels will work with the same approach, so the limitation would be artificial, not technical.
  • I'm not in favor of chunking because if we want to win with Geobuf, we need to put our stakes on simplicity. Chunking creates complexity that I'd like to avoid. If we can support easy streaming reads/writes by just flattening the structure and avoiding deeply nested messages, I'd go for that instead because it's much simpler.
  • Even if we limit nesting to 1-2 levels, we still need to decide on the technical approach to implementing the flattened encoding of features. So my comment #37 (comment) remains relevant and still needs feedback.

@mourner from a purely streaming perspective, I think your option 2 wins for simplicity, so long as one tracks the current state while parsing from geobuf and builds an index of features by order in the buffer; then looking back to the parent is easy, and doesn't require extra storage in the parent to accomplish this like in option 1.

Perhaps an even easier way to approach this when the tree is encoded depth-first is simply to use the state of a variable (let's call it parent) with one of 3 states to indicate the nesting level (absent means no parent; switch between 0 and 1 for each level). Each new value of 0 or 1 indicates that the node is the first child and the previous node is its parent. Take this example, with more nesting than I like:

    A         G
   / \         \
  B   C         H
     / \
    D   E
         \
          F

Could be encoded as:
A: (parent is absent)
B: parent: 0 (this means look back at the previous node, which is the parent)
C: parent: 0
D: parent: 1 (since this is not 0, it is not a continuation of the children under the first parent, and thus points at the previous node as its parent)
E: parent: 1
F: parent: 0 (since this is not 1, the previous node is the parent)
G: (parent is absent)
H: parent: 0

So, nesting could be accomplished with a single optional bool rather than a uint.
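
A small sketch of decoding that sequence back into a tree, following the rules above literally (the node objects and field names are just illustrative):

    // nodes in depth-first order; `parent` is the optional 0/1 level bit
    function decodeParentBits(nodes) {
      const roots = [];
      let prev = null; // previously decoded node
      let prevBit;     // its parent bit (undefined for roots)
      for (const n of nodes) {
        const node = { name: n.name, children: [] };
        if (n.parent === undefined) {
          roots.push(node);            // absent: a new root
          node.up = null;
        } else if (prevBit === undefined || n.parent !== prevBit) {
          prev.children.push(node);    // bit changed: the previous node is the parent
          node.up = prev;
        } else {
          prev.up.children.push(node); // bit unchanged: sibling of the previous node
          node.up = prev.up;
        }
        prev = node;
        prevBit = n.parent;
      }
      return roots;
    }

    // the example above: A..H with the parent bits listed
    decodeParentBits([
      { name: 'A' }, { name: 'B', parent: 0 }, { name: 'C', parent: 0 },
      { name: 'D', parent: 1 }, { name: 'E', parent: 1 }, { name: 'F', parent: 0 },
      { name: 'G' }, { name: 'H', parent: 0 }
    ]);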

@brendan-ward nice approach, but considering that a boolean takes the same number of bytes as a small varint, I'd go with something more explicit.

@mourner explicit is almost always better. I thought we were trying to cover streaming issues for large datasets, so I figured that a boolean would end up being smaller in the buffer in the long run. But as a total percentage of the overall buffer size, either is likely to be tiny, so this is sort of irrelevant. It depends a lot on the number of features and the ratio of parents to children.

Were you thinking of storing an index of features by order in the buffer during parsing, then looking back into that to retrieve parents? That was the issue I was struggling with and tried to deal with in the above approach; if you encode depth-first, then you only ever need to look back at the previous node or parent - though recursion is involved... I was thinking it would lend itself to emitting the stream of data from the buffer as well, so that you don't have to hold onto things in memory beyond the previous node or parent until you start recursing. When encountering G in the example above, because there is no parent, the entire set of A-F could be written to the output stream at that point.

But - I'm not the one to argue for complexity when simple & explicit will get it done.

@brendan-ward I think we can have the same wins with a more explicit encoding of nesting. Instead of recursion, we can just keep a stack of parents, which would always be tiny since there are no more than 1-2 levels of nesting, and we wouldn't have to store all nesting fields in memory. When we encounter a feature that has a different parent reference than the last one in the stack, we pop the stack and can output the corresponding object to the stream.

@brendan-ward attempting to implement a proof of concept now, and came up with a much simpler approach. Basically, we assume two limitations:

  • geometry collections can't contain other geometry collections (because wtf)
  • objects are encoded in the same order they flow in GeoJSON

Then we can derive the structure by keeping just 3 references: the last parsed feature collection, the last parsed feature, and the last parsed geometry collection. Specifically:

  • if we encounter a feature, then our last parsed feature collection will be the parent if it's present
  • If we encounter a geometry, then the parent is either the last parsed geometry collection or the last parsed feature. We solve the ambiguity by resetting the geometry collection reference when a feature is encountered.

The limitations are sensible, the approach should be really easy to implement, and more importantly the performance will be stellar: no need to keep stacks, etc.
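
A rough sketch of that bookkeeping (not the actual implementation; the object shapes are assumed):

    // objects arrive flattened, in the same order they appear in the GeoJSON
    function assignParents(objects) {
      let lastFeatureCollection = null;
      let lastFeature = null;
      let lastGeometryCollection = null;
      for (const obj of objects) {
        if (obj.type === 'FeatureCollection') {
          lastFeatureCollection = obj;
        } else if (obj.type === 'Feature') {
          obj.parent = lastFeatureCollection; // null for a standalone Feature
          lastFeature = obj;
          lastGeometryCollection = null;      // reset: resolves the ambiguity below
        } else if (obj.type === 'GeometryCollection') {
          obj.parent = lastFeature;
          lastGeometryCollection = obj;
        } else {
          // plain geometry: child of an "open" geometry collection if there is one,
          // otherwise of the last parsed feature
          obj.parent = lastGeometryCollection || lastFeature;
        }
      }
      return objects;
    }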

@mourner sounds very reasonable to me! 👍

Work in progress is happening in #51. I think the new schema is pretty cool: it's much simpler than what we had before while achieving the same compression. After the revision is done, I'll make a separate repo geobuf-stream for a streaming implementation.

Hey there,

Almost 2 years later, is there any workaround / temporary solution to use geobuf as a stream right now? Even a beta implementation that I can test? I was hoping that this issue had been "forgotten" and addressed by an implementation somewhere, but I couldn't find it, so just asking :D

@cyrilchapon for now, a good workaround is to simply encode each feature separately and then do a length-prefixed stream, e.g. with https://github.com/mafintosh/length-prefixed-stream
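
An untested sketch of that workaround, assuming the geobuf, pbf, and length-prefixed-stream packages:

    const geobuf = require('geobuf');
    const Pbf = require('pbf');
    const lps = require('length-prefixed-stream');

    // writing: each Feature becomes its own Geobuf message, framed by a length prefix
    const encoder = lps.encode();
    encoder.pipe(process.stdout); // or an HTTP response, a file stream, ...

    function writeFeature(feature) {
      encoder.write(Buffer.from(geobuf.encode(feature, new Pbf())));
    }

    // reading: pipe the framed bytes into a decoder and get one Feature per message
    const decoder = lps.decode();
    // someInputStream.pipe(decoder);
    decoder.on('data', (buf) => {
      const feature = geobuf.decode(new Pbf(buf));
      // ...handle the feature
    });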

Hi @mourner , thanks :)

Hmm, sounds promising, but way above my skills.

I'm trying to find an alternative to a dynamic, massive, GeoJSON download.

I'm currently streaming from the database and piping to an HTTP response like this:

      let geojsonStream = jsonStream.stringify(
        '{"type":"FeatureCollection","features":[',
        ',',
        ']}'
      );

      //then later
      coordinatesDatabaseStream
      .pipe(coordinateToFeature)
      .pipe(geojsonStream)
      .pipe(res);

I had in mind that I could just trade my GeoJSON encoding (through json-stream) for a protobuf write stream (geobuf style), and just pipe that stream to my response:

      coordinatesDatabaseStream
      .pipe(coordinateToFeature)
      .pipe(geobuf.encodeAsStream)//<- the part I'm asking here
      .pipe(res);

Any chance you could provide an example, or an idea of how to implement what you were suggesting, for this use case?
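
Not speaking for the maintainers, but here is a sketch of how the pipeline above could be adapted to the length-prefixed workaround; encodeFeatureStream is a hypothetical helper, not a geobuf API:

    const { Transform } = require('stream');
    const geobuf = require('geobuf');
    const Pbf = require('pbf');
    const lps = require('length-prefixed-stream');

    // turns a stream of Feature objects into a stream of Geobuf buffers
    function encodeFeatureStream() {
      return new Transform({
        objectMode: true, // keep each encoded feature as a discrete chunk
        transform(feature, encoding, callback) {
          callback(null, Buffer.from(geobuf.encode(feature, new Pbf())));
        }
      });
    }

    // coordinatesDatabaseStream
    //   .pipe(coordinateToFeature)
    //   .pipe(encodeFeatureStream()) // Feature objects -> Geobuf buffers
    //   .pipe(lps.encode())          // add length prefixes so the reader can split them
    //   .pipe(res);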

I'm coming to this extremely late, but a couple of comments:

Don't support nested geometries

@mourner wrote:

GeoJSON is arbitrarily nested: a FeatureCollection can contain a Feature which contains a GeometryCollection which contains other geometries, some of which might be GeometryCollections as well, etc.

The specification says:

To maximize interoperability, implementations SHOULD avoid nested GeometryCollections. Furthermore, GeometryCollections composed of a single part or a number of parts of a single type SHOULD be avoided when that single part or a single object of multipart type (MultiPoint, MultiLineString, or MultiPolygon) could be used instead.

Furthermore, a FeatureCollection can't contain other FeatureCollections.

(Feature "properties" objects themselves can be arbitrarily nested, I guess, but I don't think that causes a problem here?)

So it's entirely reasonable for the Geobuf format to avoid arbitrarily nested geometry files, which probably never occur in reality. (I'm not sure I've ever seen even a single GeometryCollection, let alone a nested one.)

Consider a geobuf equivalent to newline-delimited GeoJSON?

I'm a bit in love with newline-delimited GeoJSON, which basically removes the outer FeatureCollection layer, and puts one feature per line. Perhaps there's a GeoBuf equivalent, as hinted above (removing the outer Data layer, although possibly requiring some new header).

(Hmm, I just realised there is #51 so all of this is probably very redundant now anyway :))