/protobag

Primary LanguageC++Apache License 2.0Apache-2.0

Protobag: A bag o' Serialized Protobuf Messages

With built-in support for time-series data

Build Status

Quickstart & Demo

See this python noteboook for a demo of key features.

Or you can drop into a Protobag development shell using a clone of this repo and Docker; FMI see:

./pb-dev --help

Summary

Protobuf is a popular data serialization library for C++, Python, and several other languages. In Protobuf, you can easily define a message type, create a message instance, and then serialize that message to a string or file.

But what if you want to store multiple messages in one "blob"? You could simply use repeated and create one giant message, but perhaps you don't in general have the RAM for this approach. Well then, you could append multiple messages into one big file, and delimit the boundaries of each message using the number of bytes in the message itself. Then you'd have something that looks exactly like the infamous TFRecords format, which is somewhat performant for whole-file streaming reads, and has a very long list of downsides. For example, you can't even seek-to-message in a TFRecords file, and you either need a large depenency (tensorflow) or some very tricky custom code to even just do one pass over the file to count the number of messages in it. A substantially better solution is to simply create a tar archive of string-serialized Protobuf messages-- enter Protobag.

A Protobag file is simply an archive (e.g. a Zip or Tar file, or even just a directory) with files that are string-serialized Protobuf messages. You can create a protobag, throw away the Protobag library itself, and still have usable data. But maybe you'll want to keep the Protobag library around for the suite of tools it offers:

  • Protobag provides the "glue" needed to interface Protobuf with the fileystem and/or an archive library, and Protobag strives to be fully cross-platform (in particular supporting deployment to iOS).
  • Protobag optionally indexes your messages and retains message Descriptors (employing the Protobuf "self-describing message" technique) so that readers of your Protobags need not have your Protobuf message definitions. One consequence is that, with this index, you can convert any protobag to a bunch of JSONs.
  • Protobag includes features for time-series data and offers a "(topic/channel) - time" interface to data similar to those offered in ROS and LCM, respectively.

Batteries Included

Protobag uses libarchive as an archive back-end to interoperate with zip, tar, and other archive formats. We chose libarchive because it's highly portable and has minimal dependencies-- just libz for zip and nothing for tar. Protobag also includes vanilla DirectoryArchive and MemoryArchive back-ends for testing and adhoc use.

If you want a simple "zip and unzip" utility, Protobag also includes those as wrappers over libarchive. See ArchiveUtil.

Development

Discussion of Key Features

Protobag indexes Protobuf message Descriptors

By default, protobag not only saves those messages but also indexes Protobuf message descriptors so that your protobag readers don't need your proto schemas to decode your messages.

Wat?

In order to deserialize a Protobuf message, typically you need protoc-generated code for that message type (and you need protoc-generated code for your specific programming language). This protoc-generated code is engineered for efficiency and provides a clean API for accessing message attributes. But what if you don't have that protoc-generated code? Or you don't even have the .proto message definitions to generate such code?

In Protobuf version 3.x, the authors added official support for the self-describing message paradigm.
Now a user can serialize not just a message but Protobuf Descriptor data that describes the message schema and enables deserialzing the message without protoc-generated code-- all you need is the protobuf library itself.
(This is a core feature of other serialization libraries like Avro).

Note: dynamic message decoding is slower than using protoc-generated code.
Furthermore, the protoc-generated code makes defensive programming a bit easier. You probably want to use the protoc-generated code for your messages if you can.

Protobag enables all messages to be self-describing messages

While Protobuf includes tools for using self-describing messages, the feature isn't simply a toggle in your .proto file, and the API is a bit complicated (because Google claims they don't use it much internally).

protobag automatically indexes the Protobuf Descriptor data for your messages at write time. (And you can disable this indexing if so desired). At read time, protobag automatically uses this indexed Descriptor data if the user reading your protobag file lacks the needed protoc-generated code to deserialize a message.

What if a message type evolves? protobag indexes each distinct message type for each write session. If you change your schema for a message type between write sessions, protobag will have indexed both schemas and will use the proper one for dynamic deserialization.

For More Detail

For Python, see:

  • protobag.build_fds_for_msg() -- This method collects the descriptor data needed for any Protobuf Message instance or class.
  • protobag.DynamicMessageFactory::dynamic_decode() -- This method uses standard Protobuf APIs to deserialize messages given only Protobuf Descriptor data.

For C++, see:

  • BagIndexBuilder::DescriptorIndexer::Observe() -- This method collects the descriptor data needed for any Protobuf Message instance or class.
  • DynamicMsgFactory -- This utility uses uses standard Protobuf APIs to deserialize messages given only Protobuf Descriptor data.

Cocoa Pods

You can integrate Protobag into an iOS or OSX application using the CocoaPod ProtobagCocoa.podspec.json podspec included in this repo. Protobag is explicitly designed to be cross-platform (and utilize only C++ features friendly to iOS) to facilitate such interoperability.

Note: before pushing, be sure to edit the "version" field of the ProtobagCocoa.podspec.json file to match the version you're pushing.

 pod repo push  SCCocoaPods ProtobagCocoa.podspec.json  --use-libraries --verbose --allow-warnings

C++ Build

Use the existing CMake-based build system.

In c++ subdir:

mkdir build && cd build
cmake ..
make -j
make test

Python Build

The Python library includes a wheel that leverages the above C++ CMake build system.

In python subdir:

python3 setup.py bdist_wheel