beacon-biosignals/OndaFormat

explore usage of Arrow files instead of msgpack.zst

jrevels opened this issue · 8 comments

Now that Arrow has hit v1.0 and there's a nice Julia package for it, it seems worth exploring a switch.

I would recommend calling the disk format Parquet, because that's the actual disk format. Arrow itself is an abstraction that can work with various disk formats, including the now-obsolete Feather as well as Parquet. The implementations in R and Python make this distinction clearer.

I think they are different (arrow-on-disk vs parquet), unless https://stackoverflow.com/a/56481636 is out of date?

Ugh, this explains part of the mess I've had in roundtripping output from Arrow.jl to any of the other languages. I've had essentially no problem moving data between Python and R, but moving between either of those and Julia has been a source of frustration.

But the main takeaway is the same: Arrow is primarily an in-memory format/protocol and so any talk of "Arrow files" needs to be really explicit about what's going on.

Also, things without native types in other languages can wind up really messed up, e.g. not all languages have unsigned types. There's an obvious fix in the readers for those languages, but I suspect there will be a few lingering compatibility issues (although that may be less of an issue for Onda).

any talk of "Arrow files" needs to be really explicit about what's going on.

To be clear, whenever I refer to "Arrow files" I mean https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format.
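For concreteness, here's a minimal Arrow.jl sketch of writing and reading such an IPC file (the table contents here are made up, just to show the calls):

```julia
using Arrow

table = (a = [1, 2, 3], b = ["x", "y", "z"])  # any Tables.jl-compatible table works

# writing to a file path produces the IPC file format linked above
Arrow.write("example.arrow", table)

# Arrow.Table reads it back; columns are accessible as properties
tbl = Arrow.Table("example.arrow")
tbl.a
```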

I did a super naive little experiment to make sure that basic Onda metadata was (de)serializable via Arrow.jl; you can find it here (note that I'm also playing with some of the other v0.5.0 changes there). So it seems like switching to Arrow.jl is definitely doable, and it'd even yield slight performance improvements.
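A sketch of the kind of round trip involved (made-up field names; this is not the linked experiment itself):

```julia
using Arrow, Tables, UUIDs

# hypothetical record-like rows standing in for Onda recording metadata
recordings = [(uuid = uuid4(),
               signals = [(kind = "eeg", sample_rate = 256.0)],
               annotations = [(value = "spike", start = 1.5, stop = 2.5)])]

Arrow.write("recordings.arrow", recordings)
roundtripped = Arrow.Table("recordings.arrow")
first(Tables.rows(roundtripped)).signals  # nested fields survive the round trip
```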

It'd be nice if Onda's metadata was more columnar in nature - when the data is amenable, migrating from fewer columns/more nesting to more columns/less nesting is generally a win from an accessibility perspective (since it makes it easier to naively express finer-grained accesses). Unfortunately, it really is the case that most of the Onda metadata is pretty "record-like" in that you really do want to read most of a row pretty often. I guess it'd be worth breaking up annotations and signals into separate columns, at least...I'll play around with it.

The only road bump I really encountered was apache/arrow-julia#75, though it's pretty workaroundable; we could just go back to Vector{Annotations}, for example. Tangentially, I've also been debating whether we'd want to switch to something like Dict{UUID,Annotation} anyway. Internally at Beacon, we canonically associate a UUID with each of our annotations, which has been globally useful enough that it might be worthwhile to actually bake into Onda itself.

EDIT: After thinking about the above, I believe the fields most likely to be individually predicated against (e.g. in reductions/searches) are signal names, channel names, and annotation values. Sample rates and time spans come next, and the least likely thing to be individually predicated against is plain encoding information (which will likely just be used when sample data is loaded). In many ways, it'd be really, really nice to have annotations in their own table and replace the current "value" field with "the dataset author can include whatever additional columns they want alongside the recording UUID and time span columns." However, I'm not sure the complexity of having multiple files is really worthwhile. I wonder if I can just store two tables alongside each other in a single Arrow file... I keep going back and forth between separate tables and just continuing with the nested struct option used in my example lol. IIRC, Arrow's struct layout technically enables similar field access optimizations either way, but it would be nice to keep things less nested if possible. Will keep playing around with it.
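To make the "annotations in their own table" idea concrete, a hypothetical flat annotations table (column names are made up; the point is that only the recording UUID and time span columns would be required, everything else is up to the dataset author):

```julia
using Arrow, UUIDs

annotations = (recording = [uuid4(), uuid4()],
               start_nanosecond = [0, 5_000_000_000],
               stop_nanosecond = [1_000_000_000, 6_000_000_000],
               # author-defined columns replacing the old generic `value` field
               label = ["spike", "artifact"],
               confidence = [0.9, 0.4])
Arrow.write("annotations.arrow", annotations)
```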

Okay - I've implemented two primary alternative approaches here that I'd like input on.

  1. The "nested"/"single-table" approach.. In this approach, we have a single table: recordings.arrow (each row is a single recording) with 3 columns (uuid, signals, annotations)

  2. The "flat"/"muli-table" approach. In this approach, we have two tables: signals.arrow (each row is a signal) and annotations.arrow (each row is an annotation).

Thoughts:

  • Both approaches are better than our current approach from a performance and tooling interop perspective.
    • both approaches allow signals/annotations to be deserialized separately
    • both approaches allow annotation values to have well-specified custom structures (not just strings!)
    • Arrow is simply a more featureful (and more rigorously specified, IMO) format than MsgPack, and is more likely to have "built-in support" in many common data engineering tools moving forward
  • Approach 2 is more normalized (in the relational sense) than Approach 1, with all the tradeoffs that entails:
    • easier to build arbitrary indices/partitions (but more annoying if you really did want the data indexed by recording in the first place)
    • fewer structurally enforced constraints (allows writes/appends to be more decoupled)
    • allows/forces annotations/signals to be downloaded separately
    • the previous points make this approach amenable to treating both signals and annotations as multi-file tabular datasets
  • Approach 1's index contains a nice row-cache-y optimization that will probably be beneficial for many common Onda workloads. I could (and would probably have to) build a similar indexing utility on top of Approach 2, but it might be slower/a tad more complicated. Probably worth trying out.
  • Approach 2 exposes SignalsTable/AnnotationsTable types that implement the Tables.jl interface (mainly via delegation to the underlying Arrow.Table objects) and could have interesting specializations defined on them in the future; a sketch of the delegation pattern follows this list
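The delegation pattern amounts to something like this (a sketch of the idea, not the actual implementation):

```julia
using Tables

# wrapper in the spirit of SignalsTable: implement the Tables.jl interface
# by delegating to whatever underlying table (e.g. an Arrow.Table) it wraps
struct SignalsTable{T}
    table::T
end

Tables.istable(::Type{<:SignalsTable}) = true
Tables.rowaccess(::Type{<:SignalsTable}) = true
Tables.rows(t::SignalsTable) = Tables.rows(t.table)
Tables.columnaccess(::Type{<:SignalsTable}) = true
Tables.columns(t::SignalsTable) = Tables.columns(t.table)
Tables.schema(t::SignalsTable) = Tables.schema(Tables.rows(t.table))
```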

I'm leaning towards implementing an Approach-1-like index on top of Approach 2, and seeing how much slower/more work it is. If the result is feasible/ergonomic, then it seems to me like Approach 2 is the way to go. Otherwise it'll be a tougher call...thoughts?

I'm leaning towards implementing an Approach-1-like index on top of Approach 2, and seeing how much slower/more work it is. If the result is feasible/ergonomic, then it seems to me like Approach 2 is the way to go. Otherwise it'll be a tougher call...thoughts?

Have been playing around with this all day, and now feel pretty satisfied with the implementation here 😎 I'm still a Tables.jl noob so some of those delegations might not be ideal (and the by_recording implementations could probably be more elegant lol), but the idea seems tenable at least.
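For anyone skimming, by_recording boils down to grouping rows by their recording UUID. A simplified sketch (not the linked implementation; it assumes the flat tables carry a recording column):

```julia
using Tables, UUIDs

# group rows of a flat signals/annotations table by their recording UUID,
# preserving whatever row type the underlying table produces
function by_recording_sketch(table)
    index = Dict{UUID,Vector{Any}}()
    for row in Tables.rows(table)
        push!(get!(() -> Any[], index, row.recording), row)
    end
    return index  # index[uuid] -> rows belonging to that recording
end
```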

I'm going to start test-driving this on Beacon-internal data later this week and will report back.

kolia commented

Just curious, any idea why the by_recording calls are ~5-10x faster in the alternative approach at the bottom?

I'm assuming that if we write using the streaming format, we can append to existing tables?
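Something like this, I think, assuming I have the Arrow.jl API right (writing to an IO rather than a file path produces the streaming format, and Arrow.append extends a stream-formatted file):

```julia
using Arrow

# write an initial batch in the IPC streaming format (IO target, not a path)
open("annotations.arrow", "w") do io
    Arrow.write(io, (recording = ["a"], value = ["spike"]))
end

# append another record batch to the existing stream-formatted file
Arrow.append("annotations.arrow", (recording = ["b"], value = ["blink"]))

Arrow.Table("annotations.arrow")  # reads all batches back as one table
```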

Just curious, any idea why the by_recording calls are ~5-10x faster in the alternative approach at the bottom?

Yup, making those faster was actually the goal that motivated the implementation change. The alternative approach wraps whatever row types are produced by Tables.rows on the underlying table, so e.g. "lazy" loading characteristics are preserved. In contrast, the old approach forcibly "materialized" rows into specific structs upon iteration.

The big difference here is that this alternative approach better respects caller/table-constructor decisions, so that the caller can use the right underlying (or overlying!) table type for their use case.
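Schematically, the contrast is something like this (a sketch, not the actual code):

```julia
using Tables

# old approach: eagerly materialize each row into a concrete struct
struct Signal
    kind::String
    sample_rate::Float64
end
materialize(table) = [Signal(r.kind, r.sample_rate) for r in Tables.rows(table)]

# alternative approach: wrap whatever row type `Tables.rows` yields, delegating
# property access so the underlying table's lazy loading is preserved
struct SignalRow{R}
    row::R
end
Base.getproperty(s::SignalRow, name::Symbol) = getproperty(getfield(s, :row), name)
lazy_rows(table) = (SignalRow(r) for r in Tables.rows(table))
```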