google/neuroglancer

precomputed annotation format related ID lookup

Closed this issue · 2 comments

I'm wondering if there is a path forward to allow users to select segments based on interactions with precomputed annotations.

The present precomputed annotation format doesn't store the relationships with the annotation data, so it's hard for the UI to allow users to identify and select related segments based on interacting with the annotations. The related ID index allows for an efficient way to query all the annotations associated with a related segment, but not which related segments are associated with a given annotation. This presently works for local annotations.

The simplest solution I think would be to simply encode the relationships along with the other properties in a "v2" format. I don't know if you have other/better ideas. The present v1 approach is highly duplicative of data, but it has the advantage of a fixed number of bytes per annotation, and encoding relationships inline would break that convention.
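
To illustrate the fixed-length point: in the current multi-annotation encodings every record (geometry plus declared properties) occupies the same number of bytes, so the offset of any record can be computed directly, and inlining per-annotation relationship lists would make records variable-length. A rough sketch, where the point-plus-one-property layout is hypothetical and only meant to show the trade-off:

```python
import struct

# Hypothetical fixed-size record: a point annotation (x, y, z as float32)
# plus a single float32 property. Every record is the same size, so the
# i-th record starts at a directly computable offset.
POINT_RECORD = struct.Struct("<3f f")

def record_offset(i: int, header_size: int = 8) -> int:
    """Byte offset of record i in a chunk with a fixed-size header."""
    return header_size + i * POINT_RECORD.size

# If each record also carried [uint32 count][count x uint64 segment ids],
# record sizes would vary, and offsets could no longer be computed without
# scanning the chunk or adding an explicit offset table.
```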

jbms commented

Yes, currently you have to make a separate request to the by_id index per annotation in order to retrieve the list of all related segments, because only the by_id index stores that information. It would be reasonable to store the relationship data in the other indices if there were a use for it, but that would indeed be a format change.
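
For concreteness, here is a minimal sketch of extracting the related segment IDs from a single by_id entry, assuming the layout I understand that index to use: geometry, then property values (including any alignment padding), then for each declared relationship a uint32le count followed by that many uint64le segment IDs. The geometry/property sizes depend on the annotation type and declared properties, so they are passed in by the caller:

```python
import struct

def decode_related_segments(entry: bytes,
                            geometry_bytes: int,
                            padded_property_bytes: int,
                            num_relationships: int) -> list[list[int]]:
    """Extract per-relationship segment ID lists from one by_id entry.

    Assumed layout (see above): geometry, then properties (with padding),
    then for each relationship a uint32le count followed by `count`
    uint64le segment IDs.
    """
    offset = geometry_bytes + padded_property_bytes
    relationships = []
    for _ in range(num_relationships):
        (count,) = struct.unpack_from("<I", entry, offset)
        offset += 4
        ids = struct.unpack_from(f"<{count}Q", entry, offset)
        offset += 8 * count
        relationships.append(list(ids))
    return relationships
```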

Regarding a v2 format, there are a few thoughts I had on that:

  • It may make sense to use Parquet or a similar Arrow-related format for encoding each chunk rather than the custom binary format currently used (see the sketch after this list), but I have not investigated that too much. I don't think Parquet is particularly suitable for representing an entire index, but I could be mistaken.
  • It would be nice to allow indices to be defined on arbitrary (ordered) combinations of geometry, relationships, and properties.
  • It would be nice if there were an existing database format that could be leveraged (i.e. one designed to be read directly over high-latency storage without a server, and written via a batch process, also without a server), but unfortunately I don't think there is.
  • OCDBT could be used in place of the precomputed sharded format, as that would allow ordered indices over arbitrary strings rather than just hash indices over uint64 values.
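
On the Parquet point above, here is a minimal sketch of what one annotation chunk could look like written with pyarrow; the column names, the single "score" property, and the single "segments" relationship are made up for illustration. A list-valued column handles variable-length relationship lists naturally, which is exactly what the current fixed-size-record encoding cannot do:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical layout for one spatial chunk of point annotations.
chunk = pa.table({
    "annotation_id": pa.array([101, 102, 103], type=pa.uint64()),
    "x": pa.array([10.5, 20.0, 30.25], type=pa.float32()),
    "y": pa.array([11.0, 21.5, 31.0], type=pa.float32()),
    "z": pa.array([12.0, 22.0, 32.5], type=pa.float32()),
    # Example annotation property.
    "score": pa.array([0.9, 0.4, 0.7], type=pa.float32()),
    # Example relationship: variable-length list of related segment IDs.
    "segments": pa.array([[7, 8], [9], []], type=pa.list_(pa.uint64())),
})

pq.write_table(chunk, "chunk_0_0_0.parquet")

# A reader answering "which segments relate to this annotation?" only needs
# to load two columns, which Parquet supports directly.
loaded = pq.read_table("chunk_0_0_0.parquet",
                       columns=["annotation_id", "segments"])
```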

I realized that part of my confusion was that I had a bug in #522 that was not writing the by_id index correctly, so the related segments were not showing up for my layers. I've fixed that; thank you for the clarification, which helped me find the bug. Your comments on a v2 format make sense, and I wholeheartedly agree. We've been discussing what format to write old versions of materialized data from CAVE to, and many of these same issues came up in that discussion.