osirrc/ciff

Document gap compression

lintool opened this issue · 6 comments

We need to explicitly document that docids are gap-compressed, both in the README and in the protobuf definition (i.e., in comments).

Indeed, this is not clear from the protobuf definition.

This is a slightly odd one, because the gap compression only arises from the way the Lucene export is engineered. So I guess the question is: are we going to assume that any other system that wants to export a CIFF should also be doing delta compression? In that case, we should definitely document it with the CIFF/protobuf definition.

On the other hand, there's nothing inherently in the definition of the protobuf which makes it necessary to store deltas. Thoughts?

I think only the description should be updated. If systems are also allowed to export without storing deltas, a reader has to know how a given CIFF was constructed before reading it. It would be desirable to be consistent about how a CIFF should be constructed from an index.
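
To illustrate the ambiguity, here is a minimal Python sketch. The `interpret` helper and the sample values are hypothetical and not part of CIFF; the point is only that the same stored integers decode to different docids depending on which convention the writer used, and nothing in the values themselves tells a reader which one applies.

```python
from itertools import accumulate


def interpret(stored, gap_encoded):
    """Return absolute docids for a stored posting list.

    `stored` is the sequence of integers found in the docid field.
    `gap_encoded` is the convention the reader assumes (hypothetical flag,
    used here only for illustration).
    """
    if gap_encoded:
        # Each value is a delta from the previous docid; a running sum
        # recovers the absolute docids.
        return list(accumulate(stored))
    # Each value is already an absolute docid.
    return list(stored)


stored = [2, 5, 9]
as_absolute = interpret(stored, gap_encoded=False)  # [2, 5, 9]
as_gaps = interpret(stored, gap_encoded=True)       # [2, 7, 16]
```

Both readings are valid increasing docid lists, which is why the convention needs to be fixed in the documentation rather than inferred from the data.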

Jimmy's implementation of the Lucene index export adds in the delta gaps (this isn't related to the Lucene index itself). Assuming it's the de facto base, readers and writers have to be aware of d-gaps. All of our implementations now use d-gaps.
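
For anyone writing a reader or writer, a rough sketch of the d-gap round trip in Python follows. The function names are made up for illustration, and it assumes that the first entry of each posting list is stored as the docid itself (i.e., a gap from zero) and that docids within a list are strictly increasing; please check this against the exporter before relying on it.

```python
from itertools import accumulate


def to_dgaps(docids):
    """Convert strictly increasing absolute docids to d-gaps.

    Assumption: the first value is stored as-is (a gap from 0).
    """
    gaps = []
    prev = 0
    for docid in docids:
        gaps.append(docid - prev)
        prev = docid
    return gaps


def from_dgaps(gaps):
    """Recover absolute docids from d-gaps with a running sum."""
    return list(accumulate(gaps))


docids = [3, 7, 8, 15]
gaps = to_dgaps(docids)      # [3, 4, 1, 7]
restored = from_dgaps(gaps)  # [3, 7, 8, 15]
```

A writer would apply something like `to_dgaps` before filling the docid field of each posting, and a reader would apply something like `from_dgaps` after parsing it.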

Arguably the name "docid" in the Posting object definition is what is wrong: if we were always going to use d-gaps, the name should have been different. As suggested in the OP, it's documentation changes that are needed.

I've started a branch to work on some improved documentation: https://github.com/osirrc/ciff/tree/documentation

Please feel free to contribute.

Changes made.