anorth/go-dar

CID handling/round-tripping questions

Opened this issue · 3 comments

raulk commented

The DAR format elides the CID of each block and mandates depth-first ordering to guarantee that every linked CID will have been processed/seen previously (assuming non-sparse graphs). Blocks are expressed as [length, data] tuples. An optional index { cid: offset } is envisioned but not yet implemented.
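For concreteness, my reading of the block framing is roughly the following sketch (the varint length encoding, the helper name, and the index shape are illustrative assumptions, not something the format pins down):

```go
package dar

import (
    "encoding/binary"
    "io"

    "github.com/ipfs/go-cid"
)

// writeBlock illustrates the framing as I understand it: each block is written
// as a length followed by the raw data, with the CID itself elided. The
// optional index would map CIDs to the offset at which each block starts.
// (The varint length encoding and the index shape are assumptions, not spec.)
func writeBlock(w io.Writer, offset *uint64, c cid.Cid, data []byte, index map[cid.Cid]uint64) error {
    if index != nil {
        index[c] = *offset
    }
    var buf [binary.MaxVarintLen64]byte
    n := binary.PutUvarint(buf[:], uint64(len(data)))
    if _, err := w.Write(buf[:n]); err != nil {
        return err
    }
    if _, err := w.Write(data); err != nil {
        return err
    }
    *offset += uint64(n + len(data))
    return nil
}
```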

Assuming no index is provided, it's not clear to me how this format would preserve the original CID of each block for full round-tripping. If I have block A, which I refer to by CID(A), and I write this block to a DAR file, how would a reader derive CID(A) from the data alone (assuming this block is neither a root nor present in an index)?

Furthermore, links to other blocks get converted to file offsets, which further complicates CID calculation. We'd need to populate the full graph and then "unwind" it from the leaves to calculate the CIDs of intermediate blocks that contain links to other blocks.

raulk commented

I think reading a graph out of a DAR has not been implemented yet, which may be why this potential shortcoming is still hiding?

Thanks for this question. You're right that my implementation work paused (at the end of a hackathon) just at the point of testing DAG extraction beyond the roots.

It's important to be clear about the concrete use case here. If writing a block that is not a root, then it must be the child of a node previously written to the archive. Without an index, the usual way one would get to it is by traversing through that parent node, which contains the full CID of the child. When traversing, these CIDs would be pushed onto a stack, so the usual case is that the top CID on the stack corresponds to the next block to be read. The CID prefix specifies the codec and hash function, which should allow validation of the block. If all DAGs were complete, I believe this would be sufficient; let me know if you disagree at this point.
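To make that concrete, here is a rough sketch of the read path I have in mind (decodeLinks is a stand-in for codec-specific link extraction, the framing matches the writer sketch above, and it ignores omitted blocks and de-duplication offsets for now):

```go
package dar

import (
    "bufio"
    "encoding/binary"
    "fmt"
    "io"

    "github.com/ipfs/go-cid"
)

// readDAG sketches the depth-first read path: the stack is seeded with the
// roots, each block is validated against the CID we expect to see next, and
// the block's child links are pushed so they become the next expected CIDs.
// It assumes a complete (non-sparse) DAG with no de-duplication offsets.
func readDAG(r *bufio.Reader, roots []cid.Cid) error {
    stack := append([]cid.Cid(nil), roots...)
    for len(stack) > 0 {
        expected := stack[len(stack)-1]
        stack = stack[:len(stack)-1]

        length, err := binary.ReadUvarint(r)
        if err != nil {
            return err
        }
        data := make([]byte, length)
        if _, err := io.ReadFull(r, data); err != nil {
            return err
        }

        // The expected CID's prefix names the hash function, so the block
        // can be validated without any CID being stored in the archive.
        got, err := expected.Prefix().Sum(data)
        if err != nil {
            return err
        }
        if !got.Equals(expected) {
            return fmt.Errorf("block does not match expected CID %s", expected)
        }

        // decodeLinks is a stand-in for decoding the block with the codec
        // named in the prefix and extracting its child CIDs, in order.
        links, err := decodeLinks(expected.Prefix().Codec, data)
        if err != nil {
            return err
        }
        // Push in reverse so the first link ends up on top of the stack,
        // matching the depth-first writing order.
        for i := len(links) - 1; i >= 0; i-- {
            stack = append(stack, links[i])
        }
    }
    return nil
}

// decodeLinks stands in for codec-specific decoding (e.g. dag-cbor).
func decodeLinks(codec uint64, data []byte) ([]cid.Cid, error) {
    return nil, fmt.Errorf("codec %d decoding not implemented in this sketch", codec)
}
```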

(You could get to an arbitrary block by skipping over the length bytes of previous blocks, but then you would be unable to validate the content. It's assumed that a use case that is randomly accessing blocks without deserializing the parents is using an index, either in the archive or external, and so knows the expected CID and hence encoding scheme. It's an explicit design trade-off here.)

However, in the case of a DAG with omitted blocks (like Filecoin sector CIDs), the desired CID will be further down the stack. So validation would entail decoding with the top CID, then, if that fails, popping and trying again with the next one down the stack. If the CIDs have different schemes, attempting a new decoding every time does seem undesirable, and is an oversight in the initial design.

Note that, if the CID prefix were somehow known a priori, this stack popping is how I expect the DAG traversal to work: compute the CID for the block, compare it with the top of the stack, and proceed down the stack until a match is found. If the stack is emptied, either the block or the depth-first structure is invalid. I considered merely comparing CIDs for equality to be of negligible cost, but I agree that decoding the block every time to compute a new CID would not be.
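In Go, that pop-until-match step might look something like this (a sketch, assuming the block's CID has already been computed once):

```go
package dar

import (
    "fmt"

    "github.com/ipfs/go-cid"
)

// matchBlock pops expected CIDs until one matches the CID computed for the
// block just read. CIDs skipped over correspond to blocks deliberately
// omitted from the archive (e.g. Filecoin sector CIDs).
func matchBlock(stack []cid.Cid, got cid.Cid) ([]cid.Cid, error) {
    for len(stack) > 0 {
        expected := stack[len(stack)-1]
        stack = stack[:len(stack)-1]
        if got.Equals(expected) {
            return stack, nil
        }
    }
    return nil, fmt.Errorf("block %s matches no pending CID: invalid block or depth-first structure", got)
}
```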

In practice, I would think that the vast majority of DAGs have nodes with a single scheme, or occasionally two, so the redundant encoding effort might be close to zero. If you agree that the DAG can be validated in principle by working down the stack, I think it's worth considering not solving for this.

But it would seem more robust to fix the problem. The most straightforward way would be to add the CID prefix bytes to the block preamble, so [prefix, length, data].
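That change might look roughly like this (Prefix().Bytes() is go-cid's binary encoding of the version/codec/multihash information; the framing otherwise matches the sketch above):

```go
package dar

import (
    "encoding/binary"
    "io"

    "github.com/ipfs/go-cid"
)

// writeBlockWithPrefix sketches the amended preamble: the CID prefix bytes
// are written before the existing varint length, so a reader can recompute
// and verify the block's CID without consulting a parent or an index.
func writeBlockWithPrefix(w io.Writer, c cid.Cid, data []byte) error {
    if _, err := w.Write(c.Prefix().Bytes()); err != nil {
        return err
    }
    var buf [binary.MaxVarintLen64]byte
    n := binary.PutUvarint(buf[:], uint64(len(data)))
    if _, err := w.Write(buf[:n]); err != nil {
        return err
    }
    _, err := w.Write(data)
    return err
}
```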

In another forum, you suggested this could be optimized to a lookup table in the header, plus a 1-byte reference into that table in each block preamble. One challenge with this is that, in order to stream output, the complete header content needs to be known up front, before writing any blocks; the table may not be efficiently knowable in advance. The table could go in a trailer instead, but then blocks couldn't be validated during a streaming read.

Since we only need 2 bytes of the prefix anyway (multicodec and multihash; multibase isn't needed), I don't think it's worth trying to optimize even without those challenges. An extra byte or two per block is a small cost. We could also optimize DAGs where all nodes share the same prefix, but again I don't think it's worth the complexity.

links to other blocks get converted to file offsets which further complicates CID calculation

This might be a misunderstanding. The links to other blocks are always represented in full. If two different parents link to the same child, that child's CID will appear in the archive twice, inside those parents. But in place of the second copy of the child block data is an offset to the copy already present earlier in the archive.
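Concretely, the write side does something like the following (reusing the writeBlock framing helper from the sketch near the top; the helper names and the exact back-reference encoding are illustrative rather than what the code does verbatim):

```go
package dar

import (
    "io"

    "github.com/ipfs/go-cid"
)

// writeChild sketches the de-duplication described above. The child's CID is
// always serialised in full inside its parent; only a repeated child's data
// is replaced with a reference to the offset of the copy written earlier.
func writeChild(w io.Writer, offset *uint64, c cid.Cid, data []byte, seen map[cid.Cid]uint64) error {
    if prev, ok := seen[c]; ok {
        // Later occurrence: emit a back-reference to the earlier copy
        // instead of repeating the block data.
        return writeOffsetRef(w, offset, prev)
    }
    seen[c] = *offset
    // First occurrence: write the block data itself, length-prefixed,
    // using the framing from the earlier sketch.
    return writeBlock(w, offset, c, data, nil)
}

// writeOffsetRef stands in for however the archive encodes "this block's data
// lives at an earlier offset" in place of a second copy of the data.
func writeOffsetRef(w io.Writer, offset *uint64, target uint64) error {
    // Encoding deliberately left out of this sketch.
    return nil
}
```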

I did consider trying to optimize out those duplicated CIDs, but rejected it. The DAG should be certifiable, incrementally, while traversing top-down.