asdf-format/asdf-standard

Document that the block magic sequence is invalid UTF-8

eslavich opened this issue · 1 comment

Unless I misunderstand the YAML spec's section on characters, all the bytes in our current block identifier sequence are valid in a YAML document:

d3 42 4c 4b

If this is true, then should we consider changing one of these characters to be outside of the YAML valid set? Doing so would allow us to seek through the ASDF file to find the first block without first parsing the YAML section.

The ASDF Standard requires that the tree be encoded in UTF-8:

ASDF is a hybrid text and binary format. The header, tree and block index are text, (specifically, in UTF-8 with DOS or UNIX-style newlines), while the blocks are raw binary.

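and the block identifier sequence is in fact invalid UTF-8: 0xD3 is the lead byte of a two-byte sequence and must be followed by a byte in the range 80..BF, but here it is followed by 0x42 (see Table 3-7, "Well-Formed UTF-8 Byte Sequences", in the Unicode Standard).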
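
As a quick sanity check of that claim, here is a minimal sketch that asks Python's strict UTF-8 decoder about the four bytes. The names `BLOCK_MAGIC` and `is_valid_utf8` are illustrative only and are not taken from the ASDF Standard or any ASDF library.

```python
# The block magic bytes quoted above: d3 42 4c 4b
BLOCK_MAGIC = bytes([0xD3, 0x42, 0x4C, 0x4B])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xD3 followed by 0x42 is rejected: 0x42 is not a continuation byte.
print(is_valid_utf8(BLOCK_MAGIC))

# 0xD3 followed by a byte in 80..BF is a well-formed two-byte sequence.
print(is_valid_utf8(b"\xd3\x80"))
```

Because 0xD3 can only appear in well-formed UTF-8 as a lead byte followed by a continuation byte, the pair d3 42 should never occur inside a tree that is valid UTF-8.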

So it should be possible to seek to the first block by looking for this sequence, but maybe we need to better document that fact. I'll change the title of this issue accordingly.
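
To illustrate the seeking idea, the following is a rough sketch of scanning a stream for the first occurrence of the magic sequence without parsing the YAML. It assumes the header and tree really are valid UTF-8, as the Standard requires, and that the stream is read from its beginning. The function name `find_first_block` and the `chunk_size` parameter are hypothetical and do not come from any existing ASDF implementation.

```python
import io

BLOCK_MAGIC = b"\xd3BLK"  # d3 42 4c 4b


def find_first_block(stream, chunk_size=65536):
    """Return the byte offset of the first BLOCK_MAGIC in `stream`, or -1.

    Reads in chunks and keeps a short tail between reads so that a magic
    sequence split across a chunk boundary is still found.
    """
    overlap = len(BLOCK_MAGIC) - 1
    offset = 0  # absolute offset of the first byte currently held in `buf`
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        buf += chunk
        idx = buf.find(BLOCK_MAGIC)
        if idx != -1:
            return offset + idx
        # Keep only the last few bytes; they may start a split magic sequence.
        keep = buf[-overlap:]
        offset += len(buf) - len(keep)
        buf = keep


# Usage with an in-memory stand-in for a file: some text, then a fake block.
text = b"#ASDF 1.0.0\n%YAML 1.1\n--- \nfoo: bar\n...\n"
sample = text + BLOCK_MAGIC + b"\x00" * 8
print(find_first_block(io.BytesIO(sample)) == len(text))
```

Only the first match is meaningful here: once inside the binary blocks, the same four bytes could appear by chance in block data.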