Arrays that span multiple blocks
eslavich opened this issue · 15 comments
Isn't this basically supporting chunking?
Yes it is.
I guess the question becomes whether the chunks should be in separate blocks. Yes, if they are compressed; is that a common usage? I worry that a large number of chunks listed in the YAML will affect performance. I should see how zarr does it as an example, unless you already know.
I have not thought about the implementation much. My impression is that the current block format is already a bit complex (but still straightforward and easy to encode/decode), and adding a chunking layer on top of this would just increase complexity. My idea (at the time) was to define my own chunking mechanism (outside of the ASDF standard) on top of these blocks, and this would be done in YAML. A "chunked block" definition would describe the overall array shape/size, the chunking granularity, and have pointers to blocks.
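A rough sketch of what such a YAML-level chunking layer could look like, using the Python asdf package; the `chunked_array` key and its layout are hypothetical (not part of the ASDF standard), and each chunk is simply stored as an ordinary array, which the Python implementation writes out as its own binary block:

```python
import asdf
import numpy as np

# Hypothetical chunked-array description layered on top of ordinary blocks:
# the YAML tree records the full shape and the chunk shape, and points to the
# chunks, each of which is stored as a normal array (hence its own block).
full = np.arange(16, dtype="f8").reshape(4, 4)
chunks = [
    np.ascontiguousarray(full[i:i + 2, j:j + 2])  # copies, one block per chunk
    for i in (0, 2)
    for j in (0, 2)
]

tree = {
    "chunked_array": {        # hypothetical key, not an ASDF standard tag
        "shape": [4, 4],
        "chunk_shape": [2, 2],
        "chunks": chunks,
    }
}

asdf.AsdfFile(tree).write_to("chunked_example.asdf")
```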
Maybe chunked arrays could be stored as arrays where the entries are chunk ids? These chunk-id arrays could then themselves be stored either as YAML or as blocks.
Compression (or checksumming) seems like a good feature to have. zlib and bzip2 by themselves don't compress much, but there are compression algorithms that are adapted to the floating point representation and which perform much better.
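As one illustration, a sketch using numcodecs (the codec library zarr builds on); Blosc with bit-shuffling regroups the bits of the floating-point values before compressing, which typically does better than plain zlib on numeric data. The array contents here are just placeholder data:

```python
import numpy as np
from numcodecs import Blosc, Zlib

data = np.linspace(0.0, 1.0, 1_000_000).astype("f8")
raw = data.tobytes()

# Generic byte-stream compression for comparison.
zlib_size = len(Zlib(level=9).encode(raw))

# Blosc with bit-shuffling: reorders the bits of each float so similar bits
# (signs, exponents, high mantissa bits) end up adjacent before compression.
blosc = Blosc(cname="zstd", clevel=5, shuffle=Blosc.BITSHUFFLE)
blosc_size = len(blosc.encode(raw))

print(zlib_size, blosc_size)
```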
Given the fact that zarr supports compressed chunks, I don't think there is any sensible way of having the chunks within one or many ASDF binary blocks that would be efficient. Zarr handles it by putting each chunk in a separate file (or using a database to handle the chunks), and that appears to be the only practical solution given the unknowable size of chunks if they are writable. It would be possible to store all the chunks in an ASDF file so long as no compression is used, or the file is read-only (e.g., an archived file). But as a working file that can be updated, the zarr approach is the only practical one. I'm going to think a bit more about how we could leverage zarr efficiently. I don't think its approach precludes support in other languages.
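For reference, here is roughly the layout being described, using a zarr v2-style directory store (path and array parameters below are arbitrary); each compressed chunk lands in its own file, so individual chunks can be rewritten without touching the rest of the container:

```python
import numpy as np
import zarr
from numcodecs import Blosc

# zarr v2-style directory store: each compressed chunk is written as a
# separate file (e.g. data.zarr/0.0, data.zarr/0.1, ...).
store = zarr.DirectoryStore("data.zarr")
z = zarr.open(
    store,
    mode="w",
    shape=(4000, 4000),
    chunks=(1000, 1000),
    dtype="f8",
    compressor=Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE),
)

# Writing a slice only touches the chunk files that overlap it.
z[0:1000, 0:1000] = np.random.default_rng(0).random((1000, 1000))
```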
The use case that interests me and many of my collaborators is immutable files. These files are written once; when they are processed, they are left unchanged and additional files are created instead. That is, instead of treating a file as a mutable database, it is treated as a snapshot. Thus chunks that change in size or number (arrays that are written, resized, or newly added) are not relevant. However, pointers between files are quite convenient to have.
I don't know how large a fraction of the community would use ASDF the same way.
I think both ways are useful. Immutable data could be supported as well as saving data within the file. I'll see if I can come up with an outline for both approaches that can use zarr as the interface to the data.
@eschnett we've been discussing this quite a bit internally to come up with proposals for how to deal with these kinds of cases, and we are beginning to firm up our ideas. I'm going to post a proposal soon regarding how to handle extended kinds of compression, since chunking implementations will be layered on that.
This is a must-have for RST, given the size of its variables.
On a related performance note, block processing will be important (i.e., not loading full variables into memory for processing). I can make this a separate issue, if desired.
Block processing (i.e. traversing large arrays block-by-block, or traversing only a caller-specified subset of an array) is part of the API, not the data format. That is, isn't that a question for an implementation, not the standard?
It could be either. If using chunking, it would be the data format; otherwise, the API. That's one reason to understand which is being asked for (I tried to raise the issue in a previous comment, but wasn't able to at the time). API options are memory mapping or reading in a range of a block.
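A minimal sketch of both API options against a hypothetical uncompressed binary block (the file name, shape, and dtype are made up); memory mapping pages in only the slice that is touched, while a seek/read pulls in an explicit byte range. The same idea applies to uncompressed ASDF blocks, since their byte offsets within the file are known:

```python
import numpy as np

# Hypothetical uncompressed binary block backing a (100000, 1000) float64 array.
mm = np.memmap("big_block.dat", dtype="f8", mode="r", shape=(100_000, 1_000))

# Only the pages covering this slice are read from disk.
subset = np.array(mm[5_000:6_000, :])

# Alternative: explicitly read a byte range of the block.
itemsize = np.dtype("f8").itemsize
row_bytes = 1_000 * itemsize
with open("big_block.dat", "rb") as f:
    f.seek(5_000 * row_bytes)
    rows = np.frombuffer(f.read(1_000 * row_bytes), dtype="f8").reshape(1_000, 1_000)
```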
Yeah, I am advocating for both (though I can see why block processing in the API should be raised elsewhere; I only noted it here because it is somewhat related).
Both being chunking and the other options? Or just the last two? For RST I'd say chunking is more consistent with the cloud model.
Chunking & block processing. You are correct that chunking is more consistent with the cloud model, whereas both chunking and block processing are important for data manipulation outside of the cloud.