ipld/legacy-unixfs-v2

Reproducible File Imports

Stebalien opened this issue · 14 comments

I'd like to be able to encode the chunking algorithm/add options used inside a file's metadata. This would make it easier to reproducibly add files to unixfs (and verify them).

Use-cases:

  • Archival.
  • Verifying responses from a gateway. See: multiformats/cid#22
  • "Convergent file adding"(?). Basically, if I already have some large files contained in an IPFS directory tree, I'd like to be able to add them myself (locally) instead of having to download them from a peer (just to get the correct chunking/add options).

This shouldn't add too much additional metadata, and will add none at all for files under 256KiB (they'll fit in a single object).
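For concreteness, here is a minimal sketch of the kind of import parameters this would need to capture, written as a Go struct. Every name below is hypothetical; nothing here is specified anywhere yet.

```go
// Hypothetical sketch only: none of these names are part of any spec. The
// point is that the parameters needed to reproduce an import are small and
// enumerable.
package unixfsmeta

// ImportParams records how a file was turned into a DAG, so another party
// holding the same bytes can re-import them and arrive at the same root.
type ImportParams struct {
	Chunker    string // e.g. a go-ipfs-style spec string like "size-262144" or "rabin-87381-131072-262144"
	Layout     string // e.g. "balanced" or "trickle"
	MaxLinks   int    // fan-out limit per intermediate node
	RawLeaves  bool   // whether leaves are raw blocks or wrapped
	CidVersion int
	HashFunc   string // e.g. "sha2-256"
}
```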

Are existing definitions of chunkers standardized enough that we could rely on them, or are we going to have to also specify that ourselves if we want to ensure they interoperate?

For instance, we know that when we chunk media files we should respect the keyframe/header boundaries for better seeking/range requests but I'm not 100% certain that there's a standardized way of describing that chunker.

Similarly, how reliable is Rabin across different implementations of the algorithm?

We'll have to come up with a naming scheme, unfortunately. This falls under the DEX project (ipfs/notes#204). Ideally we'd just point at a WebAssembly program, but that's probably not going to happen for a while...
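As a strawman for that naming scheme: go-ipfs already exposes flat chunker spec strings on the command line (size-&lt;bytes&gt;, rabin-&lt;min&gt;-&lt;avg&gt;-&lt;max&gt;), and a parser for that style is trivial. This only illustrates the flat-string approach; it's not a proposal that these exact names become the standard.

```go
// Strawman only: parses go-ipfs-style chunker spec strings. Whether these
// flat strings (or something richer, like a wasm program reference) become
// the standardized names is exactly the open question here.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ChunkerSpec is a parsed chunker name plus its numeric parameters.
type ChunkerSpec struct {
	Name   string // "size", "rabin", ...
	Params []int  // e.g. [size] for "size", [min, avg, max] for "rabin"
}

// ParseChunker splits a spec string like "rabin-87381-131072-262144".
func ParseChunker(s string) (ChunkerSpec, error) {
	parts := strings.Split(s, "-")
	spec := ChunkerSpec{Name: parts[0]}
	for _, p := range parts[1:] {
		n, err := strconv.Atoi(p)
		if err != nil {
			return ChunkerSpec{}, fmt.Errorf("bad chunker spec %q: %v", s, err)
		}
		spec.Params = append(spec.Params, n)
	}
	return spec, nil
}

func main() {
	fmt.Println(ParseChunker("size-262144"))
	fmt.Println(ParseChunker("rabin-87381-131072-262144"))
}
```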

Similarly, how reliable is Rabin across different implementations of the algorithm?

Uh... no idea?

Another thing to think about is how sharded file data is represented when it's too large for a single node. https://github.com/ipfs/unixfs-v2/pull/13/files#diff-916b3e1e005fd96e3a0546715235477dR38

If one implementation limits to 100 chunks per data array and compacts links starting at the first element of the array, while another limits to 1000 chunks and compacts into the end of the array backwards (which sounds awkward, but is probably more efficient because it makes access to the earliest parts of the file faster), then we also have a reproducibility problem.

We can either specify in great detail how this MUST be done in the data spec or we can try to define all the different methods people might want to use and add it to the metadata we're talking about here.
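To make the point concrete, here's a tiny sketch showing that even with identical leaves, the fan-out limit alone changes the DAG shape (and therefore the root hash); compaction direction is another degree of freedom on top of that. The 174 below is go-ipfs's default links-per-block as far as I recall; treat the numbers as illustrative.

```go
// Sketch: the shape of the file DAG, not just the chunk boundaries, has to be
// pinned down for roots to converge. Same leaves, different fan-out limit =>
// different tree => different root hash.
package main

import "fmt"

// depth returns how many levels a balanced tree needs to index `leaves` leaf
// blocks when each node may hold at most maxLinks links.
func depth(leaves, maxLinks int) int {
	d := 1
	capacity := maxLinks
	for capacity < leaves {
		capacity *= maxLinks
		d++
	}
	return d
}

func main() {
	leaves := 512 // e.g. a 128 MiB file cut into 256 KiB chunks
	fmt.Println("maxLinks=174: ", depth(leaves, 174), "level(s)")  // 174 is go-ipfs's default fan-out, iirc
	fmt.Println("maxLinks=1000:", depth(leaves, 1000), "level(s)")
}
```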

Yeah, I don't think we'll be able to come up with a clean, complete language for describing all possible chunking algorithms. I'd just like to record this information when possible.

Really, we'll probably record it in the file's general "metadata" section so we can figure out how exactly we want to do this later. We should just keep it in mind.

Really, we'll probably record it in the file's general "metadata" section

Any thoughts on whether we want a top-level property specific to how the data field is constructed, or whether we should just standardize a property inside the existing top-level metadata property?

I'm just wondering if we want to create a separation between metadata that unixfs implementations themselves are adding vs metadata added by higher-level actors. We have threads elsewhere pointing out that some people would like to store Content-Type metadata, and I wonder whether we're worried about conflicts in the same metadata namespace.
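To make the two options concrete (field names purely illustrative, nothing here is in the spec):

```go
// Purely illustrative: neither shape below is specified, and the field names
// are made up for the sake of discussion.
package unixfsmeta

// Option A: a dedicated top-level property describing how the data field was
// constructed, kept apart from user/application metadata.
type FileOptionA struct {
	Data     []byte                 // stand-in for the real data/links representation
	Chunking map[string]interface{} // reproducibility parameters at the top level
	Metadata map[string]interface{} // everything added by higher-level actors
}

// Option B: no new top-level property; the same information lives under an
// agreed-upon key inside the existing metadata map, next to things like
// Content-Type that higher-level actors may want to store.
type FileOptionB struct {
	Data     []byte
	Metadata map[string]interface{} // e.g. {"chunking": {...}, "mime": "video/mp4"}
}
```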

So, HTTP tried this with X- headers and, well, we all know how that went down. Basically, I'm not convinced we need to separate canonical from non-canonical.

Note: I'd still have a separate "metadata" section (separate from critical file information like size, etc.). However, I feel like forcing a canonical location and an "extra fields" location will lead to metadata duplication for compatibility.

For unixfs we will want to have extended attributes, we could store it as one. Then, if a file is downloaded using ipfs get, the chunking information can be saved in the filesystem, provided the filesystem supports extended attributes.
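For illustration only: if ipfs get were ever taught to do this on Linux, writing the chunking info into the user xattr namespace might look roughly like the sketch below. The attribute name is invented, and nothing in ipfs currently does this.

```go
// Sketch only: the xattr name "user.ipfs.chunking" is made up for this
// example, and ipfs get does not currently write any xattrs.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Stand-in for a file that a hypothetical `ipfs get` just wrote out.
	path := "downloaded-file.bin"
	if err := os.WriteFile(path, []byte("example content"), 0o644); err != nil {
		log.Fatal(err)
	}

	// Store the import parameters so a later re-import can reproduce the DAG.
	chunking := []byte(`{"chunker":"size-262144"}`)
	if err := unix.Setxattr(path, "user.ipfs.chunking", chunking, 0); err != nil {
		log.Fatalf("setxattr: %v", err) // fails on filesystems without xattr support
	}

	// Read it back.
	buf := make([]byte, 256)
	n, err := unix.Getxattr(path, "user.ipfs.chunking", buf)
	if err != nil {
		log.Fatalf("getxattr: %v", err)
	}
	log.Printf("stored chunking info: %s", buf[:n])
}
```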

So, HTTP tried this with X- headers and, well, we all know how that went down. Basically, I'm not convinced we need to separate canonical from non-canonical.

Good point. Dropping the prefixing helps the migration from de-facto standard to actual standard.

Has anyone tried spec'ing this out from the opposite direction -- what user stories do we have that really demand having parameterized chunkers?

Have we spent enough time considering our options for staking out a position on the reductionist/simplicity side here?

I don't understand why we parameterize chunkers.

I think we shouldn't.

Ecosystemically, using various parameterizations of chunkers is an antifeature: it adds complexity, and using any more than one value of it anyway is a net loss for the entire system, because it both breaks deduplication and raises exactly these questions about reproducibility.

I'm not aware of any other systems deployed in the wild which have significant usage and support parameterized chunking. Git doesn't. Venti didn't. I'm pretty sure from Backblaze, etc, blogs that they don't. Casync does, oddly, but it's new at this (and arguably, not designed for global pools, which changes things. You'd still never want to use casync commands with different chunking parameters and point them at the same storage pool, and iirc that's fairly loudly documented).

At most, changing major parameters like chunking algorithm should be treated as a migration, and handled extremely cautiously, because the cost of having more than one value active in the system is massive.

I can understand parameterized chunkers as a library; I can't as a user-facing tool, because no reasonable UX should foist the choice of chunker on a user who A) doesn't care, and B) can only possibly make wrong choices, per "using more than one value is a net loss for the system". If someone wants to use our libraries to build new tools with different values, that's fine. But our ecosystem should be an ecosystem: and part of things working well together involves picking concrete values for these things so users don't have to.

For unixfs we will want to have extended attributes, we could store it as one.

Please no.

Xattrs are already one of the swampiest string-string bags in Linux. Let's not add to them. Do we really seriously even want to consider fragmenting our already-fragmented-by-variable-chunking files by putting the chunking parameters in another header that makes even more hashes tend towards not converging?

Similarly, how reliable is Rabin across different implementations of the algorithm?

Extremely, and if it's not, it's a critical bug. A Rabin fingerprint is supposed to be not far off from the complexity of a CRC. There should be test vectors. There is no 'close'; there is correct and not correct. Non-identical behavior of a Rabin fingerprinting implementation is exactly as wrong as non-identical behavior of a function that calls itself "sha1": there is no acceptable amount of mismatch.
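To underline that: the cut points of a content-defined chunker are a pure function of its constants. The toy rolling-hash chunker below is a simplified stand-in (not the actual Rabin polynomial any IPFS implementation uses), but it shows why two correct implementations given identical parameters must produce byte-identical boundaries, and why test vectors are the right way to enforce that.

```go
// Toy content-defined chunker: a simplified rolling hash, NOT the real Rabin
// fingerprint used by go-ipfs. The cut offsets are fully determined by
// (multiplier, window, mask, min, max); any deviation between implementations
// with the same parameters is a bug, not a tuning difference.
package main

import "fmt"

const (
	window  = 16              // rolling window in bytes
	mask    = uint64(0x1FFF)  // cut when hash&mask == 0 (~8 KiB average)
	minSize = 2 << 10         // 2 KiB minimum chunk
	maxSize = 64 << 10        // 64 KiB maximum chunk
	mult    = uint64(1099511628211)
)

// cutPoints returns the offsets (exclusive) at which each chunk ends.
func cutPoints(data []byte) []int {
	// pow = mult^(window-1), used to drop the byte leaving the window.
	pow := uint64(1)
	for i := 0; i < window-1; i++ {
		pow *= mult
	}
	var cuts []int
	start := 0
	var hash uint64
	for i := 0; i < len(data); i++ {
		if i-start >= window {
			hash -= uint64(data[i-window]) * pow // drop the outgoing byte
		}
		hash = hash*mult + uint64(data[i])
		size := i - start + 1
		if (size >= minSize && hash&mask == 0) || size >= maxSize {
			cuts = append(cuts, i+1)
			start, hash = i+1, 0
		}
	}
	if start < len(data) {
		cuts = append(cuts, len(data))
	}
	return cuts
}

func main() {
	// Deterministic pseudo-random input: same bytes in, same cuts out, always.
	data := make([]byte, 1<<20)
	x := uint64(42)
	for i := range data {
		x = x*6364136223846793005 + 1442695040888963407
		data[i] = byte(x >> 56)
	}
	fmt.Println("chunks:", len(cutPoints(data)))
}
```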

Tl;dr: Logging a bunch of meta info does not give reproducibility/convergence when multiple uncoordinated users upload the same content. The "uncoordinated" part is important.

(My perspective on this is shaped a lot by working on Repeatr, which cares deeply about this kind of convergence, because we want to use the hashes of filesystems to check equality, and if that property isn't available without {unreasonably vasty amounts of additional configuration}, then we've got... well, we've got something that doesn't work.)

Has anyone tried spec'ing this out from the opposite direction -- what user stories do we have that really demand having parameterized chunkers?

There's a broad collection of use cases where files mutate and the changes need to be synced to a client who has the old version of the graph.

In these cases it's important for the party doing the mutation to know how the data was chunked, so that when it chunks the new data it follows the same parameters. If it doesn't, the new representation will have far more new blocks than it would have if it had followed the chunking settings of the prior chunker.
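A minimal illustration of that cost, using naive fixed-size chunking (not what any IPFS implementation does verbatim, but the effect is the same): append a few KiB to a file, re-chunk it with the original settings versus different ones, and count how many of the original blocks get reused.

```go
// Sketch: why the mutating party needs the original chunking parameters.
// Chunk a file, append some data, then re-chunk the new version with (a) the
// same chunk size and (b) a different one, and compare block hashes.
package main

import (
	"crypto/sha256"
	"fmt"
)

func chunks(data []byte, size int) map[[32]byte]bool {
	out := make(map[[32]byte]bool)
	for off := 0; off < len(data); off += size {
		end := off + size
		if end > len(data) {
			end = len(data)
		}
		out[sha256.Sum256(data[off:end])] = true
	}
	return out
}

func shared(a, b map[[32]byte]bool) int {
	n := 0
	for h := range a {
		if b[h] {
			n++
		}
	}
	return n
}

func main() {
	// Original file: 1 MiB of deterministic bytes, chunked at 256 KiB.
	orig := make([]byte, 1<<20)
	for i := range orig {
		orig[i] = byte(i * 31)
	}
	const origChunk = 256 << 10
	origBlocks := chunks(orig, origChunk)

	// Mutation: append 4 KiB to the end.
	updated := append(append([]byte{}, orig...), make([]byte, 4<<10)...)

	same := chunks(updated, origChunk) // re-chunk with the original settings
	diff := chunks(updated, 512<<10)   // re-chunk with a different chunk size

	fmt.Printf("same settings: %d of %d old blocks reused\n", shared(origBlocks, same), len(origBlocks))
	fmt.Printf("different settings: %d of %d old blocks reused\n", shared(origBlocks, diff), len(origBlocks))
}
```

With the original settings every untouched block is reused; with a different chunk size essentially none are, so the sync ends up shipping the whole file again.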

I'm not aware of any other systems deployed in the wild which have significant usage and support parameterized chunking. Git doesn't. Venti didn't. I'm pretty sure from Backblaze, etc, blogs that they don't. Casync does

All of these tools are more specific to a single use case than we are. They have algorithms for chunking specific to their use case and can easily assume that their other clients will follow the same logic.

Since we want to support use cases that would have conflicting requirements on the chunker, we need a way to state which method was used at encode time in order to be more interoperable.

It's worth noting that these settings only tell another client how the data was initially encoded; they don't force it to use the same settings. Another client may not even support the algorithm used by the original encoder of the data and decide to use something else entirely.

Tl;dr: Logging a bunch of meta info does not give reproducibility/convergence when multiple uncoordinated users upload the same content.

Correct, it is mostly useful when one actor uploads content and another modifies it.

For unixfs we will want to have extended attributes, we could store it as one.

Please no.

I think we might be mixing up "extended attributes" with the need for some sort of "meta" property that we allow clients to populate with whatever arbitrary information they want. "Extended attributes" implies that these attributes would be serialized into a filesystem that supports extended attributes, and I don't believe that is the goal here.

Hey-- @warpfork mentioned this issue to me.

Heads up that:

In the interest of getting the spec ready for use this quarter I'm going to table this.

The spec will include a meta field we can use to store this information. Based on usage we'll standardize something along the lines of fmtstr in the future.

rvagg commented

closing for archival