holepunchto/hyperdrive

Proposal: Storage of sha for files in the tree

Closed this issue ยท 16 comments

User-Problem

Many files got/get shared between users as part of a set. So many users (particularly in the same organization) maintain copies of the same file. If those users create new dats for every file they replicate that file even though they wouldn't need to.

This is particularly important when using a central tool like datBase or hyperdrive that (also) acts as backup for the data. Those central storages could easily just download the same file once for each client.

Code-Problem

Hyperdrive doesn't store the sha information in the tree which makes it impossible to determine the content before entirely downloading a file.

Solution

  1. Add optionally (default=true) sha property to the file-tree item: https://github.com/mafintosh/hyperdrive/blob/2a585c9a7906bd6d432ab9b3f7fd959fb8c3417c/index.js#L585-L595
  2. Add an optional API that allows to lookup files by their sha-hash in a cache. opt.cache.get(hash, cb) (?)

Would you accept a PR that implements this?

This idea has come up before (dat-ecosystem-archive/datproject-discussions#77 (comment)) and i'd be in favor of it. There's some more thinking to go in to it though: which hash functions to support? We already use BLAKE2b elsewhere, but md5/sha1 are much more commonly used today for this sort of de-dupe (as opposed to cryptographic assurance). How do we make this feature "future proof", allowing evolution of which hashes are used? What is the specific additional overhead incurred to calculate this hash for large files? Is this even useful if we don't find the "inner" hashes of compressed files as well?

My current opinion is that multiple hash types should be allowed, but we default to just SHA-1 to start. By default clients would compute and add the hash, but not verify on download (rely on merkel tree for that).

I think the cache API/lookup should be done by an external library, at least to start.

has come up before

Thanks! ๐Ÿ˜

How do we make this feature "future proof"?

My thinking is to add {hash: "blah", hashFn: "sha-1"} to the tree.

What is the specific additional overhead incurred to calculate this hash for large files?

I can see several different overheads:

  1. CPU Overhead for calculating the hash
  2. CPU Overhead for transferring the chunks to the hashing function
  3. Memory overhead for storing the hash + function-name
  4. Transfer overhead for transmitting the hash

Since all of this is added through an option, it will stay possible to transfer data at an unobstructed speed.

the cache API/lookup should be done by an external library

Yes, but how to trigger and weave it in? My approach would be to add a cache object to opts:
https://github.com/mafintosh/hyperdrive/blob/2a585c9a7906bd6d432ab9b3f7fd959fb8c3417c/index.js#L25

where cache would have a cache.get(`${hashFn}/${hash}`, cb) method to make sure its partially compatible to leveldb.

This kind of metadata is also useful in applications for establishing happens-before precedence. If you can reference the hash of a file, then you know it was created prior to the hash's publication.

@mafintosh a few questions:

  1. As mentioned in the comments before: should it also store what hashing method was used?
  2. Is it okay to be optional? With a flag in the opts?
  3. What about the cache-lookup API?

@martinheidegger

  1. yes, or just support blake2b for now (your call).
  2. yes should be optional.
  3. cache-api is out of scope for hyperdrive but def something that could have value by itself.

@mafintosh Thanks for the reply. I guess its easy to get started working on it :)

What I think about when I think about future-proofing this: Can this be parallelized over GPUs when I REALLY need to sync.

I'm curious, too, if using factoradics as an index/hash is a way to optimize dividing up the indexing work. See: https://www.researchgate.net/publication/292950384_A_GPU-based_Branch-and-Bound_algorithm_using_Integer-Vector-Matrix_data_structure

Also, the Bela Ban and team of jgroups.org have done a lot of the practical ground-work around optimizing state-sync over tcp and udp. Yes, it's Java, but the hashing and syncing logic has been optimized for decades now and includes a decent amount of wisdom to consider. Not as new as a permutahedron, approach, tho. ;)

See: http://jgroups.org/demos.html

ReplicatedHashMap

I spent some brain-cycles to rethink this issue and following bugged me: Having the sha in the metadata information means that we need to download the whole metadata tree before we can access a dat's entire lookup. In order to allow random access, this needs to be in a trie structure and probably a separate hypercore. If it were stored together with the metadata it would just result in duplicate storage of the hashes (once in the metadata and once in the trie).

Thinking about this, I believe that the solution I proposed in this issue is not a good idea. Should I close this issue?

I thought metadata was already being fully synched up when you connect to a peer, even in sparse mode.

@RangerMauve There is a sparseMetadata option in hyperdrive that treats metadata as sparse, which is useful (and used) when you have a lot of files right now. Though I assume this will be even more prevalent when using hyperdb as backend for hyperdrive. Basically: There is no reason to replicate the metadata of a million files of a previous version, lets only replicate the metadata of the hundred thousand files required for the current file-tree. (and also the download of this metadata might take some time...)

Huh, TIL. ๐Ÿ˜›

I'm 100% into having another hypercore with SHA hashes. It'll make interop with stuff like git easier and maybe help with deduplication and merges / forks.

Reading through the code a bit, it seems like this could be done either in hyperdb or hyperdrive. But it kinda seems more reasonable to do it in hyperdrive (using hypertrie), as it even could be backwards compatible (if implemented in both branches.)

Thought I think having an index (hash-index as first variant) on a hyperdb would be an awesome general thing.

I'm curious to see if a permutahedron could be used for this.