peterbourgon/diskv

CAS feature request/offer?


I was looking for a CAS[1].

Seems like it could be a pretty small change to diskv. You would need a cryptographic hash function, like say skein or sha256. Then something like:

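// Proposed API sketch: CryptoHash, put and get do not exist in diskv.
d := diskv.New(diskv.Options{
    BasePath:     "my-data-dir",
    Transform:    flatTransform,
    CryptoHash:   sha256.New, // proposed option
    CacheSizeMax: 1024 * 1024,
})
key, err := d.put([]byte{'1', '2', '3'}) // key would be the hash of the value
...
value, err := d.get(key)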

I use put/get because they imply (to me, anyway) atomic operations, which read/write do not.

Opinions? Alternatives?

If I did something like the above and send a pull request, would you consider it?

[1] http://en.wikipedia.org/wiki/Content-addressable_storage

CAS was definitely a motivating example for diskv. At least in my estimation, it's already possible, albeit via an extra, explicit step:

val := []byte{'1', '2', '3'}
key := cryptoHash(val) // this one
err := d.Write(key, val)
val, err = d.Read(key)

Would folding this functionality directly into diskv enable a use-case for you?

More specifically, look here: https://github.com/peterbourgon/diskv/blob/master/examples/content-addressable-store/cas.go

On 10/28/2013 08:38 AM, Peter Bourgon wrote:

More specifically, look here
https://github.com/peterbourgon/diskv/blob/master/examples/content-addressable-store/cas.go.

Looks good/close. For my uses I was hoping for:

  • put (value, and optionally key/checksum)
  • put would return an error if key/checksum didn't match checksum(value)
  • get would return an error if key didn't match checksum(value)

I can do that on my end, just thought it might be a nice addition to diskv.
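For illustration, a client-side wrapper along those lines might look roughly like the sketch below. It is only a sketch under assumptions: the names store, memStore, CAS, Put, Get, checksum and ErrChecksum are made up for this example, SHA-256 is an arbitrary choice of hash, and the store interface assumes the backing store offers Write(key string, val []byte) error and Read(key string) ([]byte, error), which is what diskv's Write and Read look like. An in-memory map stands in for diskv so the example runs on its own.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// store is the subset of a key-value store the wrapper needs.
// Assumption: a *diskv.Diskv value offers Write and Read with these signatures.
type store interface {
	Write(key string, val []byte) error
	Read(key string) ([]byte, error)
}

// memStore is an in-memory stand-in so the sketch runs without diskv.
type memStore map[string][]byte

func (m memStore) Write(key string, val []byte) error {
	m[key] = append([]byte(nil), val...) // store a copy of the value
	return nil
}

func (m memStore) Read(key string) ([]byte, error) {
	val, ok := m[key]
	if !ok {
		return nil, errors.New("key not found")
	}
	return val, nil
}

// ErrChecksum reports that a key does not match the hash of its value.
var ErrChecksum = errors.New("cas: key does not match checksum of value")

// checksum returns the hex-encoded SHA-256 digest of val.
func checksum(val []byte) string {
	sum := sha256.Sum256(val)
	return hex.EncodeToString(sum[:])
}

// CAS wraps a store with verified put/get (hypothetical type).
type CAS struct{ s store }

// Put stores val under its checksum and returns that key. If wantKey is
// non-empty, it must equal the checksum, otherwise nothing is written.
func (c CAS) Put(val []byte, wantKey string) (string, error) {
	key := checksum(val)
	if wantKey != "" && wantKey != key {
		return "", ErrChecksum
	}
	return key, c.s.Write(key, val)
}

// Get reads the value stored under key and verifies it still hashes to key.
func (c CAS) Get(key string) ([]byte, error) {
	val, err := c.s.Read(key)
	if err != nil {
		return nil, err
	}
	if checksum(val) != key {
		return nil, ErrChecksum
	}
	return val, nil
}

func main() {
	c := CAS{s: memStore{}}
	key, err := c.Put([]byte("123"), "")
	fmt.Println(key, err)
	val, err := c.Get(key)
	fmt.Println(string(val), err)
}
```

Swapping memStore for a real diskv instance should only require passing it as the store, assuming the signatures above hold. Note that put/get here are no more atomic than the underlying Write/Read.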

diskv's surface area is, hopefully, as minimal as possible, by design. A use-case like CAS is so easily implemented by the client that I can't currently justify expanding the API to accommodate it. But, honest thanks for your thoughts and consideration! I hope you find diskv useful anyway :)

Thanks, I'll write a wrapper for the needed functionality. Have you by chance looked at the optimum directory size for best random I/O on a normal ext3 or ext4 filesystem? Basically the optimum value for transformBlockSize.

transformBlockSize controls the number of directories at each level, and also the depth of the directory tree. In the case of MD5, I chose 2, i.e. two hex characters (8 bits) per level, so that the full 32-character hash would only go 16 levels deep, and each level would have a maximum of (00..ff) = 256 directory entries. You could probably make it 4 (8-level tree, 65536 entries per directory) without stressing the filesystem, but then you start hitting the limits of command-line tools. For example, ls * would break, I think.
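To make the arithmetic concrete, here is a rough sketch of a block-style transform and the resulting tree shape. It assumes the transform simply cuts the hex key into fixed-size chunks, one chunk per directory level, as described above; blockTransform and layout are illustrative names for this example, not necessarily what the linked cas.go uses.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// blockTransform splits key into consecutive blocks of blockSize characters,
// one block per directory level. Trailing characters that do not fill a
// whole block are dropped. blockSize must be positive.
func blockTransform(key string, blockSize int) []string {
	n := len(key) / blockSize
	path := make([]string, n)
	for i := 0; i < n; i++ {
		path[i] = key[i*blockSize : (i+1)*blockSize]
	}
	return path
}

// layout returns the tree depth and the maximum number of entries per
// directory for a hex-encoded key of hexLen characters.
func layout(hexLen, blockSize int) (depth, fanout int) {
	depth = hexLen / blockSize
	fanout = 1
	for i := 0; i < blockSize; i++ {
		fanout *= 16 // each hex character has 16 possible values
	}
	return depth, fanout
}

func main() {
	sum := md5.Sum([]byte("123"))
	key := hex.EncodeToString(sum[:]) // 32 hex characters for MD5

	for _, size := range []int{2, 4} {
		depth, fanout := layout(len(key), size)
		fmt.Printf("block size %d: depth %d, up to %d entries per directory\n",
			size, depth, fanout)
		fmt.Println(blockTransform(key, size))
	}
}
```

For a 32-character key, a block size of 2 gives 16 levels of at most 256 entries, and a block size of 4 gives 8 levels of at most 65536 entries, which matches the numbers above.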

I don't think there's a "best" number for performance, but you probably want to avoid super-deep trees, i.e. a transformBlockSize to hash length ratio that's too low, and you probably want to avoid putting too many files in a single directory.