multiformats/go-multihash

Failure possible in case of large data's multihash

PayasR opened this issue · 7 comments

Currently go-multihash takes in-memory strings as input and generates hashes of those. This can cause problems when a large file is given as input, since it can't be read into memory. We need to handle hash generation for large data, and possibly for input file(name)s too.

@PayasR can you link to the parts of the code using strings? I see https://github.com/jbenet/go-multihash/blob/master/sum.go#L15, which is using []byte. Though yes, it is a problem. We should be using an io.Reader there.

shoce commented

I ran into this problem while trying to calculate CIDs for 10+ gigabyte files and created the issue ipfs/go-cid#120, but I guess the discussion should continue here.

@shoce The core package within this repo makes use of the standard Go hash.Hash interface, which lets you hash streaming data by writing it to the hasher as an io.Writer.

The other thing to watch out for is that for large files, the CID for the file when used with IPFS may not simply be the direct hash of its bytes, as there is additional metadata included when importing into IPFS so that large amounts of data can be transferred in multiple smaller chunks. Do you need the CID for the direct large file, or for the IPFS representation of the data?

shoce commented

@willscott Thanks! For now I ended up with this piece of code:

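	// Stream the file through the hasher instead of reading it into memory.
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	fhash := sha256.New()
	if _, err := io.Copy(fhash, f); err != nil {
		return err
	}
	// Wrap the raw digest in a multihash, checking the error from Encode.
	fmhbb, err := mh.Encode(fhash.Sum(nil), mh.SHA2_256)
	if err != nil {
		return err
	}
	fmh, err := mh.Cast(fmhbb)
	if err != nil {
		return err
	}
	c := cid.NewCidV1(cid.Raw, fmh)
	fmt.Printf("cid:%s\n", c)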

Please let me know if you see any issues or possible improvements with it.

shoce commented

@willscott Thanks for explaining about CID and file metadata in IPFS, I could have missed it. But for my purpose I do not care about IPFS interoperability or compatibility. Actually I just need a hash of a file, and I could just use sha256, but knowing about multihash, I thought it would be nice if I could change the hash function easily later. Maybe I could just use sha256+multihash+multibase, but I thought that CID already combines all of this with the ability to change the hash, so it is exactly what I need. But should I consider using just multibase+multihash?

We might recommend using https://pkg.go.dev/github.com/multiformats/go-multihash@v0.0.15/core#GetHasher as well, rather than constructing your own sha256 directly, because this would give you the easy ability to switch hashes. And as you might have seen (or not; GitHub might not push notifications for mentions), #138 is another PR which will add a further helper function to do it all in one step.

SumStream is now also available, as #138 is merged. I think that means we can (finally) close this one :)