bio-guoda/preston

build bridge to IPFS-land

Closed this issue · 12 comments

IPFS aims to store content without relying on some centralized service like DNS.

Preston keeps track of (biodiversity) content.

Idea: make Preston and IPFS interoperable.

Step 1. Small example

Using https://docs.ipfs.tech/install/command-line/#install-official-binary-distributions, I was able to add a file to a local IPFS store.

echo "bleep" > file.txt
./ipfs add file.txt 

produced

added QmYou3ngXxSek7rfbATTWw8gduBqKXwMebb6ber5J2SwMh file.txt
 6 B / 6 B [===========================================================] 100.00%

and the vanilla sha256 hash:

cat file.txt | sha256sum
9f7807097477f4f480130cefd2521e033534ac967ec36119e18392bce24d81d3

aka

hash://sha256/9f7807097477f4f480130cefd2521e033534ac967ec36119e18392bce24d81d3
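As an aside, the hash URI above is straightforward to reproduce programmatically, for example in Python (note that `echo` appends a newline, which is part of the hashed content):

```python
import hashlib

# `echo "bleep" > file.txt` writes "bleep" plus a trailing newline
content = b"bleep\n"
print("hash://sha256/" + hashlib.sha256(content).hexdigest())
# → hash://sha256/9f7807097477f4f480130cefd2521e033534ac967ec36119e18392bce24d81d3
```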

The question is: how to calculate QmYou3ngXxSek7rfbATTWw8gduBqKXwMebb6ber5J2SwMh independently?

from https://docs.google.com/document/d/1_vL-hxsHGcy85g7EIUdLesztXFofQ9QW4VdZG3K5J8g/edit

#5 – 13 June, 16:00 CEST
Present: Philipp von Essen, Tobias Kuhn, Erik van Winkle, Axana Scherbeijn, Jorrit Poelen, Lauren Grieco, Louis ter Meer, Lyubomir Penev, Ronen Tamari, Steffen De Jong

Agenda:
Erik van Winkle on "Atomic Units of Knowledge in Research Objects: How FDOs Targeting Different Use Cases Can Collaborate"
Presentation
Nodes-Demo-Document

with the demo document attached, retrieved from https://nodes.desci.com/PZIlDkMRS_iM3HF3rAPZe1E8UsCDa-ncbM4dnsgfxA4.

./ipfs add Demo_Research_Report.pdf 

produced:

added QmUMs6LdCNXugG489dt64Tjd4oT1BJuE4DqQtF8JSF12Y7 Demo_Research_Report.pdf
 891.43 KiB / 891.43 KiB [=============================================] 100.00%

with

cat Demo_Research_Report.pdf | sha256sum

producing:

003e19ab870d338fbd3983c17904cfbdaa4dca3ca89493756519afb15a39ad0c  -

So, again, the challenge is to generate QmUMs6LdCNXugG489dt64Tjd4oT1BJuE4DqQtF8JSF12Y7 independently, given the file, so that we can say something like:

003e19ab870d338fbd3983c17904cfbdaa4dca3ca89493756519afb15a39ad0c isHashOf X
QmUMs6LdCNXugG489dt64Tjd4oT1BJuE4DqQtF8JSF12Y7 isHashOf X

i.e., that hash://sha256/003e19ab870d338fbd3983c17904cfbdaa4dca3ca89493756519afb15a39ad0c and QmUMs6LdCNXugG489dt64Tjd4oT1BJuE4DqQtF8JSF12Y7 are hashes derived from the same content.

Demo_Research_Report.pdf

It appears that content in IPFS cannot be easily bridged to something outside of IPFS land.

If there are examples out there that do build this bridge, I'd be happy to reconsider constructing an IPFS integration.

Note that the difficulty of bridging to/from IPFS-land is not a new topic; see, e.g., ipfs/kubo#1953.

This is why I was a bit surprised to read DeSci contributors state, in Hill et al. 2024, "Guest Post — Navigating the Drift: Persistence Challenges in the Digital Scientific Record and the Promise of dPIDs," in The Scholarly Kitchen, accessed via https://scholarlykitchen.sspnet.org/2024/03/14/guest-post-navigating-the-drift-persistence-challenges-in-the-digital-scientific-record-and-the-promise-of-dpids/

[...] Furthermore, it’s easy to check if the content you received from the IPFS network matches its hash, eliminating the risk of downloading content from unknown network peers. [...]

Yes, it is easy to calculate a sha256 hash, but . . . not so easy to calculate the associated address in IPFS land (the CID).

Note that the authors did disclose their business associations with DeSci via

[...] Full disclosure: The authors are affiliated with DeSci Labs AG and the DeSci Foundation. dPID is an open-source software solution distributed under an MIT license, developed by DeSci Labs AG (https://github.com/desci-labs/nodes/). [...]

and that their product is built on IPFS.

FYI @mielliott @cboettig @mbjones: a continuation of the IPFS and sha256 discussion.

m0ar commented

Hello @jhpoelen @mielliott @cboettig @mbjones 👋 I work with engineering at DeSci Labs. For the sake of knowledge sharing, I'll take the liberty of posting here after being forwarded your email touching on the Scholarly Kitchen article.

CIDs can express rich information about data encoding, allow content validation of partial transfers, and remain agnostic to the hashing scheme. This is a superpower, but unfortunately it means that producing one is not as simple as running a hash function over the entire set of data.

However, under certain constraints, there can be a 1:1 relationship between a sha256sum and the IPFS CID. Below a limit called the chunk size, IPFS will not break the file into smaller pieces and build a DAG out of it.

/tmp
❯ head -c 256KB /dev/urandom > 256kb.txt      

/tmp
❯ ipfs add --only-hash --cid-version 1 --raw-leaves 256kb.txt                
added bafkreid2flnnm6jvoxrowuuzpsvk4ppjtwl3p6mr2fsqg4l7mag64qllya 256kb.txt

/tmp
❯ ipfs cid format bafkreid2flnnm6jvoxrowuuzpsvk4ppjtwl3p6mr2fsqg4l7mag64qllya -b base16
f015512207a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0
#     sha|--------------------------------------------------------------|

/tmp
❯ sha256sum 256kb.txt
7a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0  256kb.txt

Those extra bytes at the front of the CID (f01551220 in hex) describe the context of the content hash. In this case, they signal that the hash encodes the raw bytes of the file. This neat little webapp breaks the format down for you: https://cid.ipfs.tech/#bafkreid2flnnm6jvoxrowuuzpsvk4ppjtwl3p6mr2fsqg4l7mag64qllya
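Assuming the single-chunk, raw-leaves case above, those prefix bytes can also be parsed without any IPFS tooling; a minimal Python sketch (function name is illustrative):

```python
import base64

def sha256_from_cid(cid: str) -> str:
    """Extract the sha256 hex digest from a base32 CIDv1 of a single raw leaf."""
    assert cid.startswith("b"), "expected multibase prefix 'b' (base32)"
    b32 = cid[1:].upper()
    b32 += "=" * (-len(b32) % 8)  # restore the stripped base32 padding
    data = base64.b32decode(b32)
    # 0x01 = CIDv1, 0x55 = raw codec, 0x12 = sha2-256, 0x20 = 32-byte digest
    assert data[:4] == bytes([0x01, 0x55, 0x12, 0x20]), "not a raw-leaf sha256 CID"
    return data[4:].hex()

print(sha256_from_cid("bafkreid2flnnm6jvoxrowuuzpsvk4ppjtwl3p6mr2fsqg4l7mag64qllya"))
# → 7a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0
```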

If we add a file larger than the default chunk size (256kb), the resulting CID says that it's now encoding something different, namely a dag-pb structure. See breakdown of CID at https://cid.ipfs.tech/#bafybeibs3xpg4jkoytj4vdqkiymipixfikfyfo45ywpnfdowj2meb64kxy. IPFS has split the file into a tree of chunks, each with its own CID.

/tmp
❯ head -c 1MB /dev/urandom > 1mb.txt  

/tmp
❯ ipfs add --only-hash --cid-version 1 --raw-leaves 1mb.txt
added bafybeibs3xpg4jkoytj4vdqkiymipixfikfyfo45ywpnfdowj2meb64kxy 1mb.txt

/tmp
❯ ipfs cid format bafybeibs3xpg4jkoytj4vdqkiymipixfikfyfo45ywpnfdowj2meb64kxy -b base16
f0170122032ddde6e254ec4d3ca8e0a461887a2e5428b82bb9dc59ed28dd64e9840fb8abe
# NOT the sha below, because file split into chunk tree

/tmp
❯ sha256sum 1mb.txt  
143252d3384e455625e3ab709a50ab08f9f7472e2a610384157aef610d631ba5  1mb.txt

We can explicitly chunk the file in 1 MB segments, under which the correlation with sha256 still holds. 1 MB is, however, the ceiling, as this is the size where the practicalities of the transport layer (bitswap) come in. IPFS clients simply communicate in content-addressed 1 MB chunks, so beyond that point things will always be DAGs of pieces rather than uniform files. This is key: IPFS CIDs aren't built just for file integrity checking; they are built to express file structures that allow piecewise download from multiple sources, while allowing continuous integrity checking of the individual pieces.

/tmp
❯ ipfs add --only-hash --cid-version 1 --raw-leaves --chunker size-1048576 1mb.txt
added bafkreiaugjjngocoivlcly5locnfbkyi7h3uolrkmebyifl255qq2yy3uu 1mb.txt

/tmp
❯ ipfs cid format bafkreiaugjjngocoivlcly5locnfbkyi7h3uolrkmebyifl255qq2yy3uu -b base16
f01551220143252d3384e455625e3ab709a50ab08f9f7472e2a610384157aef610d631ba5
#     sha|--------------------------------------------------------------|

The intuition here is that there is an n:1 relation between CIDs and data, as you can describe the same data in different encodings and under different hashing schemes. A CID expresses richer information than a hash, and you are looking for only a smaller subset of that information.

Putting these pieces together: by taking the sha256 hash of data under 1 MB, one can create the (hex) CID with this command:

❯ cat <(echo -n "f01551220") <(sha256sum 256kb.txt)
f015512207a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0  256kb.txt

This is a CID identifying a single-chunk, binary encoded file with the sha256 hash 7a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0. Here is the breakdown: https://cid.ipfs.tech/#f015512207a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0

Note that since this CID contains information about its own format (i.e., it is self-describing), one can't just run base32 on the CID to convert it. This is the transformation that ipfs cid format performs; it's not magic, just changing the non-digest parts of the CID to describe the encoding you should use to get the content hash:

/tmp
❯ ipfs cid format f015512207a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0 -b base32
bafkreid2flnnm6jvoxrowuuzpsvk4ppjtwl3p6mr2fsqg4l7mag64qllya
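For the single-chunk case, the whole sha256-to-CID transformation can also be reproduced without the ipfs binary; a Python sketch, assuming a raw-leaf CIDv1 with the base32 multibase:

```python
import base64

def cid_from_sha256(hex_digest: str) -> str:
    """Build a base32 CIDv1 (raw codec, sha2-256) from a sha256 hex digest."""
    # 0x01 = CIDv1, 0x55 = raw codec, 0x12 = sha2-256, 0x20 = digest length
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + bytes.fromhex(hex_digest)
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32  # 'b' = multibase prefix for lowercase base32

print(cid_from_sha256("7a2adad6793575e2eb52997caaae3de99d97b7f991d16503717f600dee416bc0"))
# → bafkreid2flnnm6jvoxrowuuzpsvk4ppjtwl3p6mr2fsqg4l7mag64qllya
```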

I hope this information was useful in building a better understanding of how these mysterious CIDs work :)

@m0ar Thanks for providing the example.

I agree that calculating CIDs is not magic, and your example confirms my understanding of the complexities related to calculating CIDs.

I'd urge you to update the blog post at https://scholarlykitchen.sspnet.org/2024/03/14/guest-post-navigating-the-drift-persistence-challenges-in-the-digital-scientific-record-and-the-promise-of-dpids/ to reflect these complexities.

Right now, the text makes it seem like the calculation of a CID is as easy as calculating a sha256 hash, and I think this is misleading especially given your example above.

m0ar commented

@jhpoelen In my opinion, this is splitting hairs. The CID is indeed a digital fingerprint, and it is created using sha256. In the context of IPFS, it's as easy as ipfs add file.txt. On the consumer end, you resolve a PID to some CID, and when you fetch that file you know that you are getting the correct content, as IPFS clients verify it for you.

The complexity you mention only appears when popping the hood and trying to translate between different ways of hashing things, which is a discussion both interesting and worth having. A dedicated discussion, or technical documentation, is probably a better forum for it, though.

@m0ar thanks for taking the time to reply.

In the context of IPFS, it's as easy as ipfs add file.txt

Yes, if the ipfs client were as universally accessible as the sha256/md5/sha1 algorithms (which it is not), this might be the case. However, as far as I understand, there are many implicit variables required to verify a CID (e.g., block size).

[...] you resolve a PID to some CID, and when you fetch that file you know that you are getting the correct content as IPFS clients verify it for you [...]

I'd like to be able to independently verify the retrieved CID content, and the design of IPFS, which combines content addressing, a content graph, and content blocking with a content exchange protocol, makes it difficult for me to implement IPFS support.

So, equating IPFS CID computation with sha256 content hashing is far from splitting hairs in my mind: as you demonstrated, the CID computation involves many (implicitly configured) processing steps (e.g., content blocking, content hashing, putting blocks in a content graph). In contrast, hashing digital content with a sha256 algorithm is a parameter-free operation supported by a diverse collection of software libraries across many platforms.

But hey, I can be convinced otherwise if you are willing to add an IPFS bridge to Preston. That way, you can demonstrate how to retrieve independently verifiable content from the IPFS universe. I got stuck on the complexity and configurability of the IPFS CID calculations, along with the lack of software libraries that would help add this functionality to Preston.

Thanks again for engaging in this discussion. I feel that we have a lot in common in realizing that content addressing is a useful way to point to digital content regardless of where or how it may be stored in the future.

@m0ar thanks for sharing, and thanks to @jhpoelen and all for an engaging discussion.

I think @jhpoelen's suggested test, that these systems based on SHA hashes should be interoperable, is a good one. As another case in point, you are probably all familiar with the Open Container Registry spec, which also tracks objects based on their SHA hashes. I think we can all agree the spec has proven its ability to scale and be replicated independently; for instance, GitHub notes that the Homebrew project alone uses the OCI-compatible GHCR to

distribute over a half a petabyte of binary packages to their users every month (1)

and we have seen many independent software implementations of the spec by major players. Because OCI is transparently SHA-256-based, it is also easy for third-party tools to retrieve content from these registries by SHA-256 checksums (e.g., I believe @jhpoelen has done this already in Preston).

The OCI support that @cboettig mentioned can be found at #255, which describes some events that led to the OCI support first being introduced in Preston v0.7.2 in July 2023.

This enabled content retrieval using:

preston cat\
 --remote https://ghcr.io/cboettig/content-store\
 hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

with the first 24 lines being:

*******************************************************************************
*** Historical CO2 Record from the Vostok Ice Core                          ***
***                                                                         ***
*** Source: J.M. Barnola                                                    ***
***         D. Raynaud                                                      ***
***         C. Lorius                                                       ***
***         Laboratoire de Glaciologie et de Geophysique de l'Environnement ***
***         38402 Saint Martin d'Heres Cedex, France                        ***
***                                                                         ***
***         N. I. Barkov                                                    ***
***         Arctic and Antarctic Research Institute                         ***
***         Beringa Street 38                                               ***
***         St. Petersburg 199226, Russia                                   ***
***                                                                         ***
*** January 2003                                                             ***
*******************************************************************************
                Mean
       Age of   age of    CO2
Depth  the ice  the air concentration
 (m)   (yr BP)  (yr BP)  (ppmv)

149.1	5679	2342	284.7
173.1	6828	3634	272.8
177.4	7043	3833	268.1

@m0ar What would it take to implement a similar bridge to IPFS with the same (exact) dataset to be retrieved from DeSci's IPFS universe allowing:

preston cat\
 --remote https://nodes.desci.com/\
 hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37\
 | sha256sum

to produce the independently verified (via sha256sum) fingerprint of the retrieved content:

9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

?
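For what it's worth, in the single-chunk, raw-leaves case such a bridge could map a hash://sha256/ URI to a gateway URL; a hypothetical Python sketch (the public gateway and the raw-leaves assumption are mine, and only valid if the content was added with `ipfs add --cid-version 1 --raw-leaves` in a single chunk):

```python
import base64

def gateway_url(sha256_hex: str, gateway: str = "https://ipfs.io") -> str:
    """Hypothetical bridge: map a sha256 hex digest to an IPFS gateway URL.

    Only valid if the content was added with
    `ipfs add --cid-version 1 --raw-leaves` and fits in a single chunk.
    """
    # 0x01 = CIDv1, 0x55 = raw codec, 0x12 = sha2-256, 0x20 = digest length
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + bytes.fromhex(sha256_hex)
    cid = "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return f"{gateway}/ipfs/{cid}"

print(gateway_url("9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37"))
```

The retrieved bytes could then be piped through sha256sum, as in the `preston cat` example above, to independently confirm the fingerprint.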