nexus-stc/hyperboria

Putting SciMag collection onto IPFS


Discussed some points with @eleitl.

Inputs

At the moment the data looks as follows: ~70 TB of data across around 90M files. Every 10 TB pinned requires around 20 GB of metadata on fast storage. So we need at least 7-10 seeders who can each pin a range of the data, but more seeders are welcome, as that increases coverage.
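A quick sanity check of the sizing above (the figures are taken straight from the issue; the arithmetic just scales the 20 GB per 10 TB ratio to the full collection):

```shell
# Assumed figures from the issue: 70 TB total, ~20 GB metadata per 10 TB pinned.
DATA_TB=70
META_GB_PER_10TB=20
TOTAL_META_GB=$(( DATA_TB * META_GB_PER_10TB / 10 ))
echo "Metadata needed if the whole 70 TB were pinned once: ${TOTAL_META_GB} GB"
```

So a seeder pinning a 10 TB range only needs ~20 GB of SSD metadata, while pinning everything would need ~140 GB.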

Preliminary Plan

  1. Provide an IPFS configuration and two scripts here: the first for downloading torrents, the second for putting the downloaded data onto IPFS and producing a text file of (canonicalized DOI; ipfs-hash) pairs
  2. Work out a way to manage who is responsible for which range of data. It will be hard to do manually, so I may create a semi-public endpoint that gives a seeder the URL of the torrent file they need to download, and another endpoint for accepting the pair files
  3. Post a wide call for seeders in communities: at least r/datahoarders, the Telegram channel, and privately to those who might be interested
  4. Accept all hashes until the end. Pair files should be published on IPFS immediately (perhaps by the seeders themselves). I will also embed them into my search database, and after finishing (or during the process, if it takes a long time) I will publish the database packed with Python scripts for traversing it.
    5*. Add a pinning feature to the Nexus app to make it possible for users to pin arbitrary sets of files
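The pair file from step 1 could be produced with a helper like the one below. This is only a sketch: the canonicalization rule (lowercasing the DOI) is an assumption, not the author's confirmed scheme, and the hash argument stands in for the CID that `ipfs add --nocopy` would return for each file:

```shell
# Append one "canonicalized DOI; ipfs-hash" pair per file to pairs.txt.
# Lowercasing as the canonical form is an assumption for illustration.
append_pair() {
  doi=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  hash="$2"  # in the real script this would come from `ipfs add --nocopy`
  printf '%s; %s\n' "$doi" "$hash" >> pairs.txt
}

append_pair "10.1000/XYZ123" "QmExampleHash"
cat pairs.txt
```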

Seeders Requirements

  1. Any storage for the bare data, with capacity from 0 TB up to 70 TB
  2. SSD storage for metadata (20 GB for every 10TB of bare data).
  3. An IPFS daemon configured with the filestore feature
  4. Bind-mount the bare storage (e.g. /srv) into the ~/.ipfs partition, something like mount -o bind /srv /home/ipfs/src. Required because the go-ipfs daemon cannot work across different partitions.
  5. Ability to seed for at least the following 12 months. During this period I hope we will manage to increase seeder coverage to reliable levels.
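Requirements 3 and 4 together amount to roughly the following setup. The paths are examples (taken from the requirement above); adjust them to your own layout:

```shell
# Bind-mount the bare storage under the IPFS home so the daemon and the
# data live on the same mount point (go-ipfs cannot work across partitions):
mount -o bind /srv /home/ipfs/src

# Enable the experimental filestore so added data is referenced in place
# instead of being duplicated into the ~/.ipfs datastore:
ipfs config --json Experimental.FilestoreEnabled true

# Add downloaded data without copying it into the datastore:
ipfs add --nocopy -r /home/ipfs/src/scimag
```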

I have weakened the minimal storage requirements after a talk with shrine.
The API endpoint will return torrent files one by one, on request, so we can reach even baby seeders with this approach. It is also better in terms of reliability.