internetarchive/dweb-mirror

Q: effort towards making IA "distributed"?

Opened this issue · 15 comments

drok commented

Hello,

Background

I am an engineer and have some time and software/systems development skill to volunteer, as well as a small amount of reliable datacenter hosting (unused bandwidth and storage on VPS's I use for other projects).

I came upon this project on October 12, 2024, which will likely be known in the future as the day/week when the Internet Archive was knocked offline by DDOS.

I came here looking to continue research while the IA is down, as my current research heavily relies on IA. If there were a way to reach the documents from another source than directly IA's hard-drives, eg, IPFS, I could continue my work.

Question 1

Is there any effort ongoing, desirable or welcome towards building a "distributed" mirror of the Internet Archive?

I am not intimately familiar with this project, or with IPFS, but I have the skills and ability to work on integrating a peer-to-peer distributed filesystem.

I have noticed that a lot of effort has been made with internetarchive/dweb-transports@b93c251 and related IPFS/WebTorrent initiatives, but a distributed Internet Archive mirror remains elusive to mere mortals. Thus, my question remains: is there still a desire to achieve a distributed IA?

Question 2

I naively attempted to install the project on Ubuntu as per [INSTALLATION.md] (curl -o- -L https://unpkg.com/@internetarchive/dweb-mirror/install.sh | bash), and this failed. I took note that this project could use a user-friendly installation package (a "one-click" way to access the data).

I could work on the accessibility of the project, too, if that were desirable/welcome. Is it?

The simplest installer I envision (from the end user's perspective) would be a web page with a Web Worker/Service Worker that implements this project and uses the browser's local storage as a cache.
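To make that concrete, here is a minimal sketch of the Service Worker idea, assuming a hypothetical worker script that intercepts archive.org requests and keeps copies in the browser's Cache Storage (the natural fit for "local storage" inside a worker). The cache name and hostname check are illustrative only, not anything dweb-mirror currently does:

```js
// sw.js - hypothetical sketch, not part of dweb-mirror.
// Network-first for archive.org content, falling back to a local cache
// so previously viewed items stay reachable while the site is down.
const CACHE_NAME = 'ia-offline-v1';

self.addEventListener('fetch', (event) => {
  if (event.request.method !== 'GET') return;
  const url = new URL(event.request.url);
  if (url.hostname !== 'archive.org' && !url.hostname.endsWith('.archive.org')) return;

  event.respondWith(
    caches.open(CACHE_NAME).then(async (cache) => {
      try {
        const response = await fetch(event.request);                  // try the live site first
        if (response.ok) cache.put(event.request, response.clone());  // keep a local copy
        return response;
      } catch (err) {
        const cached = await cache.match(event.request);              // offline: serve the copy
        if (cached) return cached;
        throw err;
      }
    })
  );
});
```

Registering it would just be navigator.serviceWorker.register('/sw.js') from the hosting page; the real work would be reimplementing enough of dweb-mirror's crawling and cache logic on top of this.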

The second simplest way would be the traditional app installer that most computer users are familiar with. It would not be limited to Pi users, as it seems to be today, nor to computer-savvy people. Anyone with the minimal skill to install Chrome should be able to install this program.

As above, I noticed some effort has already been made, but it's not clear that there is still interest/support from IA for it.

Conclusion.

I want to help make the data within the Internet Archive more accessible, even if the https://archive.org site is offline for any reason.

mitra42 commented

Thanks for your suggestions - I should make it clear that I haven't been actively supporting this project for a while, which means in particular I haven't been dealing with bitrot: the tendency of packages to stop working as things around them change!

Intent of dweb-mirror

The goal of dweb-mirror was to allow for independent boxes that could operate as local hosts that mirrored a selective part of the archive and that updated that offline collection whenever there was net access.

The main expected users were remote schools etc. It was intended to work standalone, or as part of systems like Internet In a Box or Rachel, which provide other offline host capabilities (mail servers etc).

In practice the resources on the Archive were not what those communities needed - in the right language, current, matching school curricula etc. - and the project was discontinued at the point we realised this. By that time, I had got most of the Archive's uploaded files to work, but not the Wayback Machine, which would have had a different set of potential users. The architecture of the Wayback Machine is different enough that it required a lot more development to support.

Back to the installation ….

I was never able to get a single "install.sh" to work from scratch on small boxes like the RPi and Armbian boards; these small boxes are different enough that they all needed some base work before getting to the install.sh. In most cases this included getting a copy of the OS onto the box.

The main market wasn't fully working Ubuntu, Debian, etc. systems, so dweb-mirror was never turned into a package that could be installed with apt-get. Those small boxes (except the Intel NUC) didn't have apt-get on them.

It was also never intended to work as an app for single users (e.g. on a personal Mac, Windows or Linux box); again, this wasn't the market. I can't immediately think of reasons that it wouldn't work for that. Certainly my development was done on a Mac.

There are installation documentation files for step by step installation on different platforms, including for a developer on a Mac. AFAIK they should still work, though I haven’t tested them in a few years.

To get a version going on your Ubuntu system, follow through INSTALLATION.md; it's probably close to the instructions for the Intel NUC. Feel free to integrate your learnings as steps for Ubuntu following the same format. Hopefully it will be simple enough to skip straight through to Step 4.

Same goes for install.sh - if you can figure out why it isn't working and fix it, that would be great.

Distributed Archive thoughts

A lot was done in the project that preceded this to integrate IPFS, WebTorrent and several other technologies. Of these, only WebTorrent was close to stable, and still works well. 

IPFS, at the time, couldn't handle any significant volume of files - its DHT was the limiting factor. In addition, it's really all-or-nothing, i.e. if you tried to fetch a file on IPFS there was no way to find out whether it had worked or not in order to fall back to HTTP. There was also enough change in APIs over time (aka BitRot) that code using it failed. I stopped using IPFS before dweb-mirror started, so I don't know if either of those problems was fixed.

WebTorrent is generally stable, and has really good fallback characteristics, i.e. you can try to fetch a file with WebTorrent, and if the WT fetch fails, it falls back to HTTP and seeds from that. I actually think that someone could create a really useful tool that creates WebTorrent files from a good, but at-risk, HTTP address.
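As a rough sketch of that tool idea (assuming the `webtorrent` npm package; the URL and file name are placeholders, and none of this is existing dweb code), one could fetch the at-risk file over HTTP, then seed a torrent that lists the original URL as a web seed, so later clients can still fall back to HTTP when no peers are around:

```js
// make-torrent-from-url.js - hypothetical sketch, not existing dweb code.
import WebTorrent from 'webtorrent';

const url = 'https://example.org/at-risk-file.pdf'; // placeholder URL

const res = await fetch(url);
if (!res.ok) throw new Error(`HTTP fetch failed: ${res.status}`);
const buf = Buffer.from(await res.arrayBuffer());

const client = new WebTorrent();
// Seed the file and record the original URL as a web seed (url-list),
// which is what lets WebTorrent clients fall back to HTTP and then reseed.
client.seed(buf, { name: 'at-risk-file.pdf', urlList: [url] }, (torrent) => {
  console.log('magnet link to share:', torrent.magnetURI);
});
```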

I have no thoughts on doing this in an App, as I've developed for neither Web Workers nor standalone Apps.

Support …


While I’m not actively supporting dweb-mirror, I’ll be happy to look at and integrate any Pull requests.

There may still be people using dweb-mirror via the Internet In a Box platform, as it is, or at least was, integrated in their standard release. It’s important not to break that, as that platform has a limited number of volunteer developers.


My own dev interests have diverged - I still work with resource constrained environments in LMICs (Low & Middle Income Countries) but my current focus is on affordable sensors for agriculture, etc. See Frugal-Iot

drok commented

Thank you for the insightful overview. I will take a closer look at this and the related dweb projects to see how I can help make the caching more distributed and readily installable. I have a feeling there are a lot of underutilized NAS's and desktop hard drives whose owners would be happy to contribute some storage and bandwidth if there was a packaged app available and easy to install and manage.

There are 13M Synology NAS's out there (~25% market share in its segment). If some percentage dedicated 1TB to act as a near-line distributed IA cache, DDOS-induced outages might be avoidable. I'd like to make it trivially easy for such NAS owners to help the IA fight censorship, if they wanted to.

I will keep in mind the Internet In a Box project, thank you for pointing it out.

gogo2464 commented

@drok,

Hello,

You are asking a very good question!

I am particularly aware of the current issues affecting the Wayback Machine website.

I am currently working on deploying the Wayback Machine on the censorship-resistant web host Freenet. As I said by email, it would resist DDOS with no need to maintain the infrastructure.

In order to fix these issues, I would like to bring the Wayback database onto a very serious censorship-resistant web host known as Freenet, which divides the database between the people using it instead of duplicating it as many censorship-resistant hosts do.

It would fix:

  • legal issues:
    • as no host is required, no host can be held responsible for potential copyright infringement
    • with the lack of hosting, the Wayback host could remain a non-profit organization forever and would not have to handle any money, with no ambiguity. It is just code and nothing else.
  • technical issues:
    • from what I read in the Freenet specs, each site's database is divided between its users instead of duplicated, as many p2p hosts do. Then there could not be any

I see there have been some attempts to run IPFS. IPFS has different goals than Freenet, as it:

  • provides good performance

but it does not:

  • protect against DDOS attacks on a specific infrastructure, as it does not protect node privacy (their IPs are public) nor rotate them, so they can be DDOSed
  • remove the need for hosting: the nodes still have to be available. Unhosted websites are less prone to civil legal penalties.

while Freenet:

  • requires no hosting and is therefore less prone to legal issues ("resists censorship")
  • has poor performance
  • always has the same performance: no DDOS is possible

To conclude, I suggest combining Freenet and IPFS.

IPFS would run day and night, while Freenet would run only in case of DDOS to keep an existing page available.

It would be nice to know your opinion.

I am also confused by the term "ai". The Wayback Machine is a history of internet websites, not an AI tool.

drok commented

I am also confused by the term "ai". The Wayback Machine is a history of internet websites, not an AI tool.

IA = Internet Archive

Also, the Internet Archive is far more than the "Wayback Machine" - it also archives content from libraries, and many publications that are not otherwise electronically available. Personally I use it to look up legislation from the 1800s - that stuff is not on any website, though it is available from some libraries. I love that the IA makes it possible for librarians to scan their collections and upload them to the Archive, so the Archive can then provide "Universal access to all knowledge", as its motto says.

Fine @drok!

Would you like to make a common repo to host it on Freenet? :)

There should be a public poll on this, I would gladly purge a movie collection if it helped keep IA distributed.

The question about ethical issues is very pertinent.

At this moment I just hope Freenet fits this use case, because:

  • it has to handle some TB of data. How much exactly?
  • Freenet works with ts/node/rust frontends only at this moment
  • it will require a lot of IA modifications

drok commented

@gogo2464 I am not at all familiar with freenet, other than a cursory review, but does "the freenet" not fit the role of a transport, to sit side by side with WebTorrent, HTTP, IPFS, filecoin and others in the https://github.com/internetarchive/dweb-transports project?

If the IA stores 200PB today, it might need to store 200EB in 20 years. Scalability would matter more than absolute size by today's standards.

@drok from a user-programming point of view, you can write your code in js/ts/rust and it will be compiled to wasm. Freenet code is entirely wasm.

@gogo2464 I am not at all familiar with freenet, other than a cursory review, but does "the freenet" not fit the role of a transport, to sit side by side with WebTorrent, HTTP, IPFS, filecoin and others in the https://github.com/internetarchive/dweb-transports project?

If the IA stores 200PB today, it might need to store 200EB in 20 years. Scalability would matter more than absolute size by today's standards.

That is a very big picture indeed!

Sadly, the issue then is that Freenet has to redistribute the backend as wasm. It requires a specific backend in Rust from site developers... the frontend could be in TypeScript. This is because the website can only be delivered as wasm.

drok commented

@mitra42 would you mind addressing the API question: Is there an API that a distributed archive can rely on? Has the API development/maintenance also fallen by the wayside?

I've only become interested in the "distributed" topic due to the Oct 9 outage, which seems to be ongoing. I've been waiting patiently for the site to get back to normal before starting to work on this project, because I don't have a working installation of dweb-mirror - I don't know how it was meant to work or look. If I should simply wait some more until the API documentation and dweb.archive.org return, tell me so.

I am not able to bootstrap my docker install of this project, because the downloading process cannot access "https://dweb.me/info" - should that work, eventually, if I wait long enough?

I have the feeling that various pieces of the dweb have suffered bit-rot, but I don't have a good feeling of which are rotted vs which are just temporarily broken due to the DDOS.

I'm available to do work, and I'm just trying to take stock of what's already done, and where my time can be spent best.

Please illuminate the work area some more.

mitra42 commented

@drok - the short answer is that I don't know; I'm not at the Archive any more, and while I am in touch, I'm not going to pester people about this while many higher-priority services are still down.

The ping to Dweb.me/info shouldn't be needed, and I doubt that anything relies on it responding. It was used to tell which services were up or not.

AFAIK all the Archive APIs that the dweb code (including dweb-mirror) relies on are still there, but with the security tightening during the DDOS I don't know - some may be down temporarily, some may permanently not work. It's probably best to wait till the Archive is fully back, both because we'll know then which (if any) APIs were disabled, and also because there might then be people available with energy to re-enable them, which won't be the case at the moment.

Lots of the underlying dweb technologies have bit-rotted; the IPFS API saw a lot of changes, and I stopped revising for them, but dweb-mirror doesn't use IPFS. I believe it uses WebTorrent where available, but I can't remember for sure.

By far the most stable dweb protocol for distributing stuff in my experience (and others may disagree) was WebTorrent, mostly because it plays well with HTTP, meaning that if you try to load a torrent file and it can't find a distributed version, it will fetch the original over HTTP and then start seeding it.
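As far as I understand WebTorrent's behaviour, the consuming side needs no special fallback code when the torrent carries a web seed; a minimal sketch (the magnet URI is a placeholder, and this isn't existing dweb code):

```js
import WebTorrent from 'webtorrent';

const client = new WebTorrent();
// Placeholder: a magnet URI or .torrent that includes a web seed (url-list).
client.add('magnet:?xt=urn:btih:...', (torrent) => {
  // WebTorrent tries peers and the web seed itself; once done it keeps seeding.
  torrent.files[0].getBuffer((err, buf) => {
    if (err) throw err;
    console.log(`got ${torrent.name}: ${buf.length} bytes, now seeding`);
  });
});
```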

Maybe I should be clear about what dweb-mirror does. ....

It is good, and efficient, at crawling parts of the Archive to make offline copies in a file system, and at caching anything the UI accesses.
It does this (mostly) through publicly available APIs, though it gets these through some special (very simple) services at the Archive that, for example, fix up some issues with the metadata and torrent files (see the sketch below).
It also has a copy of an older version of the UI (dweb-archive) that can run offline from the local copies of the files.
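For orientation, the kind of publicly available API meant above is, I believe, the Archive's metadata endpoint at https://archive.org/metadata/{identifier}. A minimal sketch that lists an item's files directly, bypassing the Archive-side fix-up services dweb-mirror actually goes through (the identifier below is a placeholder):

```js
// Sketch only: raw metadata API, without dweb-mirror's fix-up services.
async function listItemFiles(identifier) {
  const res = await fetch(`https://archive.org/metadata/${identifier}`);
  if (!res.ok) throw new Error(`metadata fetch failed: ${res.status}`);
  const { metadata, files } = await res.json();
  console.log('item title:', metadata && metadata.title);
  // Each file is reachable under /download/<identifier>/<filename>.
  return files.map((f) => ({
    name: f.name,
    size: Number(f.size || 0),
    url: `https://archive.org/download/${identifier}/${f.name}`,
  }));
}

listItemFiles('some-item-identifier').then((files) => console.log(files.slice(0, 5)));
```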

It doesn't do anything else - like peer-to-peer between boxes - because it was designed to work in intermittently connected environments like remote schools. It never got traction, because those schools mostly needed current, curricula-aligned materials in their own language, and we didn't have those materials on the Archive.

@gogo2464 I am not at all familiar with freenet, other than a cursory review, but does "the freenet" not fit the role of a transport, to sit side by side with WebTorrent, HTTP, IPFS, filecoin and others in the https://github.com/internetarchive/dweb-transports project?

If the IA stores 200PB today, it might need to store 200EB in 20 years. Scalability would matter more than than absolute size by todays standards.

How much data does the IA store at this moment, please?

drok commented

How much data does the IA store at this moment, please?

Based on my own research, it stores around 100M items (documents, films, audio recordings, books), while the Wayback Machine stores ~900M distinct "web pages", each with its own history. The "Archive" and "Wayback" parts of the IA are different applications, based on different archival logic. Together, the amount of disk space used for storage is around 200PB (in 2021).

Sources:

Indexes

I forget where I saw it, but at some point in time the indexes were on the order of single-digit terabytes (6TB is the figure stuck in my memory), and the index search workload was sharded in some fashion across roughly 60 machines.

My effort

BTW, having looked at what was implemented in the dweb projects, I think the lack of distributed indexing is where effort should be spent next. I've started drafting some ideas for an implementation of a distributed index based on existing DNSSEC technology, which would allow the IA organization to maintain control over the index, but also to delegate it to a large number of distributed shards. I think it's critical that responsibility for the content of the index rest with one organization for liability reasons (however, in my proposal, "forking" the Internet Archive would be easy to do, allowing other libraries to become distinct heads-of-liability, and allowing them to index the same, slightly different, or vastly different content). Forking is essential, because whoever wants to destroy the IA organization might eventually succeed. The knowledge should be preserved even in that unfortunate event.

In my DNSSEC idea, volunteers who want to donate storage/compute/bandwidth would install and run a "dweb+indexing" application which would host a shard of the index, as well as some of the cached data, and their instance of the app would answer index and/or data queries from their own shard, or by forwarding the query to another indexer in the namespace. The searching part would be based on DNS, while the data distribution would be mainly based on torrent tech, as envisioned in the dweb projects - I find torrent tech has a lot of merit.
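Purely to illustrate the shape of the proposal (everything below is invented for the example - the zone index.example.org, the "magnet=" TXT layout, and the hash-prefix sharding - it is not a spec), a lookup on the DNS side might look like this, with DNSSEC validation left to the resolver and the zone-signing key kept by the controlling organization:

```js
// Hypothetical sketch of the DNS-backed index lookup idea above.
import { createHash } from 'node:crypto';
import { resolveTxt } from 'node:dns/promises';

async function lookupItem(identifier) {
  const hash = createHash('sha256').update(identifier).digest('hex');
  const shard = hash.slice(0, 4);   // delegation label: each volunteer's
                                    // nameserver serves one such subzone
  const leaf = hash.slice(0, 32);   // truncated so it stays a valid DNS label
  const name = `${leaf}.${shard}.index.example.org`;

  const records = await resolveTxt(name); // string[][] of TXT chunks per record
  const txt = records.map((c) => c.join('')).find((s) => s.startsWith('magnet='));
  if (!txt) throw new Error(`no index record for ${identifier}`);
  return txt.slice('magnet='.length);     // hand the magnet URI to a torrent client
}

lookupItem('some-item-identifier').then(console.log).catch(console.error);
```

Forking would then, presumably, amount to another organization signing its own zone with the same layout and pointing it at whatever content it chooses to index.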

I will add a pointer here when my proposal is ready to be shared...