ipfs/service-worker-gateway

Support direct HTTP retrieval from /https providers

lidel opened this issue · 11 comments

Example:

dag house has /https providers such as /dns4/dag.w3s.link/tcp/443/https.

TODO

  • confirm with dag house that dag.w3s.link supports Block Responses (application/vnd.ipld.raw): https://specs.ipfs.tech/http-gateways/trustless-gateway/#block-responses-application-vnd-ipld-raw
  • MVP: when the trustless gateway is slow to respond or errors out, leverage the delegated /routing/v1 endpoint to learn about additional HTTP providers
  • leverage DAG affinity information and prefer known direct providers that served parts of the DAG before
    • ipfs/specs#462
    • when fetching a DAG's root for the first time, the SW gateway should asynchronously ask /routing/v1 for /https providers and store the affinity information for subsequent requests to that DAG. When affinity information is present, attempt retrieval from direct providers first and use the default/generic gateway only as a fallback (see the sketch after this list).
  • BLOCKER for deployment to inbrowser.dev: find a way to incorporate the denylist from https://badbits.dwebops.pub/ to avoid facilitating hosting of bad bits under a domain name owned by Shipyard.
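
A hypothetical sketch of that affinity flow in TypeScript (none of these names exist in Helia today; the endpoint is the public delegated routing instance and the response shape follows the /routing/v1 spec at https://specs.ipfs.tech/routing/http-routing-v1/):

```ts
// Hypothetical affinity cache: maps a DAG root CID to /https provider
// multiaddrs that are known to have served parts of it before.
const dagAffinity = new Map<string, string[]>()

async function httpsProvidersFor (rootCid: string): Promise<string[]> {
  const known = dagAffinity.get(rootCid)
  if (known !== undefined) {
    return known // affinity present: try these direct providers first
  }
  // First request for this root: ask delegated routing for providers.
  const res = await fetch(`https://delegated-ipfs.dev/routing/v1/providers/${rootCid}`, {
    headers: { accept: 'application/json' }
  })
  const body = await res.json() as { Providers?: Array<{ Addrs?: string[] }> }
  const httpsAddrs = (body.Providers ?? [])
    .flatMap(p => p.Addrs ?? [])
    .filter(addr => addr.endsWith('/https'))
  dagAffinity.set(rootCid, httpsAddrs)
  return httpsAddrs
}
```

Callers would try these addresses first and fall back to the generic gateway only when the list is empty or the direct providers fail.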

Based on ipfs/helia#439 (comment), dag.w3s.link now supports raw block responses.
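
For reference, a raw block retrieval per the trustless gateway spec boils down to a single request with an Accept header (the CID below is a placeholder):

```ts
// Fetch one raw block from a trustless gateway; see
// https://specs.ipfs.tech/http-gateways/trustless-gateway/
const cid = 'bafkreiexamplecid' // placeholder, not a real CID
const res = await fetch(`https://dag.w3s.link/ipfs/${cid}`, {
  headers: { accept: 'application/vnd.ipld.raw' }
})
if (!res.ok) {
  throw new Error(`gateway returned HTTP ${res.status}`)
}
const block = new Uint8Array(await res.arrayBuffer())
// "Trustless" means the client must still verify these bytes hash to the
// requested CID before using them.
```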

Once ipfs/helia#483 is released, I believe we will get this for free once we update @helia/verified-fetch.
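
If helpful, consuming that could look roughly like this sketch, assuming the createVerifiedFetch options documented for @helia/verified-fetch; the gateway/router URLs and the CID are illustrative:

```ts
import { createVerifiedFetch } from '@helia/verified-fetch'

const verifiedFetch = await createVerifiedFetch({
  gateways: ['https://dag.w3s.link'],     // direct /https provider
  routers: ['https://delegated-ipfs.dev'] // delegated /routing/v1 endpoint
})

const resp = await verifiedFetch('ipfs://bafkreiexamplecid') // placeholder CID
console.log(resp.status, resp.headers.get('content-type'))
```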

I would be afraid to ship this under a Shipyard-owned domain like inbrowser.dev without having some story for applying https://badbits.dwebops.pub.

I've added the task "BLOCKER for deployment to inbrowser.dev" to the list above.

@SgtPooki @hacdias @aschmahmann we need to decide what to do here, before we enable direct downloads.

Options I see:

  • (A) apply filtering in JS by enabling a badbits subscription in Helia
  • (B) apply filtering in Someguy at https://delegated-ipfs.dev

In both cases, we could enable filtering by default only when the Service Worker is deployed from inbrowser.tld, just like we enable it in rainbow deployed at ipfs.io, but people can run their own instance without it.

For Someguy, we could use a query parameter, an HTTP header, or a dedicated domain.

Prior art: DNS services use a dedicated domain/IP: https://quad9.net/service/service-addresses-and-features
If we follow a similar pattern, we could have nobadbits.delegated-ipfs.dev with filtering.

My vote would be to wire denylist support into Someguy so that it applies only on specific domains (based on the Host header of the request, with no filtering by default), set up nobadbits.delegated-ipfs.dev with it, and enable it on inbrowser.dev|link. But maybe there is a faster/easier way?

Thoughts?

I feel like both A and B need to be done eventually. With that said, we discussed why we shouldn't enable badbits by default:

We don't know the contents, and some countries may not consider certain bits bad; i.e. there's no way for us to tell what is truly, objectively, bad content.

Still, since we're developing in open source, the easiest, safest thing to do is to implement filtering by default and allow folks to fork and un-filter where they need to.

I think a simple blockstore wrapper that double-hashes and checks for badbits would be fairly easy to do in JS (a sketch of the wrapper follows at the end of this comment). It shouldn't take long, but I think configuring the update mechanisms to ensure the JS badbits lib is always up to date, without significant slowdown on clients, is the challenging part.

  • If we download the badbits list at build time, we need CI to ensure it's always polling & publishing the latest.
  • If we download it at runtime, that can significantly hinder startup time, and the list may not always be available for download.

These two things seem like something Kubo/boxo may already have resolved, and it may be faster or easier to do something similar in Someguy.
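
Either way, the wrapper portion is small. A minimal sketch, assuming the badbits anchoring scheme (hex-encoded sha256 of the base32 CIDv1 string followed by "/") and a denylist Set loaded by one of the mechanisms above; the function names are hypothetical:

```ts
import { CID } from 'multiformats/cid'

// Sketch only: wraps any async block getter with a badbits check.
// Assumes anchors of the form hex(sha256(cidV1Base32 + '/'));
// `denylist` is a Set of those hex anchors, loaded/updated elsewhere.
type GetBlock = (cid: CID) => Promise<Uint8Array>

async function badbitsAnchor (cid: CID): Promise<string> {
  const bytes = new TextEncoder().encode(`${cid.toV1().toString()}/`)
  const digest = new Uint8Array(await crypto.subtle.digest('SHA-256', bytes))
  return [...digest].map(b => b.toString(16).padStart(2, '0')).join('')
}

function withDenylist (get: GetBlock, denylist: Set<string>): GetBlock {
  return async (cid) => {
    if (denylist.has(await badbitsAnchor(cid))) {
      throw new Error(`${cid} is on the denylist`)
    }
    return get(cid)
  }
}
```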

Denylist from https://badbits.dwebops.pub/badbits.deny is ~15 MiB (gzipped).

If we go the JS blockstore route, no matter how the list is fetched (remotely, or embedded in the same DAG as the SW), a ~16 MiB penalty on initial page load is tough.

Perhaps instead of moderating routing responses, we could have a dedicated delegated denylist endpoint that could be queried?
Something like https://delegated-ipfs.dev/badbits/v1/{double-hashed} would allow a delegated check, and responses would be nicely cacheable in stale-while-revalidate mode.
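
A client for that hypothetical endpoint could be as small as the sketch below; the URL shape and the 200-listed/404-not-listed semantics are assumptions:

```ts
// Sketch of a client for the *proposed* endpoint; nothing here exists yet.
async function isDenylisted (anchor: string): Promise<boolean> {
  const res = await fetch(`https://delegated-ipfs.dev/badbits/v1/${anchor}`)
  if (res.status === 200) return true   // assumed: on the denylist
  if (res.status === 404) return false  // assumed: not listed
  throw new Error(`unexpected HTTP ${res.status}`)
}
```

Served with Cache-Control: stale-while-revalidate, repeat checks for the same anchor would be answered from the browser or CDN cache.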

> I would be afraid to ship this under a Shipyard-owned domain like inbrowser.dev without having some story for applying https://badbits.dwebops.pub/.

I wish we could get out of the business of mandating blocklists here, rather than letting them be user-controlled (whether opt-in or opt-out). Maybe we're stuck with this reality for now, but it'd be useful to explore with some of the legal + open internet folks (the IPFS Foundation should have some contacts) how much of this is needed. Ideally it'd be possible for someone to deploy a public resource without having a specific denylist hardcoded, given that different legal jurisdictions and individuals feel differently about what should be blocked.

Perhaps there's an expendable domain name we can use here if we're concerned about, say, being legally in the clear but running into issues with the technical middlemen of the web (i.e. curators of other resource-blocking lists who don't understand or would disagree with our position).

> Perhaps instead of moderating routing responses, we could have a dedicated delegated denylist endpoint that could be queried?

If going this route, we could scope the blocklist closer to the HTTP/request layer (i.e. not caring about blocked blocks/subdags) and consider something like https://security.googleblog.com/2022/08/how-hash-based-safe-browsing-works-in.html (a sketch follows after the notes below).

A few notes:

  • Google has an easier time given they can block on domains + pathing, and hashes don't really enable that, which might encourage us to be laxer
  • If doing this I'd suspect we'd want to deduplicate work/storage across subdomains
  • Alternatively, given that the current deployment requires going to bafyfoo.ipfs.<sw-gateway-root-domain> would it be plausible to not block subresources but just block at the subdomain request layer? The subdomain request is the only request to the infra provider where they'd even find out about what was being requested, so as long as that exists (i.e. no dynamically created subdomains via service workers) this seems plausible.
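
Back to the Safe Browsing idea: for illustration, such a check would keep only short hash prefixes on the client and ask a (hypothetical) server for full hashes on a prefix hit, so the server never learns exactly what was checked on a miss:

```ts
// Sketch of a Safe Browsing-style lookup (assumed API shape, illustration
// only). The client stores short prefixes of every denylist anchor; a miss
// is decided locally, and only on a prefix hit does it ask the server for
// the full anchors sharing that prefix, comparing them locally.
async function hashPrefixCheck (
  fullAnchorHex: string,
  localPrefixes: Set<string>, // e.g. the first 8 hex chars of each anchor
  prefixLen = 8
): Promise<boolean> {
  const prefix = fullAnchorHex.slice(0, prefixLen)
  if (!localPrefixes.has(prefix)) {
    return false // definite miss, no network request needed
  }
  // Hypothetical endpoint returning every full anchor with this prefix.
  const res = await fetch(`https://delegated-ipfs.dev/badbits/v1/prefix/${prefix}`)
  const { hashes } = await res.json() as { hashes: string[] }
  return hashes.includes(fullAnchorHex)
}
```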

> Alternatively, given that the current deployment requires going to bafyfoo.ipfs.<sw-gateway-root-domain>, would it be plausible to not block subresources but just block at the subdomain request layer? The subdomain request is the only request to the infra provider where they'd even find out about what was being requested, so as long as that exists (i.e. no dynamically created subdomains via service workers) this seems plausible.

I think this approach sounds good. One thing we may run into is someone creating a static site of all the badbits, and then our <sw-gateway-root-domain> gets blocked by browsers because a popular (non-badbit) bafyfoo subdomain got users to load badbit subresources.

Would that cause us to get blocked? If not, let's do it.

I believe that direct HTTP retrieval is now supported in Helia and @helia/verified-fetch with the sessions work.

The main thing left here is to add badbits blocking support.

Based on the comments above, it seems that the blocking could be implemented in either or both of:

  • The SW Gateway deployment, i.e. bafyfoo.ipfs.<sw-gateway-root-domain>: this seems relatively straightforward to support since we already do a similar thing for the gateway.
  • The delegated routing endpoint, either in someguy or on the reverse proxy layer.

@lidel @aschmahmann Any thoughts on this?

@2color I agree; I've been marinating on this for a while and I think we can enable it on inbrowser.tld without waiting for badbits in JS.

I think we could have 3 stages, where (1) can be done TODAY, and is not controversial.

(2) and (3) could be discussed / tackled later (TBD order/priorities)

Step 1: badbits on inbrowser.tld

For inbrowser.tld we may actually be good enough with subdomains forcing a load of the SW installer for every root CID, and that load being blocked by badbits the same way the dweb.link subdomain gateway is.

The blocking on nginx for dweb.link is based on nginx calling the badbits-auth-request microservice via auth_request in badbits_auth.conf, which is included by dweb_link.conf.
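
For anyone unfamiliar with the mechanism: nginx's auth_request module issues a subrequest and allows the original request on a 2xx response, while a 403 denies it. A toy sketch of such a microservice in TypeScript (illustration only, not the actual badbits-auth-request code; the header name is hypothetical):

```ts
import { createServer } from 'node:http'

// Illustrative sketch of an auth_request-style denylist check.
const denylist = new Set<string>() // hex anchors, loaded from badbits.deny elsewhere

const server = createServer((req, res) => {
  const anchor = req.headers['x-badbits-anchor'] // hypothetical, set by nginx config
  if (typeof anchor === 'string' && denylist.has(anchor)) {
    res.statusCode = 403 // deny; nginx can map this to the 410 users see
  } else {
    res.statusCode = 204 // allow
  }
  res.end()
})

server.listen(8080)
```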

Example (SFW, a picture of a boat, probably copyright thing?):

If we have this, we can enable direct retrieval on inbrowser.tlds safely without having badbits implemented in JS.

I've opened https://github.com/ipshipyard/waterworks-infra/pull/160 to do just that. Once merged, the second link should also return 410.

Step 2: delegated badbits in verified-fetch, Someguy, and delegated-ipfs.dev/denylist/v1

My working assumption is that delegated badbits are higher priority than native support in JS, because they benefit both JS and nginx use, and provide a way of minimizing the performance cost on the end client.

A broad-strokes plan:

  1. look at the API of the badbits-auth-request microservice, and clean it up
  2. create a spec for "delegated denylists" under /denylist/v1 (IPIP for https://specs.ipfs.tech)
  3. implement /denylist/v1 in Someguy: this allows us to replace the NodeJS microservice with a Someguy instance, reducing the number of things we maintain and run, and lets people who run Someguy get delegated denylist support for free
  4. deploy it at delegated-ipfs.dev/denylist/v1 and make service-worker-gateway/verified-fetch use it as a client

With (4) we would get the denylist check not only on initial load, but also on any subsequent block request made by the SW itself.

The nice thing about a delegated denylist is that page load is not impacted (see the sketch below).
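
For example, the SW could consult the endpoint like this; the /denylist/v1 path and the 200-listed/404-not-listed semantics are assumptions pending the IPIP:

```ts
// Sketch: check the (proposed) delegated denylist from inside the SW,
// caching verdicts with the Cache API so repeat checks stay local.
async function checkDelegatedDenylist (anchor: string): Promise<boolean> {
  const url = `https://delegated-ipfs.dev/denylist/v1/${anchor}`
  const cache = await caches.open('denylist-v1')
  const cached = await cache.match(url)
  if (cached !== undefined) {
    return cached.status === 200 // assumed: 200 = listed, 404 = not listed
  }
  const res = await fetch(url)
  await cache.put(url, res.clone()) // revalidated later per Cache-Control
  return res.status === 200
}
```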

Step 3: Native badbits support in JS

No magic here: create a performant, native implementation of https://docs.ipfs.tech/concepts/ipfs-gateway in JS/WASM.

The downside of doing this is that dapp load requires fetching an additional ~30 MiB (and growing) denylist from https://badbits.dwebops.pub/badbits.deny.

This could act as a fallback when the delegated denylist endpoint is not available.
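
For reference, the double-hashed entries in badbits.deny are lines of the form //<sha256-hex-anchor> (compact denylist format, IPIP-383), so a minimal parser for such a fallback Set could look like this sketch:

```ts
// Sketch: extract the double-hashed anchors from badbits.deny.
// Other rule types (/ipfs/... paths, comments, headers) are skipped here.
function parseBadbits (text: string): Set<string> {
  const anchors = new Set<string>()
  for (const line of text.split('\n')) {
    const trimmed = line.trim()
    if (trimmed.startsWith('//')) {
      anchors.add(trimmed.slice(2))
    }
  }
  return anchors
}
```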

That sounds like a good plan @lidel.

Regarding Step 2, I suppose if we add the denylist functionality in Someguy, we could automatically apply it for content routing requests to avoid an additional roundtrip to check the denylist. Any thoughts on that?

@2color we could do it, but I'm not sure we want that by default. We had false positives in the past, so there should be a way of disabling the denylist. In #72 (comment) I had the idea of applying the denylist only on requests to a specific domain, allowing users to switch between delegated routing with/without the denylist applied if it ever causes trouble (see the sketch below).
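
A sketch of that idea (illustration only; Someguy itself is written in Go, and the hostname is just an example):

```ts
// Filter routing responses solely when the request arrived on a filtering
// hostname, so users can switch domains to opt out.
const FILTERING_HOSTS = new Set(['delegated-ipfs.dev']) // illustrative

async function handleProviders (
  req: Request,
  anchor: string,                  // double-hashed anchor for the requested CID
  denylist: Set<string>,
  lookup: () => Promise<Response>  // the normal /routing/v1 lookup
): Promise<Response> {
  const host = req.headers.get('host') ?? ''
  if (FILTERING_HOSTS.has(host) && denylist.has(anchor)) {
    return new Response(null, { status: 410 }) // Gone, as on the gateways
  }
  return lookup() // unfiltered behaviour on every other hostname
}
```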

ps. "Step 1" is done: https://github.com/ipshipyard/waterworks-infra/pull/160 landed and https://bafybeib536upvgn7bflest7hqjvz4247ltqxn4hvjnjgatenmmufdak6aa.ipfs.inbrowser.dev returns 410