ipfs/service-worker-gateway

Trailing / for UnixFS directory normalization is missing

achingbrain opened this issue ยท 11 comments

Trying to click a footnote takes you back to the Wikipedia homepage instead of the footer of the current page:

image

This is because the footnote href is a relative link up one directory:

image

All local links seem to have this problem:

image

@achingbrain do you have specific article to reproduce? Check if this was also broken when loaded via dweb.link gateway.

iirc this ancient wikipedia mirror has JS that fixes relative links post-load, so if that JS file failed to load via SW, links will be broken.

Sure, I posted a bunch in #61

E.g. click any linked text here: https://bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze.ipfs.sw.sgtpooki.com/wiki/Archaeology

iirc this ancient wikipedia mirror has JS that fixes relative links post-load

Ah, ok - there are some js errors in the console - a syntax error in the file, maybe that's it. Is there a later CID for the mirror?

Ok, I know what is the problem.
Yet another signal we should set up https://github.com/ipfs/gateway-conformance with proper coverage.

SW gateway does not normalize directory URLs. They must end with /. Gateway ensure that by redirecting UnixFS directory paths to URL that ends with trailing /.

Example:
https://bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze.ipfs.dweb.link/wiki/Archaeology returns HTTP 302 to https://bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze.ipfs.dweb.link/wiki/Archaeology/ and that makes relative links work correctly.

This is not specific to Wikipedia, this is a bug on all UnixFS directories (with index.html)

I suspect we need to fix this inside of verified-fetch because we need to know if the terminal element of content path is a directory. Support https:// URLs, and return HTTP 302 with fixed URL when Accept header in request includes text/html.

They must end with /

Why is this the case?

I think we should be able to serve a directory with or without a trailing slash. My thinking is that it's not a requirement for apache/nginx/etc so people will author their websites with either style, plus forcing redirects for (potentially) every directory request is expensive.

If this is necessary, a 301 would better so browsers don't re-request the invalid URL?

we need to know if the terminal element of content path is a directory

Unless I'm missing something we can detect this as we traverse the DAG - we need the terminal node to display the content, at which point we can interrogate the UnixFS metadata to find out if it's a directory or not, no need to guess from the path.

[..] that it's not a requirement for apache/nginx/etc so people will author their websites with either style

Correct, if one only cares about their single website, loaded from a single URL, this does not matter. If we want to build a reliable ecosystem where people can publish in permissionless fashion, and we have 3 addressing schemes, we need to normalize, otherwise we end up with bugs like the one described in this issue.

Trailing slash decides the anchor of relative links, and how things like service worker scoping works, has both practical and security ramifications.

This is also to ensure relative links work the same way on legacy path gateways: /ipfs/cid/ (where you technically can skip trailing /), and origin-isolated contexts: https://cid.ipfs.dweb.link (path is implicitly /) and ipfs://cid (where path is implicitly / and can't be skipped and turned to `` empty)

We have it as explicit part of gateway specs and conformance tests for this very reason:

Added reasoning why we do this in ipfs/specs#464.

we need the terminal node to display the content, at which point we can interrogate the UnixFS metadata to find out if it's a directory or not, no need to guess from the path.

Yes, in go-ipfs code we made the decision to redirect after we've read terminal element, one important thing is to preserve any query params:
https://github.com/ipfs/boxo/blob/89248079aaf6689e010d5ea1dd689ed717f0c8b6/gateway/handler_unixfs_dir.go#L52-L55 (ignore ?go-get, it is a legacy Kubo thing)

And yes, we want 301 as the final one, just easier to test with 302 so I always start with that :)

Quick update on the state of things

Took {redirect:'manual'} from ipfs/helia-verified-fetch#5 for a spin in #82 but it turned out to be not enough because verifid-fetch does not support http(s):// requests yet.

Service worker gateway needs to convert http(s):// URLs to ip[fn]s:// before passing to verified-fetch.

What happens

With ipfs/helia-verified-fetch#5, resolving http://cid.ipfs.localhost:8080/dir can

  1. be requested as verifiedFetch('ipfs://cid/dir', {redirect:'manual'})
  2. but that produces Response with status 301 and header Location: ipfs://cid/dir/ (which is not supported by browser).

Ideal state

I believe we should not spend time on temporary translation glue code in this project, instead spend time on implementing support for http(s):// URLs upstream, in verified-fetch.

What we want:

  1. Request URL as-is: verifiedFetch('http://cid.ipfs.localhost:8080/dir', {redirect:'manual'})
  2. produces Response with status 301 and header Location: http://cid.ipfs.localhost:8080/dir/ and that is all.

I think we want to do this right, in a way that is future-proof.
This means fixing this bug depends on ipfs/helia-verified-fetch#11 landing upstream first, and then re-enabling {redirect:'manual'}

https://github.com/ipfs-shipyard/helia-service-worker-gateway/blob/440bf9152d3410ef90e69154c63f32778acb7e31/src/lib/heliaFetch.ts#L162

If the user does verifiedFetch('ipfs://cid/dir', {redirect:'manual'}) would we want the redirect location to be the browser-incompatible ipfs://cid/dir/ or the slightly fictitious https://cid.ipfs.local/dir/?

At the moment ipfs://cid/dir will redirect to ipfs://cid/dir/ and https://cid.ipfs.foo/dir will redirect to https://cid.ipfs.foo/dir/

Tests to prevent regressions are at ipfs/helia-verified-fetch#15

this should be done, the verified-fetch version isn't updated in helia-service-gateway yet though?

Just to confirm, @achingbrain this is in https://www.npmjs.com/package/@helia/verified-fetch 1.1.0?

If so, we need to refactor this repo to pass original request URL (http(s)://)

ps. ah, yes, ipfs/helia-verified-fetch#15 is adding test to guard the behavior we need, thanks!

this will be fully resolved once #151 and ipfs/helia-verified-fetch#33 are merged and we update to the latest verified-fetch version