mholt/caddy-l4

Incorrect Handling of SNI: No Certificate Available (Regression)

elcore opened this issue · 17 comments

After updating the Caddy Layer4 module to e491c44, Caddy erroneously attempts to handle certificates for domains when it should instead proxy and forward connections based on SNI.

Logs:

{"level":"error","ts":1723754927.6938345,"logger":"layer4","msg":"handling connection","remote":"127.0.0.1:35770","error":"no certificate available for 'ud-chat.signal.org'"}
{"level":"error","ts":1723754927.7430856,"logger":"layer4","msg":"handling connection","remote":"127.0.0.1:35772","error":"no certificate available for 'chat.signal.org'"}
{"level":"error","ts":1723754927.9744139,"logger":"layer4","msg":"handling connection","remote":"127.0.0.1:35788","error":"no certificate available for 'ud-chat.signal.org'"}
{"level":"error","ts":1723754928.1224709,"logger":"layer4","msg":"handling connection","remote":"127.0.0.1:35802","error":"no certificate available for 'chat.signal.org'"}
{"level":"error","ts":1723754929.6092706,"logger":"layer4","msg":"handling connection","remote":"127.0.0.1:60324","error":"no certificate available for 'ud-chat.signal.org'"}
...

Configuration:

{
  "apps": {
    "tls": {
      "certificates": {
        "automate": [
          "sub.example.com"
        ]
      }
    },
    "layer4": {
      "servers": {
        "signal_proxy": {
          "listen": [
            ":443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "handler": "tls"
                }
              ]
            },
            {
              "match": [
                {
                  "tls": {
                    "sni": [
                      "chat.signal.org",
                      "ud-chat.signal.org",
                      "storage.signal.org",
                      "cdn.signal.org",
                      "cdn2.signal.org",
                      "cdn3.signal.org",
                      "api.directory.signal.org",
                      "cdsi.signal.org",
                      "sfu.voip.signal.org",
                      "svr2.signal.org",
                      "updates.signal.org",
                      "updates2.signal.org",
                      "backend1.svr3.signal.org",
                      "backend2.svr3.signal.org",
                      "backend3.svr3.signal.org"
                    ]
                  }
                }
              ],
              "handle": [
                {
                  "handler": "proxy",
                  "upstreams": [
                    {
                      "dial": [
                        "{l4.tls.server_name}:443"
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  }
}
  1. Reverting to 94cd399 fixes the issue

What happens if you upgrade to afa78d7? I'm confused by how commit 94cd399 causes it.

No, upgrading to afa78d7 does not fix it, I cherry-picked the regression to e491c44. 94cd399 is the last working commit (it did not cause it, that was e491c44)

The upstream address is being replaced aggressively too early here with a fresh replacer (not from context) that doesn't have the matched SNI, causing it to be empty because it's ReplaceAll. It used to being replaced at the dialPeer call which happens inside the Handle method of the proxy handler.

@vnxme, any chance you can take a look?

@elcore, I have come up with a workaround. but it requires duplication. You can make a match/proxy-handle block for each upstream address.

Hmm, interesting 🤔. I think for now, pinning to the working commit is temporarily fine (after all it’s inside a Docker container), unless you’d say that not updating would be not wise.

Keep the pin. I like that you have a valid use case and a simple reproducer 🙂

@mohammed90 I'll look into the issue and propose a solution. @elcore, thank you for a valid use case example.

@elcore First of all, you have no certificate available for 'ud-chat.signal.org'-like errors because you have a route with a single tls handler before everything else in your list of routes. I think you don't need this route at all if you would like to proxy https (and assuming you don't have valid certificates for these domain names). Alternatively, it should be placed after the route where you match the third-party domains and proxy them.

with a fresh replacer (not from context) that doesn't have the matched SNI

@mohammed90 We've already discussed that, and I mentioned that the context is empty at provision. That's why there is no point in trying to extract the replacer from the context - it won't work, I've checked.

The upstream address is being replaced aggressively too early here

causing it to be empty because it's ReplaceAll.

But you are right stating that the upstream address gets empty in this issue because {l4.*} placeholders don't exist at provision. What do you think if we use ReplaceKnown there instead of ReplaceAll?

It’s a layered approach with TLS in TLS hence the above configuration. I currently cannot test your idea.

a layered approach with TLS in TLS

This way you have to use subroute. Like this:

{
	layer4 {
		:443 {
			route {
				tls
				subroute {
					@cf tls sni one.one.one.one
					route @cf {
						proxy 1.1.1.1:443
					}
				}
			}
		}
	}
}

I will certainly try this ASAP. Nonetheless I am surprised that the previous config is now broken with the latest commit(s). Sidenote: This seems like a Caddyfile configuration, right?

What do you think if we use ReplaceKnown there instead of ReplaceAll?

This would fix it.

This way you have to use subroute. Like this:

This approach is the workaround I suggested earlier, which requires duplication. Elcore's existing config takes advantage of the matched value being known at runtime without having to make a separate route/branch for every domain name.

Nonetheless I am surprised that the previous config is now broken with the latest commit(s).

I'm surprised your config actually used to work because it's conceptually wrong (missing subroute, as mentioned above). But you are right that {l4.tls.server_name}:443 should be an acceptable dial address, and we will fix that shortly.

This seems like a Caddyfile configuration, right?

Exactly. It's a part of the latest commits as well.

Elcore's existing config takes advantage of the matched value being known at runtime without having to make a separate route/branch for every domain name.

Yes, I can see that. I suppose you haven't got my idea. The final working (after a fix, and given Elcore needs TLS-in-TLS) config will be as follows:

{
	layer4 {
		:443 {
			route {
				tls
				subroute {
					@signal tls sni chat.signal.org ud-chat.signal.org storage.signal.org cdn.signal.org cdn2.signal.org cdn3.signal.org api.directory.signal.org cdsi.signal.org sfu.voip.signal.org svr2.signal.org updates.signal.org updates2.signal.org backend1.svr3.signal.org backend2.svr3.signal.org backend3.svr3.signal.org
					route @signal {
						proxy {l4.tls.server_name}:443
					}
				}
			}
		}
	}
}

I suppose you haven't got my idea

I get your idea, and saying the route is unnecessary.

@elcore Can you try xcaddy build --with github.com/mholt/caddy-l4=github.com/WeidiDeng/caddy-l4@prefetch-fix?

@WeidiDeng Your fix does not seem to work - Same issue as above
@vnxme Your fix does work.