mholt/caddy-l4

UDP upstreams changes are not detected on config reload

jtackaberry opened this issue ยท 15 comments

Suppose the following server definition (using a yaml config adapter):

apps:
  layer4:
    servers:
      udp_55353:
        listen:
        - udp/:55353
        routes:
        - handle:
          - handler: proxy
            upstreams:
            - dial:
              - udp/192.168.0.1:53

This works fine. 55353/udp is being forwarded to 192.168.0.1:53 as expected:

2023/07/28 01:27:06.072 DEBUG   layer4.handlers.proxy   dial upstream   {"remote": "127.0.0.1:34036", "upstream": "192.168.0.1:53"}

Now I update the config file and change the single upstream to a different one:

<snip>
            - dial:
              - udp/10.152.128.10:53

And reload the configuration dynamically with caddy reload:

2023/07/28 01:33:59.060 INFO    admin.api       received request        {"method": "POST", "host": "localhost:2019", "uri": "/load", "remote_ip": "127.0.0.1", "remote_port": "55880", "headers": {"Accept-Encoding":["gzip"],"Content-Length":["3539"],"Content-Type":["application/json"],"Origin":["http://localhost:2019"],"User-Agent":["Go-http-client/1.1"]}}
2023/07/28 01:33:59.063 INFO    admin   admin endpoint started  {"address": "localhost:2019", "enforce_origin": false, "origins": ["//localhost:2019", "//[::1]:2019", "//127.0.0.1:2019"]}
2023/07/28 01:33:59.065 DEBUG   layer4  listening       {"address": "udp/[::]:55353"}
2023/07/28 01:33:59.068 INFO    autosaved config (load with --resume flag)      {"file": "/home/tack/.config/caddy/autosave.json"}
2023/07/28 01:33:59.068 INFO    admin.api       load complete
2023/07/28 01:33:59.072 INFO    admin   stopped previous server {"address": "localhost:2019"}

But the updated upstream isn't recognized. New packets continue to go to the old, now-removed endpoint:

2023/07/28 01:36:44.250 DEBUG   layer4.handlers.proxy   dial upstream   {"remote": "127.0.0.1:46652", "upstream": "192.168.0.1:53"}

This is caddy-l4 built on July 21.

$ caddy version
v2.6.4 h1:2hwYqiRwk1tf3VruhMpLcYTg+11fCdr8S3jhNAdnPy8=
mholt commented

Interesting; thanks... I am a bit busy right now trying to get Caddy 2.7 ready to release, but if someone else (or yourself) would like to look into this before I can, that'd be very welcomed!

Either the old server is not actually getting shut down... or there's something funky with UDP listeners specifically. Or both. :)

Sure, I'll see if I can tackle it this weekend or next.

Good luck with the release. :)

mholt commented

Thank you ๐Ÿ˜€

@mholt so after a bit of poking, you're right that the UDP server isn't actually shutting down. After config reload, a new UDP server is started, but the previous one is still running. Each reload starts a new UDP server goroutine, all of them competing for packets on the same port, each dispatching to their own handlers based on the config at that point in time.

For each packetConn, App.Stop() calls Close(). The packetConns are of type *caddy.fakeClosePacketConn and there Close() says it will only actually close the underlying socket if there aren't any other references to it. Presumably since both App and Server both hold references, App.Stop() is ineffective and the socket never gets closed, which is required for the Server.servePacket() goroutine to terminate.

Given this, it doesn't look like a trivial fix, so I thought I'd pause here and see if you have any suggestions?

BTW, is there a reason that l4proxy's Handle() is using a value receiver rather than a pointer receiver?

mholt commented

Dang. I thought I fixed this a long time ago. Definitely have seen this before... I wonder, we did change how UDP packets are wrapped a bit in 2.7 -- if you upgrade to use Caddy 2.7.3, does the problem still occur? (I'm guessing it will, but I want to rule out anything easy first.)

BTW, is there a reason that l4proxy's Handle() is using a value receiver rather than a pointer receiver?

Not really, from the looks of it. It doesn't modify any of the values of the struct, so a pointer isn't necessary, maybe. (Or so I think. I got bit by this the other day though.)

if you upgrade to use Caddy 2.7.3, does the problem still occur? (I'm guessing it will, but I want to rule out anything easy first.)

Yep, this was all with a fresh build via xcaddy which was picking up caddy v2.7.3.

Not really, from the looks of it. It doesn't modify any of the values of the struct, so a pointer isn't necessary, maybe. (Or so I think. I got bit by this the other day though.)

It's probably innocuous (notwithstanding some initial confusion when I was sprinkling additional debug logs throughout the code), but it was just one of those "one of these things is not like the other" situations.

Seeing as this method is called per-packet with UDP, my initial thought was that copying the Handler struct is probably worth avoiding. On the other hand, the struct itself is just 5 pointers and a string, and considering we're spawning a new goroutine per packet, a value receiver is surely the least of our concerns performance-wise. :)

mholt commented

Ok, so I've been testing some things out and it does look like the listener itself is being cleaned up properly (in this case, reloading the config without changing the port leaves the UDP listener open, for zero-downtime reload); the problem is that the server isn't kicked out of its loop. I'll look into that. I coulda sworn it used to work...

the problem is that the server isn't kicked out of its loop

Yep, exactly. That's what I meant by "the socket never gets closed, which is required for the Server.servePacket() goroutine to terminate."

Presumably this is what App.Stop() is supposed to achieve by virtue of calling Close() on the socket, except Close() doesn't actually close it because there are multiple references, so the server loop never terminates.

It just wasn't clear to me how best to fix this, not really knowing the internals very well.

mholt commented

Ok, I'm fairly confident have a fix, but it requires a patch in Caddy. Which is fine btw -- I think it's an improvement. Just might need some cleanup.

I didn't realize that SO_REUSEPORT also works for UDP! https://lwn.net/Articles/542629/ -- maybe I just didn't think about it at the time because the reports were all with TCP and that's where I was focused.

So my fix uses SO_REUSEPORT, which is what we use for TCP sockets and servers, and it works marvelously AFAIK.

I will push a branch soon to this repo and the Caddy repo. Both will require some review and testing. Stay tuned!

Excellent! I'm very happy to take it out for a spin once you have the branches pushed.

mholt commented

@jtackaberry I think you can actually test this patch without any changes to caddy-l4:

$ xcaddy build udp-reuseport --with github.com/mholt/caddy-l4@master

(Edit: J/k, I think you do need one tiny change from this repo, so I put @master which I think will do the trick)

@jtackaberry I think you can actually test this patch without any changes to caddy-l4:

Looks good here! Reloads are now picking up changes to the upstreams.

And doing a tight loop of reloading configs, swapping between two of them with different upstream lists, caddy-l4's memory use remains stable, unlike before where memory use steadily climbs as we expect due to the accumulation of server loop goroutines.

Thanks very much!

(Off-topic stuff follows.)

Unrelated to this issue, though tying back to a previous comment I made above, are we at all concerned about the performance implications of creating a new goroutine for each UDP packet? For my immediate use case (DNS load balancing) I'm certainly not, but for high throughput cases like HTTP/3 the packet rates start to matter.

I did a hasty comparison that reads 50GB from /dev/zero in 1350 byte blocks (typical QUIC packet payload size) and compares a goroutine per packet versus a long-running goroutine that processes the packets via a channel. Adjusting for wall clock and CPU utilization, the goroutine-per-packet case does ~570MB/s per core, while the single-goroutine-with-channel case does ~1300MB/s per core.

I have to say, the goroutine-per-packet approach did a lot better than my intuition expected, but there's a fair bit of performance left on the table. I wonder if you've considered this already?

Having said that, for HTTP/3 it seems like a bit more work might be needed in caddy-l4, because we'll presumably want some sort of stickiness of CID to upstream. My use case with caddy-l4 is to act as an Internet-facing L4 load balancer in front of Kubernetes, which proxies a variety of traffic, leaning heavily on SNI-based routing, and most of it (though not all) is proxying for ingress-nginx in the cluster. I've built a custom K8s load balancer controller that generates configuration for caddy-l4 and it's been working delightfully, I have to say. (I originally used HAProxy, but its promise of hitless reloads is a complete lie, whereas caddy-l4 actually does the right thing.)

Openresty (which ingress-nginx uses) is only now just starting to work on HTTP/3 support, so this area is beginning to get interesting for me. :)

mholt commented

And doing a tight loop of reloading configs, swapping between two of them with different upstream lists, caddy-l4's memory use remains stable, unlike before where memory use steadily climbs as we expect due to the accumulation of server loop goroutines.

Awesome. Thank you for testing that, I remember when that happened. ๐Ÿ˜…

I did a hasty comparison that reads 50GB from /dev/zero in 1350 byte blocks (typical QUIC packet payload size) and compares a goroutine per packet versus a long-running goroutine that processes the packets via a channel. Adjusting for wall clock and CPU utilization, the goroutine-per-packet case does ~570MB/s per core, while the single-goroutine-with-channel case does ~1300MB/s per core.

Huh, thanks for doing that too. That's very interesting!

I have to say, the goroutine-per-packet approach did a lot better than my intuition expected, but there's a fair bit of performance left on the table. I wonder if you've considered this already?

I have not. Performance hasn't been a number-1 goal for this project since it inevitably will perform worse than a server that is not multiplexing protocols ๐Ÿ™ƒ

That said, I'm very open to making performance improvements! I'm not sure how a channel-based approach would work but if you can illustrate, maybe we can make adjustments. It clearly performs better...

Having said that, for HTTP/3 it seems like a bit more work might be needed in caddy-l4, because we'll presumably want some sort of stickiness of CID to upstream. My use case with caddy-l4 is to act as an Internet-facing L4 load balancer in front of Kubernetes, which proxies a variety of traffic, leaning heavily on SNI-based routing, and most of it (though not all) is proxying for ingress-nginx in the cluster. I've built a custom K8s load balancer controller that generates configuration for caddy-l4 and it's been working delightfully, I have to say. (I originally used HAProxy, but its promise of hitless reloads is a complete lie, whereas caddy-l4 actually does the right thing.)

Wow, this is... quite surprising actually! I'm thrilled to hear this!

We can definitely add support for stickiness. This module already has some support for various protocols, so it's OK if we add support for HTTP/3-specific features for example.

That said, I'm very open to making performance improvements! I'm not sure how a channel-based approach would work but if you can illustrate, maybe we can make adjustments. It clearly performs better...

I started poking at this with the intention of creating a PR to demonstrate the idea, but the closer I look at the UDP proxying code the more something doesn't seem right. I'll open a separate issue to continue the discussion there.

mholt commented

Sounds good, thanks!