haproxy/haproxy

Inbound connections stop and CLOSE_WAIT skyrockets

apenney-beeswax opened this issue · 98 comments

Detailed description of the problem

We're seeing weird behavior that's been consistent across multiple versions of haproxy (2.1, 2.2, and now 2.3), and I'm at the point where I suspect we might be hitting a strange bug.

We have a pool of haproxies that accept traffic from ad exchanges (lots of POSTs) and send it to backend servers for processing. There are no humans involved, and normally traffic is very consistent.

Every so often we'll go from 20k established connections to a much higher number of connections, hundreds of thousands of CLOSE_WAIT connections, and then we'll run out of TCP memory on the node.

I've been working on tuning and trying to stop this from happening, and now I have a new weird scenario to run by you.

1: Server is operating normally, 50k established connections, plenty of traffic.
2: Suddenly new connections to :80 start to fail, dns healthchecks fail.
3: Existing connections continue to send traffic, which is sent to the backend, as normal.
4: Trying to curl from the box itself to http://localhost/ fails

10-4-18-176:~$ curl http://localhost/healthz
curl: (28) Failed to connect to localhost port 80: Connection timed out

Connections right now:

10-4-18-176:~$ ss | grep ESTAB | wc -l
37714
10-4-18-176:~$ ss | grep CLOSE-WAIT | wc -l
348901

These connections in CLOSE-WAIT never close. They just sit there until we give up and stop/start haproxy.
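
For reference, ss can also filter on TCP state directly instead of grepping the whole table; a rough equivalent (assuming an iproute2 recent enough to support -H for suppressing the header) would be:

ss -Htn state established | wc -l
ss -Htn state close-wait | wc -l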

I managed to capture show sess and show fd but I'd like to share those privately if at all possible.

Expected behavior

Once our healthcheck fails and incoming connections stop, I'd expect haproxy to close these CLOSE-WAITs and recover to the point where it can start taking new :80 connections. This never happens, however!

Steps to reproduce the behavior

I'm not at all sure I could give you reproduction steps. We can operate for hours or days in a steady state before suddenly hitting some kind of wall and going from 0 CLOSE-WAITs to hundreds of thousands, instantly broken.

Do you have any idea what may have caused this?

Not a clue, I can share metrics, configs, etc, but it happens suddenly and instantly.

Do you have an idea how to solve the issue?

Afraid not; all I've done so far is tune things and try to stop it from happening.

What is your configuration?

Some quick detail first - we pin IRQs to the first 8 cores and then haproxy to the other 8:

global
  log /dev/log local0 warning alert
  log /dev/log local1 crit alert
  stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
  stats timeout 30s
  user haproxy
  group haproxy
  daemon
  maxconn 2000000
  # Default SSL material locations
  ca-base /etc/ssl/certs
  crt-base /etc/ssl/private
  # Default ciphers to use on SSL-enabled listening sockets.
  # For more information, see ciphers(1SSL). This list is from:
  #  https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
  ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
  ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
  # Default ciphers to use on SSL-enabled listening sockets.
  # For more information, see ciphers(1SSL). This list is from:
  #  https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
  ssl-default-server-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
  ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
  cpu-map auto:1/1-8 8-15

resolvers localdns
  nameserver local1 127.0.0.1:8600
  # enable EDNS0 with larger accepted size to allow for large list of IPs
  accepted_payload_size 8192
  hold nx 40000s
  hold refused 40000s
  hold timeout 40000s
  hold other 40000s
  # do a DNS resolve every 5s both after successful and failed resolves
  timeout resolve 5s
  timeout retry 5s
  # we keep trying for 300s total before DNS resolution fails
  resolve_retries 60

defaults
  log global
  mode http
  option httplog
  option dontlognull
  timeout connect 5000
  timeout client 120s
  timeout server 120s
  errorfile 400 /etc/haproxy/errors/400.http
  errorfile 403 /etc/haproxy/errors/403.http
  errorfile 408 /etc/haproxy/errors/408.http
  errorfile 500 /etc/haproxy/errors/500.http
  errorfile 502 /etc/haproxy/errors/502.http
  errorfile 503 /etc/haproxy/errors/503.http
  errorfile 504 /etc/haproxy/errors/504.http
  timeout queue 1s
  option splice-request
  option splice-response

frontend admin
  bind *:8100
  mode http
  maxconn 400
  option httplog
  option dontlognull
  log global
  stats uri /admin/stats

frontend stats
  bind *:8404
  mode http
  option http-use-htx
  http-request use-service prometheus-exporter if { path /metrics }
  stats enable
  stats uri /stats
  stats refresh 10s

frontend lb
  bind *:80 process 1/1
  bind *:80 process 1/2
  bind *:80 process 1/3
  bind *:80 process 1/4
  bind *:80 process 1/5
  bind *:80 process 1/6
  bind *:80 process 1/7
  bind *:80 process 1/8
  bind *:443 process 1/1 ssl crt /etc/haproxy_ssl/
  bind *:443 process 1/2 ssl crt /etc/haproxy_ssl/
  bind *:443 process 1/3 ssl crt /etc/haproxy_ssl/
  bind *:443 process 1/4 ssl crt /etc/haproxy_ssl/
  bind *:443 process 1/5 ssl crt /etc/haproxy_ssl/
  bind *:443 process 1/6 ssl crt /etc/haproxy_ssl/
  bind *:443 process 1/7 ssl crt /etc/haproxy_ssl/
  bind *:443 process 1/8 ssl crt /etc/haproxy_ssl/
  mode http
  default_backend fes
  maxconn 60000
  rate-limit sessions 12000
  timeout client-fin 1s
  timeout http-request 120s

backend fes
  mode http
  balance roundrobin
  option httpchk GET /healthz
  timeout server 120s
  timeout connect 10s
  http-reuse aggressive
  retries 0
  retry-on none
  http-request disable-l7-retry
  server-template fes- 400 fes.service.consul:80 check resolvers localdns resolve-prefer ipv4 inter 5s fastinter 1s init-addr none maxconn 360

Output of haproxy -vv and uname -a

HA-Proxy version 2.3.0-1ppa1~focal 2020/11/12 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2022.
Known bugs: http://www.haproxy.org/bugs/bugs-2.3.0.html
Running on: Linux 5.4.0-1029-aws #30-Ubuntu SMP Tue Oct 20 10:08:09 UTC 2020 aarch64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -O2 -fdebug-prefix-map=/build/haproxy-nVVZGf/haproxy-2.3.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=16).
Built with OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
Running on OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with network namespace support.
Built with the Prometheus exporter as a service
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.34 2019-11-21
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 9.3.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP       side=FE|BE     mux=H2
            fcgi : mode=HTTP       side=BE        mux=FCGI
       <default> : mode=HTTP       side=FE|BE     mux=H1
       <default> : mode=TCP        side=FE|BE     mux=PASS

Available services :
        prometheus-exporter

Available filters :
        [SPOE] spoe
        [CACHE] cache
        [FCGI] fcgi-app
        [COMP] compression
        [TRACE] trace

This is on ARM, but we see the same issue on regular Intel instances:

Linux 10-4-18-176 5.4.0-1029-aws #30-Ubuntu SMP Tue Oct 20 10:08:09 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

If HAProxy crashed: Last outputs and backtraces


Additional information (if helpful)

I'm suspecting it could be a slowdown of the connect() syscall trying to find a source port on the backend, slowing down the whole system and letting incoming connections accumulate. In this case the CLOSE_WAIT could be related to the clients timing out after sending their request, and aborting with only a FIN. The socket will then only be purged once the connection is accepted, processed and closed.

Are you seeing a CPU usage increase when this happens ?

Just thought I'd add that we've got net.ipv4.ip_local_port_range tuned to "1024 65535" to give us as many ports as we can; I'm not sure how I'd confirm that we're actually out of source ports. The incoming traffic is from a wide range of IPs, so assuming I understand how things work (which isn't a given), we should have enough 4-tuple combinations to handle the situation.
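
In case it's useful, a rough way to sanity-check source-port usage would be to count established outbound connections per backend address and compare that against the ~64k ports the range above allows per destination. A sketch (ss column positions can vary between iproute2 versions):

ss -Htn state established '( dport = :80 )' | awk '{print $4}' | cut -d: -f1 | sort | uniq -c | sort -rn | head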

Had another full outage and a bunch of fascinating information came out of it. I'm going to try to distill some of it here:

1/ We saw the same thing while rolling out 2.3.1: massive CLOSE_WAIT counts.
2/ We tried all sorts of things, stopping/starting, but kept seeing the same behavior: a slow overwhelm of haproxy and then massive (up to 1.2M on some nodes) CLOSE_WAIT counts.
3/ We tried experimenting with removing session limits and bumping maxconn, thinking maybe our own limits caused these CLOSE_WAITs.
4/ At one point I added a new machine and within 3 minutes it was broken:

root@lb-eu-west-1-ip-10-3-19-224:/home/ubuntu# ss | grep CLOSE | wc -l
479062
root@lb-eu-west-1-ip-10-3-19-224:/home/ubuntu# ss | grep CLOSE | wc -l
482874
root@lb-eu-west-1-ip-10-3-19-224:/home/ubuntu# uptime
 14:53:11 up 3 min,  1 user,  load average: 7.69, 2.93, 1.08
root@lb-eu-west-1-ip-10-3-19-224:/home/ubuntu# ss | grep ESTAB | wc -l
76600
 
same time period:
  
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:04 2020] TCP: too many orphaned sockets
[Mon Nov 16 14:55:09 2020] net_ratelimit: 77596 callbacks suppressed

So within 3 minutes we already had massive TCP issues.

I managed to grab a perf flamegraph from this timeframe (and did some strace digging) hoping to see something related to select() or something to give me a clue.

Captured with sudo perf record -F 99 -p 5412 -g -- sleep 60, I get graphs that look like:

[Screenshot: perf flamegraph during the incident (2020-11-16, 2:23 PM), dominated by SSL processing]

Is it possible we're just overwhelmed with SSL connections and haproxy sheds them in some way (hence the CLOSE-WAITs)? I don't know how else to explain ramping up to ~500k CLOSE_WAITs within a minute or two of coming online. The client side of haproxy is a bunch of automated exchanges, and they can definitely be very aggressive in trying to establish enough connections.

It's possible by just perf'ing the process I'm seeing only a fraction of the real activity on the box, but I didn't expect to see so much SSL.

From a recovered standpoint:

[Screenshot: perf flamegraph after recovery (2020-11-16, 2:58 PM); SSL is no longer visible]

SSL isn't even visible in this graph when we're operating normally.

Ah, that's pretty interesting. Indeed SSL is being caught with its finger in the jam pot. But I'm not that sure it's entirely the culprit here. Your "too many orphaned sockets" messages indicate that the kernel had to destroy responses to clients because the number of orphans was too high. An orphan is an established (or FIN_WAIT) connection that is lingering in the system, still delivering data to the other end while it's already closed in the process; typically the last blocks of a response before a close. The fact that the system destroys them results in truncated responses for your clients, which possibly count the connection as being in error and make them want to redo a full handshake (which is expensive). Please have a look at /proc/sys/net/ipv4/tcp_max_orphans. Usually it's quite small (64k). You should make it at least as large as your global maxconn (which is really huge in your case and probably not justified).

In addition, you need to avoid the accumulation effect that happens here: your clients establish a connection, which is accepted by the kernel; the processing of this connection is delayed by the temporary lack of spare CPU resources; the client aborts, and the connection still adds to the load and delays further connections. If you see 500k CLOSE_WAITs, it's because you accept them; instead they should be blocked by limiting the size of the backlog, which defaults to the same value as maxconn. Here, since your maxconn is 60k, you have a 60k backlog for each listener; times 8, that gives you the 500k you've observed. Please set a much lower value like "backlog 1000" in your frontend. That should be amply enough.
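
Roughly, the two changes would look like this (the numbers are only examples to adapt; the sysctl should be at least as large as your global maxconn):

# e.g. in /etc/sysctl.d/, or via sysctl -w
net.ipv4.tcp_max_orphans = 2000000

frontend lb
  backlog 1000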

Regarding the point on the limited number of source ports: it's unrelated to the number of clients, it's about the number of ports available to reach a given destination from the haproxy machine's source address. The simple fact that it recovers tends to tell me that it is not the problem, however, so we can set that track aside.

I very much appreciate this detailed response! I'm rolling out a change to tcp_max_orphans (it was 131072) and the lower backlog limit. I'll update this issue (and resolve it) if this helps!

Would we see any benefit from splitting SSL off to a new frontend with a lower "rate-limit sessions 200" setting? My thinking was this would help keep processing of new SSL connections down and stop it from consuming all the CPU.

[Oh, and as a side note, I'll dig into why maxconn is so big; it predates me, but I'm sure we could drop that down too.]

No, you must never ever use rate-limit sessions. It will lower your accept rate below what your machine is capable of. This is only used by shared hosting providers who want to prevent a single customer from saturating the whole machine. In your case you want SSL to be accepted as fast as possible. However, limiting the backlog ensures that those queued beyond a certain limit that you know you won't process in time will be dropped. This is much more efficient.

I just wanted to say thank you; so far this seems to be making a big difference (all the various changes combined). One client's 95th percentile dropped from 3500ms to "can't see it on the graph, but < 50ms" with this latest set of changes.

I don't know if this has all worked to fix our original issue (where we get overwhelmed), but we're in much better shape, and overnight we'll get to see how things look (as morning hits in Europe and traffic starts to scale up).

Once again, I appreciate all your help!

Great! You could probably improve the situation a little bit more and simplify your config at the same time:

  • you should remove the per-thread "bind" lines, just keep a single bind :80 and a single bind :443, removing the process directive. This will let haproxy assign inbound connections to the least loaded thread instead of having the kernel hash on the source port;
  • you should remove option splice-request and option splice-response if you're dealing with small objects, as these ones are really useful only beyond 64kB
  • if your average object size is close to the buffer size, you can try to slightly increase it (tune.bufsize in the global section), good results are often reported between 16 and 64kB (16k being the default).
  • maybe your client supports HTTP/2. If so, try adding "alpn h2,http/1.1" on your bind :443 lines.
  • given that you seem to have many servers, you may be interested in trying other LB algorithms, like leastconn or random, which might give better tail latencies.
  • depending on the application, you may want to try with "http-reuse always", which could further increase the HTTP reuse rate. It will require that the client accepts an error once in a while (whenever a server decides to close after pretending to keep the connection alive) but 1) it seems from your description that this is not an issue at all with your type of traffic, and 2) this could be covered using l7 retries if needed.

You can also drop "http-use-htx", as it's the default since 2.0 and the only supported mode since 2.1. In short, there is no supported version anymore where this option is not the default :-)
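
Put together, a sketch of the simplified frontend could look like this (keep the alpn part only if your clients end up supporting H2):

frontend lb
  bind *:80
  bind *:443 ssl crt /etc/haproxy_ssl/ alpn h2,http/1.1
  mode http
  maxconn 60000
  backlog 1000
  timeout client-fin 1s
  timeout http-request 120s
  default_backend fes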

I've applied almost all of these (minus tweaking the buffers, and http2, because none of our clients support it yet), and I've got this baking in our test environment. I feel like I keep repeating this, but I can't thank you enough for all these suggestions.

We're doing pretty well overall now. We still need to figure out how to benchmark TLS ciphers for these ARM CPUs (in normal circumstances we're burning almost 2 cores on some openssl function right now, after we tweaked the ciphers and switched to client-prefer), so we can optimize that a little bit more, but we've managed to survive the overnight scale-up of traffic, so I think the big problems are resolved at this point! (If you know of an easy way to figure out which SSL ciphers are fastest on a given CPU/architecture, let me know, because otherwise I think I'm going to have to do some painful testing over days.)
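
For what it's worth, one rough first pass, assuming OpenSSL's built-in benchmark is representative enough of haproxy's workload, would be openssl speed on the box itself:

# handshake-side primitives: RSA vs ECDSA signing on this CPU
openssl speed rsa2048 ecdsap256
# bulk ciphers: AES-GCM vs ChaCha20-Poly1305
openssl speed -evp aes-128-gcm
openssl speed -evp chacha20-poly1305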

I was feeling good about things but I have a horrible update @wtarreau - crashes, and lots of them:

Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]: Thread 5 is about to kill the process.
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:  >Thread 1 : id=0xffff856a1010 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=0 rqsz=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cpu_ns: poll=334689009511 now=337069231170 diff=2380221659
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              curr_task=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              call trace(8):
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae71d1e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xffff860055b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:  >Thread 2 : id=0xffff85695a40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=0 rqsz=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cpu_ns: poll=342489431690 now=344870619199 diff=2381187509
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              curr_task=0xffff48789570 (task) calls=2 last=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:                fct=0xaaaae70db4b0(process_stream) ctx=0xffff60226b30
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              strm=0xffff60226b30 src=35.195.8.243 fe=lb_ssl be=fes dst=fes-43
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              rqf=4c800002 rqa=8000 rpf=80000000 rpa=0 sif=EST,200028 sib=ASS,2000130
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              af=(nil),0 csf=0xffff47d64330,104000
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              ab=(nil),0 csb=(nil),0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cof=0xffff68153580,80001300:H1(0xffff52af9670)/SSL(0xffff64f212e0)/tcpv4(11367)
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              call trace(12):
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae71d1e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xffff860055b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae710eda8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae710eda8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae7110e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae70dbd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:  >Thread 3 : id=0xffff848b5a40 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cpu_ns: poll=341700111277 now=344054310256 diff=2354198979
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              curr_task=0xaaaae9b47800 (task) calls=1 last=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:                fct=0xaaaae70db4b0(process_stream) ctx=0xffff8016a020
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              strm=0xffff8016a020 src=85.91.44.181 fe=lb be=fes dst=fes-43
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              af=(nil),0 csf=0xaaaaee57fcf0,104000
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              ab=(nil),0 csb=(nil),0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cof=0xffff4c5865b0,80001300:H1(0xffff610c90a0)/RAW((nil))/tcpv4(17218)
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              call trace(12):
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae71d1e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xffff860055b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae710eda8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae710eda8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae7110e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae70dbd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:  >Thread 4 : id=0xffff7a087a40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=2 rqsz=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cpu_ns: poll=343776327076 now=346149106223 diff=2372779147
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              curr_task=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              call trace(8):
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae71d1e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xffff860055b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae719e3f0 [fc 03 00 aa c0 e8 ff b5]: listener_accept+0x588/0x2214
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae7055bac [eb 37 40 f9 ec 47 40 f9]: main+0x35e4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]: *>Thread 5 : id=0xffff79886a40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=1 rqsz=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cpu_ns: poll=336862122523 now=339235509581 diff=2373387058
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              curr_task=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              call trace(12):
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae7210ea8 [f5 ff ff 17 c0 b2 00 91]: wdt_handler+0x158/0x194
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xffff860055b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae719e3f0 [fc 03 00 aa c0 e8 ff b5]: listener_accept+0x588/0x2214
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae7055bac [eb 37 40 f9 ec 47 40 f9]: main+0x35e4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:  >Thread 6 : id=0xffff79085a40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=6 rqsz=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cpu_ns: poll=341231926935 now=343614222636 diff=2382295701
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              curr_task=0xffff64246d20 (task) calls=1 last=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:                fct=0xaaaae70db4b0(process_stream) ctx=0xffff65309580
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              strm=0xffff65309580 src=34.77.57.13 fe=lb_ssl be=fes dst=fes-15
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              af=(nil),0 csf=0xffff6ae90ff0,104000
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              ab=(nil),0 csb=(nil),0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cof=0xaaaaef76ed90,80001300:H1(0xfffef12d5c80)/SSL(0xffff0d36efd0)/tcpv4(10744)
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              call trace(14):
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae71d1e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xffff860055b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae70ec1cc [53 0e 00 90 42 0e 00 90]: main+0x99c04
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae70ec1cc [53 0e 00 90 42 0e 00 90]: main+0x99c04
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae7109cf0 [20 07 00 34 e4 f2 40 f9]: main+0xb7728
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae710eda8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae7110e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae70dbd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:  >Thread 7 : id=0xffff78884a40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=19 rqsz=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cpu_ns: poll=327207873416 now=329585629672 diff=2377756256
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              curr_task=0xaaaae9f26a90 (task) calls=1 last=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:                fct=0xaaaae70db4b0(process_stream) ctx=0xffff3a009430
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              strm=0xffff3a009430 src=185.33.223.53 fe=lb be=fes dst=fes-8
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              af=(nil),0 csf=0xffff6001fd80,104000
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              ab=(nil),0 csb=(nil),0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cof=0xffff50ff5f80,80001300:H1(0xaaaaf7ae69c0)/RAW((nil))/tcpv4(13475)
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              call trace(12):
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae71d1e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xffff860055b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae710eda8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae710eda8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae7110e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae70dbd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:  >Thread 8 : id=0xffff78083a40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=3 rqsz=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              cpu_ns: poll=327688563725 now=330070334552 diff=2381770827
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              curr_task=0
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              call trace(8):
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae71d1e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xffff860055b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae719e3f0 [fc 03 00 aa c0 e8 ff b5]: listener_accept+0x588/0x2214
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30408]:              | 0xaaaae7055bac [eb 37 40 f9 ec 47 40 f9]: main+0x35e4
Nov 19 10:01:37 fes-lb-eu-west-1-ip-10-3-16-208 haproxy[30406]: [NOTICE] 323/100137 (30406) : haproxy version is 2.3.1-1ppa1~focal

There's something really, really odd here: it appears that all threads were stuck in a busy loop that seriously looks like a deadlock (2.3 seconds) at the same time, and at different places. They're not that different, by the way; all of them seem to be in either connect() or accept(), which is extremely concerning. It makes me think that something deadlocks there. One thing that's common to the accept and connect paths is FD allocation. I don't see how that would be an issue, except if allocating one FD can take such huge amounts of time for an unknown reason. Your FD numbers are not extreme, around 17k for the largest I'm seeing here; nothing too difficult that would justify taking lots of time. Are your other traces similar?

Pretty much. If I pick a different machine at a different time:

Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:  >Thread 5 : id=0xffff9b710a40 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=1 rqsz=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              cpu_ns: poll=4480472149291 now=4482639789559 diff=2167640268
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              curr_task=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              call trace(8):
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4313e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xffffa80a65b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab42e03f0 [fc 03 00 aa c0 e8 ff b5]: listener_accept+0x588/0x2214
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4197bac [eb 37 40 f9 ec 47 40 f9]: main+0x35e4
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:  >Thread 6 : id=0xffff9af0fa40 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              cpu_ns: poll=4403750610638 now=4405911056576 diff=2160445938
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              curr_task=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              call trace(8):
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4313e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xffffa80a65b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab42e03f0 [fc 03 00 aa c0 e8 ff b5]: listener_accept+0x588/0x2214
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4197bac [eb 37 40 f9 ec 47 40 f9]: main+0x35e4
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:  >Thread 7 : id=0xffff9a70ea40 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=5 rqsz=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              cpu_ns: poll=4441616007002 now=4443792796366 diff=2176789364
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              curr_task=0xffff2df4fc90 (task) calls=1 last=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:                fct=0xaaaab421d4b0(process_stream) ctx=0xffff53546be0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              strm=0xffff53546be0 src=35.223.81.206 fe=lb_ssl be=fes dst=fes-63
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              af=(nil),0 csf=0xffff68e698e0,104000
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              ab=(nil),0 csb=(nil),0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              cof=0xffff47b084b0,80001300:H1(0xffff66442130)/SSL(0xffff67f8f4c0)/tcpv4(19690)
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              call trace(14):
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4313e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xffffa80a65b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab422e1cc [53 0e 00 90 42 0e 00 90]: main+0x99c04
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab422e1cc [53 0e 00 90 42 0e 00 90]: main+0x99c04
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab424bcf0 [20 07 00 34 e4 f2 40 f9]: main+0xb7728
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4250da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4252e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab421dd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]: *>Thread 8 : id=0xffff99f0da40 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              cpu_ns: poll=4434050130357 now=4436226299131 diff=2176168774
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              curr_task=0xffff8f0ddf90 (task) calls=1 last=0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:                fct=0xaaaab421d4b0(process_stream) ctx=0xaaaae74afd80
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              strm=0xaaaae74afd80 src=85.91.46.55 fe=lb be=fes dst=fes-9
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              rqf=4d00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000030
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              af=(nil),0 csf=0xffff4724a2b0,108000
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              ab=(nil),0 csb=(nil),0
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              cof=0xffff163a04a0,80001300:H1(0xffff30dfc620)/RAW((nil))/tcpv4(24577)
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              call trace(16):
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4352ea8 [f5 ff ff 17 c0 b2 00 91]: wdt_handler+0x158/0x194
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xffffa80a65b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4250da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4250da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab4252e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 19 12:38:55 fes-lb-eu-west-1-ip-10-3-20-27 haproxy[31099]:              | 0xaaaab421dd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc

It's pretty weird, I'm only seeing it in one region (another region with the exact same config, version of haproxy, etc, doesn't show this). We're in the middle of rotating the cluster to try and clear them (I tweaked the ssl backlog a little lower, shuffled some ciphers around, nothing major, just tiny tweaks) and we're hoping that will continue to help.

[Side note: we're in the middle of adding an ECDSA certificate hoping this will bring down the time we spend in TLS. Amazon confirmed their Graviton2 CPUs don't do RSA well and we have an RSA 2048 cert, so hopefully dropping the time spent there will help haproxy survive.]
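
For completeness, generating the ECDSA key and CSR is plain openssl; a minimal sketch (P-256, with a hypothetical CN; the CA side is unchanged):

openssl ecparam -name prime256v1 -genkey -noout -out ecdsa.key
openssl req -new -key ecdsa.key -out ecdsa.csr -subj "/CN=example.com"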

(Rotated instances; crashes still occurring, so this wasn't a one-time weird thing. The latest crash was on thread 7):

Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]: *>Thread 7 : id=0xffff932e1a40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=38 rqsz=0
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              cpu_ns: poll=554623654039 now=556673111472 diff=2049457433
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              curr_task=0xffff8d0e8a20 (task) calls=1 last=0
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:                fct=0xaaaadeabd4b0(process_stream) ctx=0xffff86357470
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              strm=0xffff86357470 src=213.19.162.106 fe=lb be=fes dst=fes-109
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              af=(nil),0 csf=0xffff87983fb0,104000
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              ab=(nil),0 csb=(nil),0
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              cof=0xffff806c84f0,80001300:H1(0xffff99507e50)/RAW((nil))/tcpv4(7397)
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              call trace(18):
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              | 0xaaaadebf2ea8 [f5 ff ff 17 c0 b2 00 91]: wdt_handler+0x158/0x194
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              | 0xffffa0bf15b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              | 0xaaaadeace1cc [53 0e 00 90 42 0e 00 90]: main+0x99c04
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              | 0xaaaadeace1cc [53 0e 00 90 42 0e 00 90]: main+0x99c04
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              | 0xaaaadeaebcf0 [20 07 00 34 e4 f2 40 f9]: main+0xb7728
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              | 0xaaaadeaf0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              | 0xaaaadeaf2e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 19 14:08:43 fes-lb-eu-west-1-ip-10-3-19-46 haproxy[2868]:              | 0xaaaadeabdd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc

It's systematically in the network code. This is really scary. Do you have a core? If so, it would be useful to get a full backtrace, especially for the latest one showing only one thread.

Could you try with fewer threads to see what changes? Especially once switched to ECDSA, crypto will be roughly 15 times faster, so you can use fewer threads. I'm having a hard time figuring out whether there's some form of excessive locking in the libc around syscalls or whether something is odd in the system itself. Just out of curiosity, are the conntrack modules loaded, and if so, could you check whether the hash size is large enough? (You can find it in /sys/modules/*conntrack*/parameters/*hashsize*, sorry for not having the exact names in mind.) Normally this should only affect the connect() code, but we could imagine that a blocked connect holds a lock that prevents accept() from working as well.
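
Something along these lines should show it (treat the exact paths as approximate):

lsmod | grep conntrack
cat /sys/module/nf_conntrack/parameters/hashsize
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max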

Do you have logs on your haproxy node? If so, could you check whether haproxy logged anything during the 2-3 seconds that preceded the crash, especially on the ones that indicate all threads were stuck? The idea here is that it could also be possible that there's a bug in the CPU usage accounting on this kernel and that it triggers way too early. Here it says each thread has been spending an uninterrupted 2 to 3s of CPU in a single function, so if that's true, haproxy will not have logged anything during that time. If you see some activity, it will mean that there's a reporting problem.

Based on your CPU usage reports, I guess the reporting is valid and that something in the system regularly slows everything down to the point of pausing everything. This could look like heavy table scans for urgent flushes (e.g. LVS used to need this, flushing many expired entries; I don't think conntrack does).

I didn't get a chance to try with fewer threads, but we did roll out ECDSA, which immediately stopped all the crashing. I think this was definitely an issue with resource starvation (which makes sense; we know we use 100% of all cores during the time we crash).

We don't load conntrack (I actually need to consider switching that on, we're moving to a monitoring system that uses conntrack for most network metrics, so I know it's definitely off right now).

Unfortunately we don't take any logs (the volume is too big), so there wasn't anything else I could check out. It feels less than ideal to say "well, we fixed it by using faster crypto" but it made an absolutely enormous difference on these CPUs and most of our pain seems to have vanished instantly.

OK good to know. Maybe your processors are getting hot when hammered too much and their frequency is significantly lowered to a point where it's basically not working anymore. But that would be surprising in a cloud environment, and it doesn't explain why only syscalls would be affected.

Regarding logs, you can enable sampling and emit only, say, 1 in 100. That's pretty convenient because it still lets you know whether things are normal or not, and what happened before an incident.
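
A minimal sketch of sampled logging in the global section (the 1:100 ratio is just an illustration to adapt to your volume):

global
  log /dev/log sample 1:100 local0 info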

Just to pick this up a little bit - I thought we were looking good (because generally we've been much better) but I realized we're still reliably crashing over and over:

Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]: *>Thread 7 : id=0xffff7b2e1a40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=2 rqsz=0
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              cpu_ns: poll=229753501297 now=231816767485 diff=2063266188
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              curr_task=0xffff7417e7a0 (task) calls=1 last=0
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:                fct=0xaaaab6d234b0(process_stream) ctx=0xffff810a0b50
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              strm=0xffff810a0b50 src=64.63.5.79 fe=lb be=fes dst=fes-16
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              af=(nil),0 csf=0xffff70d0d3f0,4000
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              ab=(nil),0 csb=(nil),0
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              cof=0xffff807dc590,80001300:H1(0xffff6cbbd710)/RAW((nil))/tcpv4(24059)
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              call trace(16):
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              | 0xaaaab6e58ea8 [f5 ff ff 17 c0 b2 00 91]: wdt_handler+0x158/0x194
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              | 0xffff88f7a5b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              | 0xaaaab6d56da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              | 0xaaaab6d56da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              | 0xaaaab6d58e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 24 14:50:30 fes-lb-eu-west-1-ip-10-3-22-59 haproxy[1720868]:              | 0xaaaab6d23d38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc

This is with much reduced CPU (so we're not maxing out constantly), but we still crash routinely.

@wtarreau Is it possible that setting ssl-mode-async would help here? My theory is that we might unblock the thread fast enough to not die. I'm going to experiment in the meantime with log sampling (I might have to use very aggressive sampling settings given the amount of traffic we push through these machines; we're doing about 33k QPS per node right now).

Just to briefly update, I tried a round of tweaks including ssl-mode-async:

ssl-mode-async
tune.bufsize 32768
tune.ssl.cachesize 60000
tune.ssl.lifetime 3600 # we have really long lived connections, we don't want to renegotiate constantly

I added tfo and option tcp-smart-connect as well, as they seemed relatively risk-free.
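
Roughly, those live as a bind option and a backend option respectively; a sketch rather than our exact config:

frontend lb
  bind *:443 ssl crt /etc/haproxy_ssl/ tfo

backend fes
  option tcp-smart-connect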

Still resulted in:

Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]: Thread 7 is about to kill the process.
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:  >Thread 1 : id=0xffffb3345010 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=0 rqsz=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cpu_ns: poll=266172273915 now=268250921953 diff=2078648038
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              curr_task=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              call trace(8):
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace663e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xffffb3ca95b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:  >Thread 2 : id=0xffffb3339a40 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cpu_ns: poll=265233893445 now=267217985191 diff=1984091746
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              curr_task=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              call trace(8):
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace663e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xffffb3ca95b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace6303f0 [fc 03 00 aa c0 e8 ff b5]: listener_accept+0x588/0x2214
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace4e7bac [eb 37 40 f9 ec 47 40 f9]: main+0x35e4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:  >Thread 3 : id=0xffffb0c9ea40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=16 rqsz=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cpu_ns: poll=273532692798 now=275608355079 diff=2075662281
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              curr_task=0xaaaae98a8e40 (task) calls=1 last=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:                fct=0xaaaace56d4b0(process_stream) ctx=0xffff98708c60
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              strm=0xffff98708c60 src=64.63.3.209 fe=lb be=fes dst=fes-44
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              af=(nil),0 csf=0xffffac08ddf0,104000
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              ab=(nil),0 csb=(nil),0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cof=0xffff9c14a350,80001300:H1(0xffffac83c820)/RAW((nil))/tcpv4(37309)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              call trace(12):
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace663e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xffffb3ca95b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a2e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace56dd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:  >Thread 4 : id=0xffffa739ea40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=6 rqsz=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cpu_ns: poll=267911270133 now=269980205392 diff=2068935259
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              curr_task=0xffff8436bdb0 (task) calls=1 last=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:                fct=0xaaaace56d4b0(process_stream) ctx=0xffff8428dc80
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              strm=0xffff8428dc80 src=37.252.172.141 fe=lb be=fes dst=fes-21
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              rqf=4d00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000030
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              af=(nil),0 csf=0xffff90de8ef0,108000
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              ab=(nil),0 csb=(nil),0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cof=0xffff8c718ca0,80001300:H1(0xffff842037c0)/RAW((nil))/tcpv4(29018)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              call trace(12):
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace663e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xffffb3ca95b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a2e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace56dd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:  >Thread 5 : id=0xffffa6b9da40 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cpu_ns: poll=267770911673 now=269801199367 diff=2030287694
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              curr_task=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              call trace(8):
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace663e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xffffb3ca95b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace6303f0 [fc 03 00 aa c0 e8 ff b5]: listener_accept+0x588/0x2214
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace4e7bac [eb 37 40 f9 ec 47 40 f9]: main+0x35e4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:  >Thread 6 : id=0xffffa639ca40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=6 rqsz=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cpu_ns: poll=281133305496 now=283188100537 diff=2054795041
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              curr_task=0xffff8c91fae0 (task) calls=1 last=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:                fct=0xaaaace56d4b0(process_stream) ctx=0xffff9d5ecfd0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              strm=0xffff9d5ecfd0 src=85.91.47.59 fe=lb be=fes dst=fes-66
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              af=(nil),0 csf=0xffff8c783c00,104000
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              ab=(nil),0 csb=(nil),0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cof=0xaaaae8aeecb0,80001300:H1(0xffff90426de0)/RAW((nil))/tcpv4(32486)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              call trace(12):
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace663e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xffffb3ca95b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a2e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace56dd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]: *>Thread 7 : id=0xffffa5b9ba40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=9 rqsz=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cpu_ns: poll=270985613973 now=273066982537 diff=2081368564
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              curr_task=0xffff8c6d2000 (task) calls=1 last=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:                fct=0xaaaace56d4b0(process_stream) ctx=0xffff9c163b70
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              strm=0xffff9c163b70 src=185.86.136.108 fe=lb be=fes dst=fes-27
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              af=(nil),0 csf=0xffff9072ef50,104000
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              ab=(nil),0 csb=(nil),0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cof=0xffff8c9ebf70,80001300:H1(0xaaaae80f0100)/RAW((nil))/tcpv4(23037)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              call trace(18):
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace6a2ea8 [f5 ff ff 17 c0 b2 00 91]: wdt_handler+0x158/0x194
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xffffb3ca95b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace57e1cc [53 0e 00 90 42 0e 00 90]: main+0x99c04
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace57e1cc [53 0e 00 90 42 0e 00 90]: main+0x99c04
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace59bcf0 [20 07 00 34 e4 f2 40 f9]: main+0xb7728
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a2e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace56dd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:  >Thread 8 : id=0xffffa539aa40 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=4 rqsz=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              stuck=1 prof=0 harmless=0 wantrdv=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cpu_ns: poll=257534531142 now=259604152808 diff=2069621666
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              curr_task=0xaaaae844e2f0 (task) calls=1 last=0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:                fct=0xaaaace56d4b0(process_stream) ctx=0xffff84b15ee0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              strm=0xffff84b15ee0 src=34.76.9.162 fe=lb_ssl be=fes dst=fes-20
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              af=(nil),0 csf=0xffff9caa5530,104000
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              ab=(nil),0 csb=(nil),0
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cof=0xffff9c85adf0,80001300:H1(0xffff90076ea0)/SSL(0xffff91143fe0)/tcpv4(13474)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              call trace(12):
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace663e0c [00 07 00 f0 63 a2 23 91]: debug_handler+0x84/0x15c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xffffb3ca95b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a0da8 [24 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27b4
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace5a2e2c [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Nov 24 16:40:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[2999]:              | 0xaaaace56dd38 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cbc

So we know none of that stuff made a significant difference. Next I'll go back to figuring out sampled logging.
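
(For reference, log sampling in haproxy 2.x is enabled on the log directive itself; the 1-in-10 ratio below is purely illustrative.)

global
  log /dev/log sample 1:10 local0 info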

With log sampling, I do see a gap that matches the deadlock time:

Nov 24 17:36:26 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[20835]: 18.156.172.151:42822 [24/Nov/2020:17:36:26.461] lb_ssl~ fes/fes-7 0/0/0/2/2 204 95 - - ---- 29074/4536/121/1/0 0/0 "POST /bid/at HTTP/1.1"
Nov 24 17:36:28 fes-lb-eu-west-1-ip-10-3-22-98 haproxy[20835]: Thread 5 is about to kill the process.

We had multiple log lines per second (10 or so) before that gap, so it definitely was not processing anything for the full two seconds.

Thanks a lot for these new outputs. So they confirm that the process really gets stuck for more than two seconds. Could you please provide a gdb backtrace of one of these crashed processes so that we have more detailed context about how the problem happens? I still can't imagine we're stuck that long in a syscall, so all that remains is a bug in haproxy causing us to loop, and I have no idea how this is possible.

Regarding ssl-async-mode, it's unrelated. This one is used only when you load an external crypto engine. It was a good idea to try tcp-smart-connect, even if it did not help, it can help us eliminate some candidate code paths.

I'm seeing two places in connect_server() where we could possibly spend some time. One of them is the list_for_each_entry() loop looking for a connection, but for it to take a lot of time it would mean having millions of idle connections, which sounds strange. The other one is the attempt to kill excess connections later in the function, which is done under the idle_conn lock, where we could possibly be waiting for another thread to finish something; that also sounds strange. And of course there's the si_connect() call.
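
(For anyone collecting the requested backtrace: with systemd-coredump in place, opening one of these cores typically looks like the following; the match string and paths are illustrative.)

$ coredumpctl gdb haproxy          # or: gdb /usr/sbin/haproxy /path/to/core
(gdb) bt full
(gdb) info threads
(gdb) thread apply all bt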

I do have one interesting bit of information (I'll try to get a gdb backtrace as requested): I just switched one machine from ARM to Intel, and so far I've seen no crashes in ~12 hours of running.

I checked this morning and a random arm box has crashed 3 times in the last 3 hours, so this seems pretty significant. I want to give it another 3-4 hours just to be sure I wasn't getting lucky, but this may be related to some kind of arm vs intel thing.

It would be rather surprising but could be related to the libc's implementation of the syscalls, or an issue in the kernel. I do use ARM quite a bit as well and never encountered that case yet.

OK, I have a bunch of coredumps. I'm not an expert gdb user, so I might need guidance, but here's a quick look:

(No debugging symbols found in /usr/sbin/haproxy)
[New LWP 2024574]
[New LWP 2024575]
[New LWP 2024577]
[New LWP 2024576]
[New LWP 2024578]
[New LWP 2024580]
[New LWP 2024579]
[New LWP 2024581]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/atomics/libthread_db.so.1".
Core was generated by `/usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -S /run/h'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0xffffb1fb5010 (LWP 2024574))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000ffffb21f1d54 in __GI_abort () at abort.c:79
#2  0x0000aaaad22c31ac in ha_panic ()
#3  0x0000aaaad2303ea8 in wdt_handler ()
#4  <signal handler called>
#5  0x0000aaaad22aac98 in process_runnable_tasks ()
#6  0x0000aaaad227e748 in run_poll_loop ()
#7  0x0000aaaad227eb98 in ?? ()
#8  0x0000aaaad2145f0c in main ()
(gdb) info threads
  Id   Target Id                           Frame
* 1    Thread 0xffffb1fb5010 (LWP 2024574) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
  2    Thread 0xffffb1fa9a40 (LWP 2024575) 0x0000aaaad2292d0c in listener_accept ()
  3    Thread 0xffffa67cba40 (LWP 2024577) 0x0000aaaad2292d08 in listener_accept ()
  4    Thread 0xffffa739ea40 (LWP 2024576) 0x0000aaaad2292d0c in listener_accept ()
  5    Thread 0xffffa5fcaa40 (LWP 2024578) 0x0000aaaad2292d0c in listener_accept ()
  6    Thread 0xffffa4fc8a40 (LWP 2024580) 0x0000aaaad2292d08 in listener_accept ()
  7    Thread 0xffffa57c9a40 (LWP 2024579) 0x0000aaaad2292d0c in listener_accept ()
  8    Thread 0xffffa47c7a40 (LWP 2024581) 0x0000aaaad22a9244 in task_kill ()
(gdb) thread 8
[Switching to thread 8 (Thread 0xffffa47c7a40 (LWP 2024581))]
#0  0x0000aaaad22a9244 in task_kill ()
(gdb) bt
#0  0x0000aaaad22a9244 in task_kill ()
#1  0x0000aaaad21df1cc in ?? ()
#2  0x0000aaaad21fccf0 in ?? ()
#3  0x0000aaaad2201da8 in connect_server ()
#4  0x0000aaaad2203e2c in back_try_conn_req ()
#5  0x0000aaaad21ced38 in process_stream ()
#6  0x0000aaaad22aa238 in run_tasks_from_lists ()
#7  0x0000aaaad22aad10 in process_runnable_tasks ()
#8  0x0000aaaad227e748 in run_poll_loop ()
#9  0x0000aaaad227eb98 in ?? ()
#10 0x0000ffffb28233c8 in start_thread (arg=0xffffeac27fef) at pthread_create.c:477
#11 0x0000ffffb22a0f0c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

(I picked thread 8 as the only other one not in listener_accept())

Happy to do whatever is useful now, including just sending these coredumps to you somehow, I have several of them locally at this point.

Lack of symbols probably makes this unhelpful; I tried to get more info:

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {8192, 187650648207360, 65540, 281473667782416, 0, 1, 0, 281473677592728, 18446744073709551615, 0,
            16384, 1, 281474976710655, 281474620353120, 281473670126852, 281473671466776}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x0000ffffb21f1d54 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 0,
              18446744067267100671, 18446744073709551615, 18446744073709551615, 18446744073709551615,
              18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615,
              18446744073709551615, 18446744073709551615, 18446744073709551615, 281474620353440, 187650647273896,
              281473667792216}}, sa_flags = 0, sa_restorer = 0xe}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x0000aaaad22c31ac in ha_panic ()
No symbol table info available.
#3  0x0000aaaad2303ea8 in wdt_handler ()
No symbol table info available.
#4  <signal handler called>
No symbol table info available.
#5  0x0000aaaad22aac98 in process_runnable_tasks ()
No symbol table info available.
#6  0x0000aaaad227e748 in run_poll_loop ()
No symbol table info available.
#7  0x0000aaaad227eb98 in ?? ()
No symbol table info available.
#8  0x0000aaaad2145f0c in main ()
No symbol table info available.

The one dying in process_runnable_tasks() is amazing, because that's precisely the one that resets the watchdog in the loop, so by definition, unless it's stuck looping forever on a corrupted tree, it should not die there. We've recently found and addressed a few very nasty bugs that could cause random memory corruption, having such behaviors as possible side effects. So I'd be interested in having you try the very latest 2.3-maint from git to be certain we're not all stupidly wasting our time on something already fixed.

Another point that comes to mind: these aarch64 machines probably use the Graviton2 processor? Could you please post the output of cat /proc/cpuinfo so that we can see the exact families, models and architecture levels? It would be totally possible that they are less forgiving than the usual Cortex and that we're missing a barrier here and there. x86 is the worst to develop on because it works perfectly all the time and covers your bugs. This is also why I use ARM a lot. But Cortex A72 typically are pretty tolerant as well and maybe others are not. A test could be to run with "nbproc 16" and thus replace threads by processes. If the problem reliably disappears, it could indicate we're having an issue there.
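
(As a sketch, the process-based test suggested above boils down to this kind of global section; CPU pinning and other tuning are omitted.)

global
  nbproc 16
  # no nbthread here: each of the 16 processes runs single-threaded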

For sure on the /proc/cpuinfo front:

root@fes-lb-eu-west-1-ip-10-3-21-186:/var/lib/systemd/coredump# cat /proc/cpuinfo
processor       : 0
BogoMIPS        : 243.75
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1

processor       : 1
BogoMIPS        : 243.75
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1

(There are a lot more cores, but all identical.) For next steps I'll figure out how to get the ppa we use (https://launchpad.net/~vbernat/+archive/ubuntu/haproxy-2.3) rebuilt with the latest SHA from git so we're identically configured, but on newer code!

Very interesting, I thought it was an in-house CPU but no, it's a Neoverse N1 core from ARM (in short an upgraded Cortex A76, using ARMv8.2A). Now I know what I want for Xmas :-) I've never run tests further than ARMv8.0 (Cortex A72), so I cannot completely rule out the possibility that we did something wrong (especially with the loop in process_runnable_tasks). By the way, when you rebuild, please add -g to CFLAGS and LDFLAGS (or just don't force them, it's on by default) so that we have the debugging symbols.
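
(For reference, a build straight from the maintenance branch with symbols kept looks roughly like this; the repository URL and make options are assumptions to adapt to your packaging.)

$ git clone http://git.haproxy.org/git/haproxy-2.3.git
$ cd haproxy-2.3
$ make -j"$(nproc)" TARGET=linux-glibc USE_OPENSSL=1 USE_THREAD=1
$ ./haproxy -vv        # check the reported commit and build options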

Thanks!

In the meantime, I realized the ppa does have the debug symbols as a package which means... (from a different node than before):

Nov 25 12:29:18 fes-lb-eu-west-1-ip-10-3-20-7 haproxy[2001792]: Thread 7 is about to kill the process.

[Switching to thread 7 (Thread 0xffffaf2e1a40 (LWP 2001796))]
#0  0x0000aaaabba6223c in task_kill (t=0xffff600904b0) at src/task.c:98
98      src/task.c: No such file or directory.
(gdb) bt
#0  0x0000aaaabba6223c in task_kill (t=0xffff600904b0) at src/task.c:98
#1  0x0000aaaabb9981cc in h1_takeover (conn=<optimized out>, orig_tid=<optimized out>) at src/mux_h1.c:2985
#2  0x0000aaaabb9b5cf0 in conn_backend_get (srv=srv@entry=0xaaaaeeda9d50, is_safe=0, s=<optimized out>, s=<optimized out>)
    at src/backend.c:1177
#3  0x0000aaaabb9bada8 in connect_server (s=s@entry=0xffffb60ba590) at src/backend.c:1302
#4  0x0000aaaabb9bce2c in back_try_conn_req (s=0xffffb60ba590) at src/backend.c:1771
#5  0x0000aaaabb987d38 in process_stream (t=0xaaaaefdf64b0, context=0xffffb60ba590, state=<optimized out>)
    at src/stream.c:2101
#6  0x0000aaaabba63238 in run_tasks_from_lists (budgets=budgets@entry=0xffffaf2e1118) at src/task.c:493
#7  0x0000aaaabba63d10 in process_runnable_tasks () at src/task.c:689
#8  0x0000aaaabba37748 in run_poll_loop () at src/haproxy.c:2849
#9  0x0000aaaabba37b98 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3014
#10 0x0000ffffbbaa93c8 in start_thread (arg=0xffffc966234f) at pthread_create.c:477
#11 0x0000ffffbb526f0c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) info threads
  Id   Target Id                           Frame
  1    Thread 0xffffae2dfa40 (LWP 2001798) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
  2    Thread 0xffffafae2a40 (LWP 2001795) conn_backend_get (srv=srv@entry=0xaaaaeedb1ac0, is_safe=<optimized out>,
    s=<optimized out>, s=<optimized out>) at src/backend.c:1164
  3    Thread 0xffffbb23b010 (LWP 2001792) 0x0000aaaabba63c9c in process_runnable_tasks () at src/task.c:681
  4    Thread 0xffffbb22fa40 (LWP 2001793) 0x0000aaaabba4bd0c in tasklet_wakeup_on (thr=0, tl=0xaaaaeef74be0)
    at include/haproxy/task.h:354
  5    Thread 0xffffb02e3a40 (LWP 2001794) 0x0000aaaabba4bd0c in tasklet_wakeup_on (thr=0, tl=0xaaaaeef74be0)
    at include/haproxy/task.h:354
  6    Thread 0xffffaeae0a40 (LWP 2001797) 0x0000aaaabba4bd0c in tasklet_wakeup_on (thr=0, tl=0xaaaaeef74be0)
    at include/haproxy/task.h:354
* 7    Thread 0xffffaf2e1a40 (LWP 2001796) 0x0000aaaabba6223c in task_kill (t=0xffff600904b0) at src/task.c:98
  8    Thread 0xffffadadea40 (LWP 2001799) tasklet_wakeup_on (thr=0, tl=0xaaaaeef74be0) at include/haproxy/task.h:354
(gdb) bt full
#0  0x0000aaaabba6223c in task_kill (t=0xffff600904b0) at src/task.c:98
        n = <optimized out>
        p = <optimized out>
        _ret = 0
        lh = 0xaaaabbc035a0 <task_per_thread+32>
        el = 0xffff600904c8
        state = <optimized out>
        thr = 0
#1  0x0000aaaabb9981cc in h1_takeover (conn=<optimized out>, orig_tid=<optimized out>) at src/mux_h1.c:2985
        h1c = 0xffff91bb5200
        task = <optimized out>
#2  0x0000aaaabb9b5cf0 in conn_backend_get (srv=srv@entry=0xaaaaeeda9d50, is_safe=0, s=<optimized out>, s=<optimized out>)
    at src/backend.c:1177
        elt1 = 0xffffa6646508
        elt2 = {next = 0xaaaaf1b056c8, prev = 0xaaaaeeef4860}
        mt_list = <optimized out>
        conn = 0xffffa66464c0
        i = 0
        found = 0
        stop = 0
#3  0x0000aaaabb9bada8 in connect_server (s=s@entry=0xffffb60ba590) at src/backend.c:1302
        cli_conn = 0xaaaaf06b2b90
        srv_conn = 0x0
        srv_cs = 0x0
        srv = 0xaaaaeeda9d50
        reuse = 0
        init_mux = 0
        err = <optimized out>
#4  0x0000aaaabb9bce2c in back_try_conn_req (s=0xffffb60ba590) at src/backend.c:1771
        conn_err = <optimized out>
        srv = 0xaaaaeeda9d50
        si = 0xffffb60ba890
        req = 0xffffb60ba5a0
#5  0x0000aaaabb987d38 in process_stream (t=0xaaaaefdf64b0, context=0xffffb60ba590, state=<optimized out>)
    at src/stream.c:2101
        srv = <optimized out>
        s = 0xffffb60ba590
        sess = 0xffff88d8e280
        rqf_last = <optimized out>
        rpf_last = 2147483648
        rq_prod_last = 8
        rq_cons_last = 0
        rp_cons_last = 8
        rp_prod_last = 0
        req_ana_back = <optimized out>
        req = 0xffffb60ba5a0
        res = 0xffffb60ba600
        si_f = 0xffffb60ba838
        si_b = 0xffffb60ba890
        rate = <optimized out>
#6  0x0000aaaabba63238 in run_tasks_from_lists (budgets=budgets@entry=0xffffaf2e1118) at src/task.c:493
        process = <optimized out>
        tl_queues = 0xaaaabbc037b0 <task_per_thread+560>
        t = 0xaaaaefdf64b0
        budget_mask = <optimized out>
        done = <optimized out>
        queue = <optimized out>
        state = <optimized out>
        ctx = <optimized out>
#7  0x0000aaaabba63d10 in process_runnable_tasks () at src/task.c:689
        tt = 0xaaaabbc03780 <task_per_thread+512>
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 197, 0}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = 3
        max_processed = 198
#8  0x0000aaaabba37748 in run_poll_loop () at src/haproxy.c:2849
        next = <optimized out>
        wake = <optimized out>
#9  0x0000aaaabba37b98 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3014
        ptaf = <optimized out>
        ptif = <optimized out>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {
              __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 47 times>, __align = 0}
        init_cond = {__data = {{__wseq = 47, __wseq32 = {__low = 47, __high = 0}}, {__g1_start = 33, __g1_start32 = {
                __low = 33, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 28, __wrefs = 0,
            __g_signals = {0, 0}},
          __size = "/\000\000\000\000\000\000\000!", '\000' <repeats 23 times>, "\034", '\000' <repeats 14 times>,
          __align = 47}
#10 0x0000ffffbbaa93c8 in start_thread (arg=0xffffc966234f) at pthread_create.c:477
        ret = <optimized out>
        pd = 0xffffc966234f
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281473620777536, 281474060657488, 281474060657486, 281474060657487,
                281473821421328, 281473830416384, 4096, 281474060657488, 281473620775744, 281473620777536,
                281473620775472, 4240161074016231900, 0, 4240161074225717560, 0, 0, 0, 0, 0, 0, 0, 0},
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
              canceltype = 0}}}
        not_first_call = <optimized out>
#11 0x0000ffffbb526f0c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
No locals.

There's a catch: the thread numbers seen by gdb do not match the thread numbers in the process (that's one of the pain points that don't help debugging). It's possible to figure them out by entering the thread and issuing p tid. But here it's not needed. Gdb shows that thread 1 is the one that raised the signal, hence it's the dying one. So you'll need to switch to thread 1 and issue bt full again.
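
(In practice that mapping looks like this; the printed value is illustrative.)

(gdb) thread 7
[Switching to thread 7 (Thread 0xffffaf2e1a40 (LWP 2001796))]
(gdb) p tid
$1 = 4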

Aha, I didn't realize that; that definitely explains why I didn't see "signal handler called", which I expected to see!

(gdb) thread 1
[Switching to thread 1 (Thread 0xffffae2dfa40 (LWP 2001798))]
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {18446744067195533847, 0, 552678054096, 281473032161008, 0, 0, 105, 0, 18446744073709551615, 0,
            281473827559540, 281473094267792, 4294967295, 281473603983648, 281473823767812, 281473825107736}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x0000ffffbb477d54 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x2c323530322c3435, sa_sigaction = 0x2c323530322c3435}, sa_mask = {
            __val = {3834307298121299761, 3544670646715363377, 18446744067267100671, 18446744073709551615,
              18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615,
              18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615,
              18446744073709551615, 281473603983968, 187650269495720, 281473604003720}}, sa_flags = 24, sa_restorer = 0xe}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x0000aaaabba7c1ac in ha_panic () at src/debug.c:271
No locals.
#3  0x0000aaaabbabcea8 in wdt_handler (sig=14, si=<optimized out>, arg=<optimized out>) at src/wdt.c:119
        n = <optimized out>
        p = <optimized out>
        thr = <optimized out>
#4  <signal handler called>
No symbol table info available.
#5  conn_backend_get (srv=srv@entry=0xaaaaeed92c50, is_safe=<optimized out>, s=<optimized out>, s=<optimized out>)
    at src/backend.c:1164
        __pl_i = <optimized out>
        __pl_r = <optimized out>
        elt1 = <optimized out>
        elt2 = <optimized out>
        mt_list = <optimized out>
        conn = <optimized out>
        i = 0
        found = 0
        stop = 5
#6  0x0000aaaabb9bada8 in connect_server (s=s@entry=0xffffa8b35b80) at src/backend.c:1302
        cli_conn = 0xffffa81868f0
        srv_conn = 0x0
        srv_cs = 0x0
        srv = 0xaaaaeed92c50
        reuse = 0
        init_mux = 0
        err = <optimized out>
#7  0x0000aaaabb9bce2c in back_try_conn_req (s=0xffffa8b35b80) at src/backend.c:1771
        conn_err = <optimized out>
        srv = 0xaaaaeed92c50
        si = 0xffffa8b35e80
        req = 0xffffa8b35b90
#8  0x0000aaaabb987d38 in process_stream (t=0xffffa3561680, context=0xffffa8b35b80, state=<optimized out>)
    at src/stream.c:2101
        srv = <optimized out>
        s = 0xffffa8b35b80
        sess = 0xffff7c443db0
        rqf_last = <optimized out>
        rpf_last = 2147483648
        rq_prod_last = 8
        rq_cons_last = 0
        rp_cons_last = 8
        rp_prod_last = 0
        req_ana_back = <optimized out>
        req = 0xffffa8b35b90
        res = 0xffffa8b35bf0
        si_f = 0xffffa8b35e28
        si_b = 0xffffa8b35e80
        rate = <optimized out>
#9  0x0000aaaabba63238 in run_tasks_from_lists (budgets=budgets@entry=0xffffae2df118) at src/task.c:493
        process = <optimized out>
        tl_queues = 0xaaaabbc038b0 <task_per_thread+816>
        t = 0xffffa3561680
        budget_mask = <optimized out>
        done = <optimized out>
        queue = <optimized out>
        state = <optimized out>
        ctx = <optimized out>
#10 0x0000aaaabba63d10 in process_runnable_tasks () at src/task.c:689
        tt = 0xaaaabbc03880 <task_per_thread+768>
        lrq = <optimized out>
        grq = <optimized out>
        t = <optimized out>
        max = {0, 198, 0}
        max_total = <optimized out>
        tmp_list = <optimized out>
        queue = 3
        max_processed = 199
#11 0x0000aaaabba37748 in run_poll_loop () at src/haproxy.c:2849
        next = <optimized out>
        wake = <optimized out>
#12 0x0000aaaabba37b98 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3014
        ptaf = <optimized out>
        ptif = <optimized out>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {
              __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 47 times>, __align = 0}
        init_cond = {__data = {{__wseq = 47, __wseq32 = {__low = 47, __high = 0}}, {__g1_start = 33, __g1_start32 = {
                __low = 33, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 28, __wrefs = 0,
            __g_signals = {0, 0}},
          __size = "/\000\000\000\000\000\000\000!", '\000' <repeats 23 times>, "\034", '\000' <repeats 14 times>,
          __align = 47}
#13 0x0000ffffbbaa93c8 in start_thread (arg=0xffffc966234f) at pthread_create.c:477
        ret = <optimized out>
        pd = 0xffffc966234f
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281473603992128, 281474060657488, 281474060657486, 281474060657487,
                281473821421328, 281473830416384, 4096, 281474060657488, 281473603990336, 281473603992128,
                281473603990064, 4240161074016231900, 0, 4240161074208981304, 0, 0, 0, 0, 0, 0, 0, 0},
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
              canceltype = 0}}}
        not_first_call = <optimized out>
#14 0x0000ffffbb526f0c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Ah, that's getting better! So basically we're waiting on the takeover_lock that's currently held by the other thread you initially dumped (7), which itself is waiting on an MT_LIST_ADDQ(). And everyone else is waiting on MT_LIST_TRY_ADDQ(), which should normally not wait, or MT_LIST_BEHEAD(), which should be fast. So we're definitely observing a deadlock here. I remember that Olivier had spotted a possible barrier issue in the MT_LIST code a while ago and this could possibly be just this one! @cognet what do you think?

IIRC it was not a barrier issue, but there's a small race condition in what is now called MT_LIST_TRY_ADDQ/MT_LIST_TRY_ADD, because nothing prevents something from adding the element between the time we check whether it is already in a list and the time we add it. But that would only happen if we add those elements to different lists, and given the usage of MT_LIST_TRY*, it's probably not the issue here.
If it only happens on ARM, it definitely sounds like a barrier issue, though.

OK thanks. I've just found that in commit 3b8f9b7 I fixed a missing barrier issue that was extremely difficult to trigger and mentioned that I suspected another missing one that I couldn't demonstrably claim was missing. I think I'll have to recheck that. In the worst case I could propose a tentative patch adding barriers on each line for a test. It doesn't hurt.

Ah hmm, there are multiple lists nowadays, so it MAY be an issue. I updated my patch; it's the fixed version of the one you backed out, so it should be fine, but I only tested it very lightly.
0001-BUG-MEDIUM-lists-Lock-the-element-while-we-check-if-.txt

Cool, anyway with Ashley we'll soon have an idea :-)

I won't be able to apply this today (and possibly tomorrow, Thanksgiving interfering), but I'll try to get it applied on Friday (or Saturday at the latest) and we'll see if this helps! I want to say thank you to both of you (@cognet and @wtarreau), regardless of whether this works, for helping on this issue!

Great, thank you! Olivier, I think that you ought not to have removed the two store barriers before the break statements; they were the ones that fixed the crashes for me on the octo-A53. And if you look closely you'll notice that you're using the barrier-less _HA_ATOMIC_XCHG() and that you'll need a barrier there at the end. So I would suggest re-adding the _ha_barrier_store() before the two breaks. But you're the expert here, so if you're certain I'll trust you (but will probably need to retest :-))

Yep, you're probably right; updated version with the barrier added:
0001-BUG-MEDIUM-lists-Lock-the-element-while-we-check-if-.txt

Thanks! I'll try to find some time to run it on my octo-A53 and the quad A72 tomorrow, since at least the former was very sensitive to this code path. If it doesn't change it will be a good hint already!

@wtarreau if you'd like access to a Graviton2 system please let me know. I'm happy to also lend my experience with barrier placement.

There might be a barrier missing as @wtarreau has postulated. The core in graviton2 is much more aggressive than the A57/72 generation about re-ordering memory operations.

Couple of quick comments which probably aren't the root cause, but might be something to look at. The pl_cpu_relax() is empty for everything but x86. In other projects (e.g. MySQL) we've found putting an isb there is a reasonable substitute for the behavior of x86 pause or rep; nop. The exclusive atomics aren't fair, while the LSE atomics are. Under high lock contention we've seen one thread retain a lock for a long time when using the load/store exclusive operations.

We should try compiling this with -moutline-atomics, which is the default with GCC 10. That will take the atomics and put a runtime check for load/store exclusives vs. atomics. For something that only supports ARMv8.1 and newer, compiling with -march=armv8.1-a will just use the LSE atomics, but not be backward compatible. Even then, the one thing that would still be using ldxp/stxp is the implementation of __ha_cas_dw(). I can try to come up with an LSE implementation of that, but it too would require either compiling for armv8.1-a in this case or a runtime check.

Hi @AGSaidi !
Thanks for the offer, I'd say "why not". In parallel I'll check with HaproxyTech if I can have an always-on c6g available for testing. A Spot instance shouldn't be expensive and should be available often enough for most use cases.

Regarding the use of isb, that's an excellent idea, it should significantly reduce the cross-CPU noise caused by periodic polling of the lock! I can even test that on my 8-a53, 4-a72 and 4-a55. Also, given that this is available since ARMv7, for us it's trivial to add at compile time.
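
(A minimal sketch of what that could look like; the macro name comes from the discussion above, the exact feature guard is an assumption.)

/* spin-wait hint: on ARMv7-A/aarch64 an isb stalls the front-end briefly,
 * which reduces pressure on the exclusive monitor while polling a lock */
#if defined(__aarch64__) || defined(__ARM_ARCH_7A__)
#define pl_cpu_relax() do { __asm__ volatile("isb" ::: "memory"); } while (0)
#else
#define pl_cpu_relax() do { } while (0)
#endif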

Regarding the new atomics, both @cognet and myself have been interested in writing a new implementation for them. We've even bought a pair of A55s just for the sake of testing the new CAS instruction, but the benefits were not obvious on only 4 cores, as you can imagine, so for now we've made no progress in this direction. However, with 16 threads it could definitely change.

We're not interested in runtime checks, I'd rather use a build-time one. The cost of using a function call for a lock is huge, especially in all situations where it's not needed (i.e. single-threaded) or when it's almost guaranteed to be acquired (such as a few list insertions which are only used for the rare cases of an admin dumping the session table). Could you please post the output of gcc -march=native -dM -E -xc -</dev/null on your Graviton2 ? I guess you'll have __ARM_FEATURE_ATOMICS set, which will be enough to switch to the newer implementation.

Hi @wtarreau!

Around 16 threads it seems to start to make a lot of difference and by 64 it's very noticeable.

The runtime check I'm referring to isn't quite as bad as you might first think. Something like the following is what I had in mind; it gets optimized into one very predictable branch. In most cases I've had trouble seeing performance differences between it and a compile-time check.

#include <sys/auxv.h>   /* getauxval(), HWCAP_ATOMICS */

int hasLSE;

void init(void)
{
     /* detect LSE support once at startup from the kernel's hwcaps */
     hasLSE = !!(getauxval(AT_HWCAP) & HWCAP_ATOMICS);
     ...
}

int swp(int *a, int val)
{
    if (hasLSE) { /* LSE impl */ } else { /* exclusives impl */ }
}

I've attached the output of the preprocessor defines, but yes, as you expect __ARM_FEATURE_ATOMICS is set and the __sync_* and __atomic_* built-ins change to their LSE versions.

gcc-9-graviton2-native-preprocessor-defines.txt

I'm not surprised it makes such a huge difference. While I love the semantics of LL/SC which allow to perform complex operations at once, I also understand why it cannot scale when the contention is high: everyone starts an operation that it cannot complete if it takes too long. In such a case, exponential backoff is a must but can add unfairness in non-uniform core clusters.
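
(For illustration, a self-contained sketch of the exponential backoff idea around a CAS retry loop; the names are invented and the cap is arbitrary.)

#include <stdatomic.h>

static inline void cpu_relax(void) { /* e.g. pause/isb hint, see above */ }

/* spin until the lock is acquired, doubling the wait between attempts */
static void lock_with_backoff(_Atomic int *lock)
{
        unsigned backoff = 1;
        int expected = 0;

        while (!atomic_compare_exchange_weak(lock, &expected, 1)) {
                for (unsigned i = 0; i < backoff; i++)
                        cpu_relax();
                if (backoff < 1024)
                        backoff <<= 1;  /* exponential growth, capped to bound latency */
                expected = 0;           /* CAS failure overwrote it with the current value */
        }
}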

Regarding the test, I'm still not fond of adding an extra if inside each and every CAS for the reason I explained previously (some having little to no contention). However at some point once ARMv8.1-a CPUs become prevalent, we could imagine turning the old ones to this approach. That's a principle we've always had in haproxy: what is relevant needs to be the absolute fastest, what's legacy needs to be compatible and acceptably fast. This helps distros keep the best of both worlds by having the same binary work slightly slower on legacy hardware and still a bit faster than legacy on newer one.

Guys, first some good news: after getting access to a c6g instance, I managed to reproduce a crash after 3 hours of hammering 16 threads, and it died here:

#5  tasklet_wakeup_on (thr=3, tl=0xaaaae2fa1440) at include/haproxy/task.h:354
354                     if (MT_LIST_TRY_ADDQ(&task_per_thread[thr].shared_tasklet_list, (struct mt_list *)&tl->list) == 1) {

The less good news is that after applying Olivier's second patch, it died after 49s here:

#5  0x0000aaaaeaab1c8c in dequeue_all_listeners () at src/listener.c:510
510             while ((listener = MT_LIST_POP(&global_listener_queue, struct listener *, wait_queue))) {

It's worth noting that I initially suspected another missing barrier in MT_LIST_POP, which died in the first entry of the loop, but it's unclear to me if it's really needed given that we're exchanging MT_LIST_BUSY with itself when this happens:

   0x0000aaaaeaab1c88 <+32>:    ldxr    x0, [x19]
=> 0x0000aaaaeaab1c8c <+36>:    stxr    w1, x20, [x19]
   0x0000aaaaeaab1c90 <+40>:    cbnz    w1, 0xaaaaeaab1c88 <dequeue_all_listeners+32>
   0x0000aaaaeaab1c94 <+44>:    cmp     x0, #0x1
   0x0000aaaaeaab1c98 <+48>:    b.eq    0xaaaaeaab1c88 <dequeue_all_listeners+32>  // b.none
   0x0000aaaaeaab1c9c <+52>:    cmp     x0, x19
   0x0000aaaaeaab1ca0 <+56>:    b.eq    0xaaaaeaab1d20 <dequeue_all_listeners+184>  // b.none

Now trying again with this:

diff --git a/include/haproxy/list.h b/include/haproxy/list.h
index a23a96857..7cd4691fd 100644
--- a/include/haproxy/list.h
+++ b/include/haproxy/list.h
@@ -508,8 +508,10 @@
                         struct mt_list *n, *n2;                           \
                         struct mt_list *p, *p2;                           \
                         n = _HA_ATOMIC_XCHG(&(lh)->next, MT_LIST_BUSY);   \
-                        if (n == MT_LIST_BUSY)                            \
+                        if (n == MT_LIST_BUSY) {                          \
+                                __ha_barrier_store();                     \
                                 continue;                                 \
+                        }                                                 \
                         if (n == (lh)) {                                  \
                                 (lh)->next = lh;                          \
                                 __ha_barrier_store();                     \

Unsure whether we'll need an __ha_barrier_load() before the first xchg though.

That's not enough, now we're crashing here:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __tasklet_remove_from_tasklet_list (t=0xaaaadce977b0) at include/haproxy/task.h:389
389             LIST_DEL_INIT(&t->list);
[Current thread is 1 (Thread 0xffff8b2e1240 (LWP 31367))]
(gdb) p *t
$1 = {{state = 0, nice = -32768, calls = 22223, process = 0xaaaac2ad2d90 <accept_queue_process>, context = 0xaaaac2c02ac0 <accept_queue_rings+115584>}, list = {n = 0x1, 
    p = 0xaaaac2c70b30 <task_per_thread+1840>}, tid = 14}

Looks like an MT_LIST_BUSY leaked there after an MT_LIST_DEL.

I've found that MT_LIST_UNLOCK_ELT and friends do not use any barrier after updating pointers, and these are used in the iterators, but this doesn't explain this crash since we don't use them for tasks/tasklets. I added some anyway and fired the test again.

So as expected that didn't help. Crashed after 7 minutes, staying stuck for 2 seconds here:

(gdb) 
#5  0x0000aaaad1eccd00 in process_runnable_tasks () at src/task.c:681
681             tmp_list = MT_LIST_BEHEAD(&tt->shared_tasklet_list);
(gdb) p tt->shared_tasklet_list 
$1 = {next = 0xaaaad204eaa0 <task_per_thread+1696>, prev = 0x1}

There's another thread on the same entry: tasklet_insert_into_tasklet_list():

[Switching to thread 11 (Thread 0xffffa0ae4240 (LWP 32105))]
#0  tasklet_insert_into_tasklet_list (tl=<optimized out>, list=<optimized out>) at include/haproxy/task.h:378
378             LIST_ADDQ(list, &tl->list);
(gdb) p tl->list
value has been optimized out
(gdb) disassemble 
   0x0000aaaad1eccc54 <+660>:   add     w2, w2, #0x1
   0x0000aaaad1eccc58 <+664>:   stxr    w3, w2, [x0]
   0x0000aaaad1eccc5c <+668>:   cbnz    w3, 0xaaaad1eccc50 <process_runnable_tasks+656>
=> 0x0000aaaad1eccc60 <+672>:   ldr     x0, [x20, #72]
   0x0000aaaad1eccc64 <+676>:   str     x0, [x21, #32]
   0x0000aaaad1eccc68 <+680>:   str     x1, [x20, #72]
   0x0000aaaad1eccc6c <+684>:   add     x2, x20, #0x40
(gdb) p/x $x20
$2 = 0xaaaad204e900
(gdb) p/x $x0
$4 = 0xaaaad2046814
(gdb) p/x $x21
$5 = 0xffff8c903190
(gdb) p/x $x1
$6 = 0xffff8c9031a8

So none of them is the same pointer (fortunately!). No other thread is touching these pointers, still we're seeing a BUSY stuck here.

Out of curiosity I tried to add load barriers before the first XCHG in the loops but it didn't change anything:

Core was generated by `./haproxy -db -f test-chain-lb50-thread.cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __tasklet_remove_from_tasklet_list (t=0xaaab20744530) at include/haproxy/task.h:389
389             LIST_DEL_INIT(&t->list);
[Current thread is 1 (Thread 0xffff85f16240 (LWP 32813))]
(gdb) p t->list
$1 = {n = 0x1, p = 0xaaaae6fe4730 <task_per_thread+816>}

Now trying using old _sync* API to try to narrow the problem down further.

It crashed again after 15 minutes but I couldn't get the core because the filesystem was full. So maybe we're still having too weak ordering between some pointers. Starting again gave the same trace as above, except the list entry is correct at the moment of the dump:

Thread 9 "haproxy" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffffeaae8240 (LWP 42572)]
0x0000aaaaaac0b664 in __tasklet_remove_from_tasklet_list (t=0xaaaaab06c5a0) at include/haproxy/task.h:389
389             LIST_DEL_INIT(&t->list);
(gdb) p t
$1 = (struct tasklet *) 0xaaaaab06c5a0
(gdb) p t->list
$2 = {n = 0xffffa55fd188, p = 0xaaaaaad8f7b0 <task_per_thread+944>}
(gdb) p t->list->n
$3 = (struct list *) 0xffffa55fd188
(gdb) p *t->list->n
$4 = {n = 0xffffa49ae538, p = 0xaaaaab06c5b8}
(gdb) p *t->list->p
$5 = {n = 0xaaaaab06c5b8, p = 0xffffab0bba78}

Given that the pointers were valid, it means they were caught bad for a short time (maybe the next/prev pointers were not properly written before the element was detached). Here we're clearly seeing that a BUSY was caught in list->n while unlinking the element:

   0x0000aaaaaac0b64c <+1740>:  stlxrh  w1, w13, [x7]
   0x0000aaaaaac0b650 <+1744>:  cbnz    w1, 0xaaaaaac0b640 <run_tasks_from_lists+1728>
   0x0000aaaaaac0b654 <+1748>:  dmb     ish
   0x0000aaaaaac0b658 <+1752>:  b.ne    0xaaaaaac0b63c <run_tasks_from_lists+1724>  // b.any
   0x0000aaaaaac0b65c <+1756>:  dmb     ishst
   0x0000aaaaaac0b660 <+1760>:  ldp     x1, x0, [x19]
=> 0x0000aaaaaac0b664 <+1764>:  str     x0, [x1, #8]
   0x0000aaaaaac0b668 <+1768>:  str     x1, [x0]
   0x0000aaaaaac0b66c <+1772>:  ldr     x0, [sp, #96]
   0x0000aaaaaac0b670 <+1776>:  stp     x19, x19, [x19]
   0x0000aaaaaac0b674 <+1780>:  ldxr    w1, [x0]
   0x0000aaaaaac0b678 <+1784>:  sub     w1, w1, #0x1
   0x0000aaaaaac0b67c <+1788>:  stlxr   w3, w1, [x0]
   0x0000aaaaaac0b680 <+1792>:  cbnz    w3, 0xaaaaaac0b674 <run_tasks_from_lists+1780>

(gdb) p/x $x0
$7 = 0xaaaaaad8f7b0
(gdb) p/x $x1
$8 = 0x1
(gdb) p/x $x19
$9 = 0xaaaaab06c5b8

This makes me think that what happened is that the neighbor's prev pointer was not flushed, and that a parallel thread walking over the list temporarily placed 1 in prev->next and modified this already unlinked element. Shouldn't we need more load barriers before reading the neighbors' pointers in MT_LIST_*?

Now trying again with some load barriers before the XCHG referencing a pointer learned from the previous step. I'll check the result later.

Did not help:

Core was generated by `./haproxy -D -f test-chain-lb50-thread.cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000aaaae32689e4 in __tasklet_remove_from_tasklet_list (t=0xaaaafc1f7620)
    at include/haproxy/task.h:389
389             LIST_DEL_INIT(&t->list);
[Current thread is 1 (Thread 0xffff8b45f240 (LWP 43301))]
(gdb) p t->list
$1 = {n = 0x1, p = 0xaaaae33ed8b0 <task_per_thread+1200>}

The thread sanitizers can sometimes help track these sorts of problems down.

I’ve not yet groked the code in question but I’m happy to try and help if you want to walk though the code with me.

Today I modified the code to tag each pointer so that we have different values in n->p and p->n. I crashed again and I got x1 (next) = 9 which is BUSY3 in my code. It's used only at a few places:

#define MT_LIST_TRY_ADD(_lh, _el)                                              \
		n2 = _HA_ATOMIC_XCHG(&(el)->next, MT_LIST_BUSY3);           \

#define MT_LIST_TRY_ADDQ(_lh, _el)                                             \
		n2 = _HA_ATOMIC_XCHG(&(el)->next, MT_LIST_BUSY3);           \

#define MT_LIST_DEL(_el)                                                   \
		        p2 = _HA_ATOMIC_XCHG(&p->next, MT_LIST_BUSY3);      \

#define MT_LIST_POP(_lh, pt, el)                                           \
			 n2 = _HA_ATOMIC_XCHG(&n->next, MT_LIST_BUSY3);     \

#define MT_LIST_LOCK_ELT(_el)                                              \
			        p2 = _HA_ATOMIC_XCHG(&p->next, MT_LIST_BUSY3);\

While MT_LIST_LOCK_ELT() fails to set any barrier on success, it's not used so it's not this one.

All other ones are using store barriers before breaking out of the loop. Thus my conclusion is that we're instead missing a load barrier before setting one of them and that we're reading the next that was set by a concurrent thread who lost the race. Going to try that now.

I now have a suspicion that I'm testing. I don't know if it's possible that an atomic xchg and a write to the same location are reordered (write after write), but if so then we could have the problem in the rollback code when the race is lost, because we could see the old value being written first, then the exclusive write finishing and overwriting the value with the busy tag. I've tried to adjust this with no success yet.

I'm wondering if we're not having a race between MT_LIST_BEHEAD and MT_LIST_ADDQ. The LIST_DEL_INIT in __tasklet_remove_from_tasklet_list operates on a detached list that cannot be shared. But what if MT_LIST_ADDQ unlocks the list's head too fast, letting MT_LIST_BEHEAD proceed? We could then have a detached list where a BUSY still remains, causing the subsequent LIST_DEL_INIT() to see a BUSY and crash. I'm now trying with tasklet_remove_from_tasklet_list() instead to confirm this hypothesis.

OK no more crashes this way, we're getting closer :-)

@wtarreau is there an assumption in MT_LIST_TRY_ADDQ() and MT_LIST_BEHEAD() that the atomic exchanges of the prev and next pointers will occur in order (e.g. between the following)? It appears as if this is the case, but _HA_ATOMIC_XCHG() uses ATOMIC_RELAXED ordering so there isn't anything guaranteeing the exchanges will happen in order. Perhaps a load barrier is required between them?

               p = _HA_ATOMIC_XCHG(&(lh)->prev, MT_LIST_BUSY);            \
                if (p == MT_LIST_BUSY)                                     \
                        continue;                                          \
                n = _HA_ATOMIC_XCHG(&p->next, MT_LIST_BUSY);               \

I initially tried to place load barriers before every XCHG because I wasn't sure about load ordering, but that didn't help (though they possibly are necessary). Also, I got the crash even using the old _sync* API. With this said, there should normally not be any ordering requirement between these two macros since the lost race is pessimistic and undoes everything to try again from the beginning.

But now I figured out the apparent root cause. @cognet's fix is necessary but not 100% correct: on TRY_ADDQ's rollback path it restores lh->next before restoring the other elements' pointers, which releases a concurrent MT_LIST_BEHEAD() that can then detach a list still containing an element whose prev or next is BUSY. That list is then appended as-is to the local list, and when LIST_DEL_INIT() passes over it, it stumbles over a BUSY pointer it does not expect. Changing that seems to have definitely fixed the issue for me (knocking on wood). I'm cleaning up my changes and will propose an updated version of Olivier's patch, but there's still an issue in it that I'd like him to have a look at before merging (it's in the TRY_ADD macro, which is not used, so it will have no impact).

Also, I hit a watchdog condition around the dequeue_listener() code where everything was correct. The lock was temporarily held by one thread and many were waiting. Thus for me it was clearly a fairness issue, with one thread never managing to get it. I've just added your isb suggestion to pl_cpu_relax and haven't hit it since. In addition, the req/s increased from 107.1 to 112.7 with the isb; that's >5%, which is pretty nice!

@wtarreau nice! Can you try the same test but compiling with -march=native or -march=armv8.2-a, depending on whether the OS you're using understands the core or not? I wonder what the performance is like with the new atomics, which are also fair.

The reason I say there appears to be an ordering assumption with the XCHGs is that I don't see the store barriers paired with anything. The store barriers are creating an ordering, but I don't see that same ordering being enforced on the load side, which tends to be required.
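
(For reference, the rebuilds being compared here amount to flag changes along these lines; CPU_CFLAGS is an assumption about how the build is driven.)

# GCC 10 default behaviour: runtime selection between LSE and exclusives
$ make -j"$(nproc)" TARGET=linux-glibc USE_THREAD=1 CPU_CFLAGS="-O2 -moutline-atomics"

# target the host CPU directly (uses LSE, not backward compatible with ARMv8.0)
$ make -j"$(nproc)" TARGET=linux-glibc USE_THREAD=1 CPU_CFLAGS="-O2 -march=native"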

Just tried with -march=native and I'm shocked: performance went up by +250% (101k -> 357k). I'm still looking for issues but can't find anything wrong, so very likely the native atomics succeed way more often than the LL/SC ones!

Regarding the possible missing barriers I agree on principle because during my first review I'm sure I noticed some that I addressed. They're not in the patches above as I'll have to carefully diff the code and review each of them individually (since I initially thought that more were missing and I was wrong). But yes I've found at least 1 or 2 where I couldn't find any barrier between the completed XCHG and the other user.

It just died with a corrupted tree node in the scheduler :-(

(gdb) bt
#0  eb32sc_insert (root=root@entry=0xaaaaaad7b590 <task_per_thread+400>, new=new@entry=0xffffdca43698, scope=8)
    at src/eb32sctree.c:181
#1  0x0000aaaaaabfb964 in __task_wakeup (t=0xffffdca43680, root=0xaaaaaad7b590 <task_per_thread+400>)
    at src/task.c:149
#2  0x0000aaaaaab37688 in task_wakeup (f=16384, t=<optimized out>) at include/haproxy/task.h:195
#3  h1_release (h1c=0xffffdcbeac90) at src/mux_h1.c:783
#4  0x0000aaaaaab3ae8c in h1_detach (cs=<optimized out>) at src/mux_h1.c:2499
#5  0x0000aaaaaab2214c in cs_destroy (cs=0xffffc9651e80) at include/haproxy/connection.h:508
#6  si_release_endpoint (si=si@entry=0xffffa498acf0) at include/haproxy/stream_interface.h:182
#7  0x0000aaaaaab2a354 in stream_free (s=0xffffa498a9f0) at src/stream.c:689
#8  process_stream (t=0xffffc5512610, context=0xffffa498a9f0, state=<optimized out>) at src/stream.c:2430

I'm having a problem when I read the tasklet_wakeup() code because I see nothing that can protect against the competing MT_LIST_BEHEAD() in process_runnable_tasks(): if we extract only a single task, the list will have a single entry which will appear empty, and allow tasklet_wakeup() to queue it into the shared list again, while the other function will add it to the thread-local list and use it unlocked. This is not specific to the architecture and is rather a design corner case that we have in the scheduler. I'll have to discuss this with Olivier. Also I think that the LIST_DEL_INIT() done while dequeueing that local task could also make the entry randomly appear as empty and mess with it there.

The atomics are fair unlike the exclusives, and the cost of acquiring a contended one is lower, so for something with lots of contention it makes a huge difference. I've seen those types of gains in databases before, so it's not unheard of, and really the reason I mentioned the detection mechanism above. The gains can be so huge that the runtime check is worth it.

OK but really no, that's a limit I'm not going to cross any time soon. Dynamic checks at the core of the innermost loops and everywhere atomics are used are going to mess with the compiler's optimizations and branch prediction, will significantly inflate the code, and worse, will make it way harder to debug. The code is open source, and the 0.001% of users who need to squeeze out the last drop of extra performance already compile it themselves. We're speaking about performance numbers that are 100 to 1000 times beyond what the 1% of highest-performance users already need. So I'm really not willing to go through that hassle just to save 6 seconds of build for possibly 4 or maybe 5 persons in the world :-/

Ok but really no

Understood.

I have nothing new to add, just wanted to reiterate my endless thanks for continuing to tackle this one!

@apenney if your machines are still crashing often you may apply the two patches I proposed above. For me that doesn't crash anymore, unless built with -march=native (which we still need to properly address). So I don't claim it will solve everything yet but I think it will be way more stable than what it currently is.

I've just implemented a scheduler stress test which significantly eases these tests. Now with -march=native I can crash it in less than one second with only 2 threads. Without, it takes 16 threads and 128 tasks to crash in roughly 20 seconds. This will definitely help getting this fixed! For those curious, I'm just issuing this:

$ socat - /tmp/sock1 <<< "expert-mode on;debug dev sched tasklet single=1 count=128"

Seems like a great thing to run under tsan and see what it flags

Possibly, but I don't know tsan at all, I first learned its name in your previous message :-/

I think I'm missing something. I grabbed the latest changes with the sched test you just committed and haven't seen it crash with expert-mode on;debug dev sched tasklet single=1 count=128. Can you share the config you're starting haproxy with?

Separately, running the sanitizer at haproxy startup flags a number of data races around the update_list, sleeping_thread_mask and fdtab. I'm not sure how many of them are true issues, but I used acquire/release operations to make it happy there. When running the debug command above I see it flag races between writes to the pool allocator's memory and its use, but nothing else, and as I mentioned I don't see the crash.

Some notes on the thread sanitizer:
https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual
https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces
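
For reference, a minimal sketch of how such a TSan build could look. This assumes haproxy's usual make variables (TARGET, USE_THREAD, CPU_CFLAGS, LDFLAGS); -fsanitize=thread has to reach both the compile and the link step, and the exact variable names may need adjusting to your environment and haproxy version:

$ make -j TARGET=linux-glibc USE_THREAD=1 \
       CPU_CFLAGS="-O1 -g -fsanitize=thread" \
       LDFLAGS="-fsanitize=thread"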

Sure, the config only has this:

global
        stats socket /tmp/sock1 mode 666 level admin

I managed to get it to die even with 2 threads but it's easier with 16 it seems.

There definitely is a race between LIST_DEL_INIT() in run_from_tasklet_list() and MT_LIST_TRY_ADDQ() from tasklet_wakeup_on(). I'm now working on replacing that with an extra flag to reliably indicate that the tasklet is in the queue. We used to rely on the fact that it wasn't detached, back in the early days when tasklet_wakeup() was not permitted on a foreign task, but nowadays it happens when taking over an FD during connect_server() and when dispatching incoming connections, the two initial places where Ashley found many crashes.

I think we'd rather stop trying to debug this until the flag is present; the model is not reliable otherwise. By the way, the race with LIST_DEL_INIT() cannot simply be fixed by using MT_LIST_DEL(), because the condition will also match the result of an MT_LIST_BEHEAD() having exactly one foreign tasklet.
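
To illustrate the idea, here is a minimal generic sketch of that "queued" flag technique (not the actual haproxy code; the TL_IN_LIST flag and the helper names are hypothetical and used only for illustration): the waker atomically sets the bit and only enqueues the tasklet if it flipped the bit from 0 to 1, and the dequeuing side clears the bit before running the handler, so list emptiness is never used as the criterion.

#define TL_IN_LIST  0x1

struct my_tasklet {
    unsigned int state;   /* flag bits, always accessed atomically */
    /* ... list element, handler, context ... */
};

void shared_list_append(struct my_tasklet *tl);   /* hypothetical MT-safe enqueue */

void my_tasklet_wakeup(struct my_tasklet *tl)
{
    /* only the caller that flips the bit from 0 to 1 may enqueue */
    unsigned int prev = __atomic_fetch_or(&tl->state, TL_IN_LIST, __ATOMIC_SEQ_CST);

    if (!(prev & TL_IN_LIST))
        shared_list_append(tl);
}

void my_tasklet_run(struct my_tasklet *tl)
{
    /* clear the bit before running, so a concurrent wakeup may safely re-queue it */
    __atomic_and_fetch(&tl->state, ~TL_IN_LIST, __ATOMIC_SEQ_CST);
    /* ... call the tasklet's handler here ... */
}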

Now I made it happen... I attached the sanitizer output, but this one seems to be the race you found.
ngnix-tsan-sched.txt

WARNING: ThreadSanitizer: data race (pid=40972)
  Read of size 8 at 0xfffff5801360 by thread T2:
    #0 tasklet_wakeup_on include/haproxy/task.h:354 (haproxy+0x25e434)
    #1 tasklet_wakeup include/haproxy/task.h:369 (haproxy+0x25e434)
    #2 debug_tasklet_handler src/debug.c:715 (haproxy+0x25e434)
    #3 run_tasks_from_lists src/task.c:465 (haproxy+0x22bacc)
    #4 process_runnable_tasks src/task.c:689 (haproxy+0x22d884)
    #5 run_poll_loop src/haproxy.c:2851 (haproxy+0x1dbcd0)
    #6 run_thread_poll_loop src/haproxy.c:3016 (haproxy+0x1dc568)
    #7 <null> <null> (libtsan.so.0+0x2fb6c)

  Previous write of size 8 at 0xfffff5801360 by main thread:
    #0 __tasklet_remove_from_tasklet_list include/haproxy/task.h:389 (haproxy+0x22ba94)
    #1 run_tasks_from_lists src/task.c:464 (haproxy+0x22ba94)
    #2 process_runnable_tasks src/task.c:689 (haproxy+0x22d884)
    #3 run_poll_loop src/haproxy.c:2851 (haproxy+0x1dbcd0)
    #4 run_thread_poll_loop src/haproxy.c:3016 (haproxy+0x1dc568)
    #5 main src/haproxy.c:3710 (haproxy+0x1e0c70)

Regarding the races you've found, I suspect the initial ones are harmless because of the way we need to boot (we have to collect previously stored events and re-apply them to each thread). The one on the pool allocator could possibly be related to the double-word CAS: we're reading both initial values, storing them in a local struct and performing the comparison. There's no load barrier here, but I don't think it would change anything: in the worst case we'd fail the comparison when trying to write the changed value back, I guess.
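
For reference, here is a minimal generic sketch of that double-word CAS pattern (not haproxy's pool code; it assumes a target where a 16-byte __atomic_compare_exchange is available, e.g. armv8.1-a atomics or x86-64 with -mcx16/libatomic). Both words are read with plain loads into a local copy, and the CAS validates them at write time, so a stale read just makes the comparison fail and the loop retry:

struct dwords {
    void          *head;    /* free-list head */
    unsigned long  serial;  /* bumped on each pop to defeat ABA */
};

static void *pop_free(struct dwords *list)
{
    struct dwords old, new;

    old = *list;                          /* plain, non-atomic snapshot */
    do {
        if (!old.head)
            return NULL;
        new.head   = *(void **)old.head;  /* next pointer stored in the object itself */
        new.serial = old.serial + 1;
    } while (!__atomic_compare_exchange(list, &old, &new, 0,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
    return old.head;
}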

Now after modifying the code to make use of a flag to mention the presence in the tasklet list, I can't make the process fail anymore at all! I'll clean these up and refresh Olivier's patch, then will propose it here.

So finally I pushed all the fixes after hours of testing. @apenney you should definitely build using CPU_CFLAGS="-O2 -march=native". I now reach 400-420k req/s here, versus 101k in armv8-a mode, so the scaling will be much better, and it's even likely that you could reach the same performance as before on a smaller machine and achieve more stable results. For now it's only in 2.4-dev. If you could retest before we backport to 2.3 and then 2.2, this would be greatly appreciated. If you need me to prepare a set of backport patches to apply on top of 2.3 or 2.2 instead, let me know and I'll do them.
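
For example, something along these lines should do it (just a sketch; adjust TARGET and the USE_* options to whatever you normally build with):

$ make -j TARGET=linux-glibc USE_OPENSSL=1 CPU_CFLAGS="-O2 -march=native"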

@AGSaidi I tried tsan, it was reasonably easy. In my opinion what is reported now is only false positives, essentially due to the fact that we use atomic writes to guarantee consistency and that there are non-atomic reads for info that is not critical. I also implemented the double-cas using the casp instruction available in armv8.1-a atomics, but the code is now much bigger and very slightly slower. I suspect that it's simply due to the fact that it's not as performance critical anymore as it used to be, and that the performance gain at the instruction level is offset by the larger code around.

I'll get a box up this afternoon/tomorrow morning with 2.4-dev and we'll get you some results!

Excellent, thank you!

Thanks so much for diving into this @wtarreau.

Regarding TSAN, I don't see barriers used after the various flags are consumed. I could be missing them, but the core isn't enforcing an order between the load that observes the flag being set and subsequent loads; in that case one of the subsequent loads could be performed ahead of the load of the flag.

For example, the following is insufficient to guarantee that Thread1 reads global_data after flag is set. A correct implementation requires either __ATOMIC_ACQUIRE as the memory order or a __ha_barrier_load() after the while loop.

Thread1:

while (!__atomic_load_n(&flag, __ATOMIC_RELAXED)) 
      pause();

local_data = global_data;

Thread2:

global_data = 42;
__ha_barrier_store();
flag = 1;
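
For completeness, a corrected Thread1 under that assumption could look like either of the following (a sketch only, with __ha_barrier_load() being the load-barrier helper mentioned above):

/* option 1: acquire load on the flag */
while (!__atomic_load_n(&flag, __ATOMIC_ACQUIRE))
      pause();

local_data = global_data;

/* option 2: relaxed load followed by an explicit load barrier */
while (!__atomic_load_n(&flag, __ATOMIC_RELAXED))
      pause();

__ha_barrier_load();
local_data = global_data;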

I agree, I noticed a few that matched this pattern, and in my opinion the reason it currently works is that there tend to be some nearby instructions which introduce these barriers. Initially we were only using sequentially consistent atomics, but these were progressively relaxed, with explicit store barriers factored in. However, I'm fairly certain some load barriers are missing, which is what I initially tried to add.

The canary with the 2.4 stuff is a day late - I needed to roll all our various config changes on intel out to all our running nodes, so that's kept me busy. I finally wrapped that up this afternoon, so tomorrow I'll start figuring out how to wrangle our automation to let me drop an updated 2.4 binary in! Sorry for the delay in confirming we're good here.

Ashley, no stress, really! I'm already thankful for all the tests you've gone through, and I know it's never easy to deploy test code without taking too much risk, not even mentioning the time it usually takes. As I said, if it's easier for you to run a patched 2.3 or 2.2, let me know and I'll work on the backports. And if it's too much of a pain to test, don't go through the hassle, you'll upgrade next time you have an opportunity. The positive point I'm seeing is that you should soon be able to run on smaller machines and make savings this way :-)

Alright, it took me a while, but the tip of git as of about 30 minutes ago is now live and taking traffic! I accidentally brought up the wrong binary (2.3.1) and it crashed in about 20 seconds; since switching to 2.4 a few minutes ago it has stopped crashing, so fingers crossed.

Just for the record this is with march=native as well, so we should hopefully see good performance!

Still no crashes, so it's looking like (hopefully) things are good. I'll give it another 24 hours so it gets a full test. For some reason these arm instances always take less traffic than intel boxes (normally the exchanges spread traffic very evenly and quickly after adding a new box), so I have to wait for peak traffic volumes to ensure it stays up.

(It's weird, all the other boxes are doing 50k/sec and this one ARM instance does 40k/sec, but that's unrelated to any of the crashing stuff!)

BTW regarding the fact they take less traffic, it's possible that your clients optimize the requests based on the average response time and that these ones are slightly slower for each request. With more cores you can get a higher throughput, but each core is slower, so the requester possibly prefers the other ones (or optimizes the requests to balance all your machines' response times).

There are a few options you may be interested in trying; sometimes they help further improve latency at the expense of higher CPU usage:

  • tune.sched.low-latency on # always pick high-priority tasks before other ones
  • tune.runqueue-depth 50 # process fewer tasks per round (def: 200)
  • tune.maxpollevents 50 # process fewer I/O events per round (def: 200)
  • busy-polling # prevents CPU from sleeping (harder to monitor however)

But if you're already satisfied with the performance, better not to change them.
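
For reference, a minimal sketch of where they'd go (they are all global-section keywords; the values are just the examples above, not recommendations):

global
        tune.sched.low-latency on
        tune.runqueue-depth 50
        tune.maxpollevents 50
        busy-polling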

Performance slowly crept up overnight, and the boxes are now taking a range of 39k (this arm machine) to 41k, so I'll consider that "close enough" :)

Zero crashes; it looks like all these patches are spot on. Excited to see them backported to 2.3, and once that happens we'll give ARM another try in a couple of regions. Thank you once again for everything you dug into here, the patches, and the sheer effort it took to fix these.

That's excellent, thanks for your tests! We'll backport all of these and then close the issue. I'd really encourage you to try a smaller machine "just to see". Network performance is never linear. If you're at, say, less than 40% CPU, maybe with only 8 cores you'll be at 60-70%.

I'll try some experiments for sure - I'll have to do a bunch of wrangling on our side to get geared up (we do a bunch of cpu affinity and it all assumes at least 16 cores). I also need to validate that it's even getting us a benefit at our workload levels in the first place (it would be a lot easier to experiment if we didn't try to do any pinning and just let haproxy/the nic use the entire machine).

I suppose that's a question for you: do you see enough of a benefit from core pinning / ensuring the nic never touches haproxy cores to make it worth continuing the work to make it testable with fewer cores?

It used to be critically important in the old days when we were using nbproc, because a connection was affine to a process. With threads and the inbound connection load-balancing we figured that it became unneeded and sometimes even counter-productive, because if for any reason one of your CPUs is less loaded (e.g. NIC queues are imbalanced) you'd prefer the scheduler to place a very busy thread there. And for example if suddenly you see ksoftirqd pop up at 100% CPU on a core, the thread running on the same core will be assigned less traffic and will nicely coexist with this transient situation.

In our ALOHA load balancers we carefully pinned all the NIC IRQs and haproxy itself for a decade, and we recently stopped, for good (and it's much simpler as a bonus). Not even having a cpu-map directive in the global section makes dynamic adjustments so much easier :-)

There's one case where pinning can make sense: when your NIC delivers a single IRQ and a single queue. Then you'd rather pin it and keep that core away from haproxy. But a simple taskset -c is enough in this case.
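
For example (a sketch only, assuming the NIC's single IRQ is number 45 and a 16-core box; adjust to your actual IRQ and core numbering):

# pin the NIC's only IRQ to core 0 (smp_affinity takes a hex CPU mask)
$ echo 1 > /proc/irq/45/smp_affinity

# run haproxy on the remaining cores, keeping core 0 for the IRQ
$ taskset -c 1-15 haproxy -f /etc/haproxy/haproxy.cfg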

Just a quick update, and a close for the issue because we're fixed and good:

I switched 1 node over from a c5.4xlarge to a c5.2xlarge (after removing all the nic affinity/cpu-map stuff) and so far it's performing exactly the same as the other nodes in the cluster, so it looks like you're spot on.

We're going to wait for the backports and then move all nodes back to ARM. We're also investigating something like NixOS or some other distribution that would allow us to be much more aggressive in how we compile and run haproxy, rather than relying on distro packages that just use -O2 and nothing else.

Once again, appreciate every bit of help in this issue, we've remained rock solid on ARM since the patch set, so I'm confidently closing this!

Great, thanks for the update. I'm reopening just to keep track of the remaining backports.

FYI, the patchset is pending for 2.2. We will wait a bit to be sure no bug is reported on 2.3 before backporting the patches to 2.2. The next 2.2 release will probably go out without them.

I got some bad news @wtarreau - I rolled out 2.3.3 last night and noticed overnight:

Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]: Thread 8 is about to kill the process.
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:  >Thread 1 : id=0xffff82fe0010 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=1 rqsz=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              stuck=1 prof=0 harmless=0 wantrdv=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cpu_ns: poll=2854747231845 now=2857161391047 diff=2414159202
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              curr_task=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              call trace(8):
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3b3121c [00 07 00 f0 63 a2 27 91]: debug_handler+0x84/0x15c
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xffff839445b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3b16fdc [fc 03 00 aa 40 07 00 b4]: process_runnable_tasks+0x6a4/0x8a8
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:   Thread 2 : id=0xffff82fd4a20 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              stuck=0 prof=0 harmless=0 wantrdv=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cpu_ns: poll=2802952496998 now=2803618136989 diff=665639991
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              curr_task=0xffff5837dcb0 (task) calls=1 last=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:                fct=0xaaaae3a35c50(process_stream) ctx=0xffff7051cb80
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              strm=0xffff7051cb80 src=64.63.32.71 fe=lb_ssl be=fes dst=fes-13
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              af=(nil),0 csf=0xffff581754c0,104000
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              ab=(nil),0 csb=(nil),0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cof=0xffff5c1d9d60,80001300:H1(0xaaab22573550)/SSL(0xffff683d2070)/tcpv4(10190)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:   Thread 3 : id=0xffff80939a20 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=8 rqsz=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              stuck=0 prof=0 harmless=0 wantrdv=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cpu_ns: poll=2748334108579 now=2749102951023 diff=768842444
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              curr_task=0xffff7058a1e0 (task) calls=1 last=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:                fct=0xaaaae3a35c50(process_stream) ctx=0xffff58f252b0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              strm=0xffff58f252b0 src=64.63.32.116 fe=lb_ssl be=fes dst=fes-2
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              af=(nil),0 csf=0xffff6c06f5d0,104000
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              ab=(nil),0 csb=(nil),0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cof=0xffff696a8580,80001300:H1(0xffff68c7bda0)/SSL(0xffff68891df0)/tcpv4(24316)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:  >Thread 4 : id=0xffff76fcca20 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=22 rqsz=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              stuck=1 prof=0 harmless=0 wantrdv=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cpu_ns: poll=3001443536439 now=3003849788972 diff=2406252533
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              curr_task=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              call trace(8):
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3b3121c [00 07 00 f0 63 a2 27 91]: debug_handler+0x84/0x15c
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xffff839445b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3afc41c [fc 03 00 aa a0 e8 ff b5]: listener_accept+0x58c/0x21f4
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae39b001c [eb 37 40 f9 ec 47 40 f9]: main+0x35e4
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:  >Thread 5 : id=0xffff767cba20 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=23 rqsz=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              stuck=1 prof=0 harmless=0 wantrdv=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cpu_ns: poll=2729802674399 now=2732029654819 diff=2226980420
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              curr_task=0xffff69840f90 (task) calls=1 last=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:                fct=0xaaaae3a35c50(process_stream) ctx=0xffff58e98bb0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              strm=0xffff58e98bb0 src=103.55.113.48 fe=lb be=fes dst=fes-13
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              af=(nil),0 csf=0xffff7c5bd260,104000
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              ab=(nil),0 csb=(nil),0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cof=0xffff6903cd60,80001300:H1(0xffff694e22e0)/RAW((nil))/tcpv4(14280)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              call trace(12):
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3b3121c [00 07 00 f0 63 a2 27 91]: debug_handler+0x84/0x15c
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xffff839445b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6a668 [44 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27c4
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6a668 [44 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27c4
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6c6fc [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a364d8 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cd0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:  >Thread 6 : id=0xffff75fcaa20 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=15 rqsz=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              stuck=1 prof=0 harmless=0 wantrdv=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cpu_ns: poll=2732606787560 now=2734985587842 diff=2378800282
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              curr_task=0xffff70a0afa0 (task) calls=1 last=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:                fct=0xaaaae3a35c50(process_stream) ctx=0xffff7cc99e50
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              strm=0xffff7cc99e50 src=103.104.50.135 fe=lb be=fes dst=fes-2
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              af=(nil),0 csf=0xffff6c8b41c0,104000
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              ab=(nil),0 csb=(nil),0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cof=0xffff692fe000,80001300:H1(0xffff684287a0)/RAW((nil))/tcpv4(17954)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              call trace(12):
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3b3121c [00 07 00 f0 63 a2 27 91]: debug_handler+0x84/0x15c
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xffff839445b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6a668 [44 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27c4
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6a668 [44 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27c4
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6c6fc [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a364d8 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cd0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:  >Thread 7 : id=0xffff757c9a20 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=5 rqsz=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              stuck=1 prof=0 harmless=0 wantrdv=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cpu_ns: poll=2713971480983 now=2716386324935 diff=2414843952
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              curr_task=0xffff6c8f50b0 (task) calls=1 last=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:                fct=0xaaaae3a35c50(process_stream) ctx=0xffff7d0f9f50
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              strm=0xffff7d0f9f50 src=139.5.86.68 fe=lb be=fes dst=fes-13
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              af=(nil),0 csf=0xffff5cb05ba0,104000
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              ab=(nil),0 csb=(nil),0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cof=0xffff687afee0,80001300:H1(0xffff686dc200)/RAW((nil))/tcpv4(15196)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              call trace(12):
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3b3121c [00 07 00 f0 63 a2 27 91]: debug_handler+0x84/0x15c
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xffff839445b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6a668 [44 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27c4
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6a668 [44 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27c4
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6c6fc [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a364d8 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cd0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]: *>Thread 8 : id=0xffff74fc8a20 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=5 rqsz=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              stuck=1 prof=0 harmless=0 wantrdv=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cpu_ns: poll=2717119300446 now=2719535922006 diff=2416621560
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              curr_task=0xffff60520a70 (task) calls=1 last=0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:                fct=0xaaaae3a35c50(process_stream) ctx=0xffff70e68fd0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              strm=0xffff70e68fd0 src=103.71.26.46 fe=lb_ssl be=fes dst=fes-5
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              rqf=4cd00002 rqa=8000 rpf=80000000 rpa=0 sif=EST,10020 sib=ASS,2000130
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              af=(nil),0 csf=0xffff7c0fd1b0,104000
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              ab=(nil),0 csb=(nil),0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cof=0xffff68d14bd0,80001300:H1(0xffff5818b080)/SSL(0xffff683315f0)/tcpv4(16166)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              call trace(18):
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3b70528 [f5 ff ff 17 c0 b2 00 91]: wdt_handler+0x158/0x194
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xffff839445b8 [68 11 80 d2 01 00 00 d4]: linux-vdso:__kernel_rt_sigreturn
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a49900 [53 0e 00 f0 42 0e 00 f0]: main+0x9cec8
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a49900 [53 0e 00 f0 42 0e 00 f0]: main+0x9cec8
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6559c [00 07 00 34 e4 f2 40 f9]: main+0xb8b64
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6a668 [44 0d 00 d0 f3 03 00 aa]: connect_server+0x15a8/0x27c4
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a6c6fc [74 06 40 f9 94 07 00 b4]: back_try_conn_req+0x114/0x71c
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2873]:              | 0xaaaae3a364d8 [80 03 40 39 1f 04 00 71]: process_stream+0x888/0x5cd0
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2871]: [NOTICE] 010/213933 (2871) : haproxy version is 2.3.3-1ppa1~focal
Jan 11 21:39:33 fes-lb-ap-northeast-1-ip-10-4-20-184 haproxy[2871]: [NOTICE] 010/213933 (2871) : path to executable is /usr/sbin/haproxy

Not sure if everything in the list of fixes made it into 2.3.3, but something is still going on. I'll get one of these machines set up to grab core dumps again like last time!

@apenney-beeswax Did you continue to see a problem here?