stub_status counter leak when killing old workers
Opened this issue · 3 comments
Environment
Include the result of the following commands:
nginx -V
nginx version: nginx/1.27.1
built by gcc 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
built with OpenSSL 3.0.2 15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-http_v3_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -ffile-prefix-map=/data/builder/debuild/nginx-1.27.1/debian/debuild-base/nginx-1.27.1=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
uname -a
Linux lb 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Description
- The bug is reproducible with the latest version of nginx
- The nginx configuration is minimized to the smallest possible
to reproduce the issue and doesn't contain third-party modules
Hello,
we've started killing (sending SIGTERM) "old" nginx workers (those showing nginx: worker process is shutting down), because we reload the configuration regularly and have a lot of long-lived websocket connections.
Since we started doing this, the counters reported by stub_status are incorrect.
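For illustration, the cleanup we run looks roughly like this (a sketch of our own helper, not anything shipped with nginx):

# send SIGTERM to every worker that is still draining connections after a reload
pkill -TERM -f 'nginx: worker process is shutting down'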
nginx status
> curl localhost/nginx_status
Active connections: 65369
server accepts handled requests
1042173178 1042173178 5035465167
Reading: 0 Writing: 31968 Waiting: 5356
Adding up Writing and Waiting gives only 37324 instead of the 65369 "active connections". But even 37324 is too high; the actual number of connections should be around this:
> ss | grep https | wc -l
7327
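The comparison we use is roughly the following (a sketch; it assumes stub_status is served at /nginx_status and that all relevant client traffic is on the https port):

# stub_status "Active connections" vs. actually established sockets
active=$(curl -s localhost/nginx_status | awk '/Active connections/ {print $3}')
sockets=$(ss -t state established | grep -c https)
echo "stub_status: $active, established https sockets: $sockets"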
It's reproducible, e.g. using echo.websocket.org:
nginx configuration
map $http_upgrade $connection_upgrade {
default upgrade;
'' close;
}
server {
listen 1234;
location / {
proxy_set_header Host echo.websocket.org;
proxy_ssl_server_name on;
proxy_ssl_name echo.websocket.org;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_pass https://echo.websocket.org:443;
}
location = /basic_status {
stub_status;
}
}
Test
Open 2 tabs in the browser with the URL: http://localhost:1234/.ws
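(Alternatively, the two connections can be opened from a terminal; websocat is a third-party client and only an assumption here, not part of the original steps:)

# run once in each of two terminals; each invocation holds one websocket open
websocat ws://127.0.0.1:1234/.ws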
Now there are 3 active connections (one of them is the request to /basic_status itself):
# curl localhost:1234/basic_status
Active connections: 3
server accepts handled requests
4 4 4
Reading: 0 Writing: 3 Waiting: 0
After that, reload nginx; an old worker now shows up as nginx: worker process is shutting down:
# systemctl reload nginx
# ps aux | grep [n]ginx
root 2233850 0.0 0.0 55372 5652 ? Ss 22:29 0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx 2233851 0.0 0.0 55896 6044 ? S 22:29 0:00 nginx: worker process is shutting down
nginx 2233940 0.0 0.0 55868 5400 ? S 22:30 0:00 nginx: worker process
nginx 2233941 0.0 0.0 55868 5240 ? S 22:30 0:00 nginx: worker process
# kill 2233851
After we've killed the old process, the websocket clients in the browser reconnect. stub_status then reports two additional connections, even though the old connections are gone:
# curl localhost:1234/basic_status
Active connections: 5
server accepts handled requests
7 7 7
Reading: 0 Writing: 5 Waiting: 0
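The drift accumulates with every cycle. A rough way to watch it (a sketch, assuming the test setup above and clients that reconnect on their own):

# repeat reload + kill a few times and watch "Active connections" grow
for i in 1 2 3; do
    systemctl reload nginx
    sleep 2
    pkill -TERM -f 'nginx: worker process is shutting down'
    sleep 2
    curl -s localhost:1234/basic_status | grep 'Active connections'
done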
Only the worker processes are aware of the connections and can decrement the counters on connection closure. As soon as you terminate the process, all the connection state it held is gone, and we're no longer able to guarantee a valid state of the counters.
Consider using worker_shutdown_timeout instead.
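For reference, that would look roughly like this (a sketch; the 30s value is only an example and the config path matches the nginx -V output above):

# add to the main (top-level) context of /etc/nginx/nginx.conf:
#     worker_shutdown_timeout 30s;
# then verify and reload:
nginx -t && systemctl reload nginx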
Thank you for the feedback. According to the documentation I thought it would be ok:
Individual worker processes can be controlled with signals as well, though it is not required. The supported signals are:
TERM, INT | fast shutdown
[...]
I assumed sending a TERM would behave similarly to what happens when the worker shutdown timeout expires.
The documentation says that SIGTERM and SIGINT invoke fast shutdown but doesn't explain the difference between the "fast" and the "graceful" shutdown procedures.
Graceful shutdown closes the idle connections and waits until the remaining clients are served.
The worker_shutdown_timeout timer only runs during the graceful shutdown, assumes that it's fine to delay the exit, and calls the appropriate close methods for the remaining connections. That includes updating the stats and sending the necessary protocol packets (a GOAWAY frame for HTTP/2 and HTTP/3).
Fast shutdown does a very minimal set of things before exiting: it runs exit_process handlers for the configured modules, logs a message and that's it. Notably, that does not include closing the connections, sending any packets or doing anything else that may block or delay the exit. All the remaining cleanup is left to the OS.
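In signal terms, for a single worker the two paths look like this (a sketch based on the documented worker signals; the PID is a placeholder):

kill -TERM <worker_pid>   # fast shutdown: exits without closing connections,
                          # so the shared stub_status counters stay as they were
kill -QUIT <worker_pid>   # graceful shutdown: closes connections properly and
                          # updates the counters (may wait on active websockets)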
I don't think we want to change this behavior, but clarifying the consequences in the documentation could be useful.