BUG: flaky exit-code when using need-app and lazy-apps flags together(+FIX)
Closed this issue · 3 comments
Description:
I created a Python application using Flask and uWSGI and it is managed by Supervisord.
Sometimes, the application crashes on initialization, because of a disconnection from another service.
In order to let Supervisord know that the application has crashed, I use the flag need-app: true
.
So, when the application crashes on initialization, I expect from uWSGI to exit with exit-code 22, and from Supervisord to restart the uWSGI application until it successfully starts.
In order to achieve that I use this flag of Supervisord:
autorestart: unexpected
(Which by default tells Supervidord to restart the application whenever the exit-code is not 0)
This should work well, and actually does work most of the time.
But, the issue is that sometimes uWSGI exits with exit-code 0, even though it crashed on exception and need-app
flag is set to true
.
This behavior is flaky, the application can exit with 0 imidiately, or after 10,100,200,... successive restarts.
Root Cause Analysis:
I digged a little bit into the code and I think that I found the cause for this issue:
I found two patches that address this area in the code:
From 2014:
Issue: #622
PR for fix: f70c070
This is the code that should handle the issue on core/uwsgi.c
in uwsgi_init_all_apps() function:
// no app initialized and virtualhosting enabled
if (uwsgi_apps_cnt == 0 && uwsgi.numproc > 0 && !uwsgi.command_mode) {
if (uwsgi.need_app) {
if (!uwsgi.lazy)
uwsgi_log("*** no app loaded. GAME OVER ***\n");
if (uwsgi.lazy_apps) {
if (uwsgi.master_process) {
if (kill(uwsgi.workers[0].pid, SIGINT)) {
uwsgi_error("kill()");
}
}
}
exit(UWSGI_FAILED_APP_CODE);
}
else {
uwsgi_log("*** no app loaded. going in full dynamic mode ***\n");
}
}
From 2016:
Issue: #1397
PR for fix: 65a8d67
This is the code that should handle the issue on core/master.c
in master_loop()
function:
// first check failed app loading in need-app mode
if (WIFEXITED(waitpid_status) && WEXITSTATUS(waitpid_status) == UWSGI_FAILED_APP_CODE) {
if (uwsgi.lazy_apps && uwsgi.need_app) {
uwsgi_log("OOPS ! failed loading app in worker %d (pid %d)\n", thewid, (int) diedpid);
uwsgi_log_verbose("need-app requested, destroying the instance...\n");
uwsgi.status.dying_for_need_app = 1;
kill_them_all(0);
continue;
}
else {
uwsgi_log("OOPS ! failed loading app in worker %d (pid %d) :( trying again...\n", thewid, (int) diedpid);
}
}
We can see that both patches are killing the process, by calling kill_them_all()
function. One directly calls kill_them_all()
with 0, and the the other by sending SIGINT (2), which is handled in the code and calls kill_them_all()
in core/master.c
in master_loop() function:
uwsgi_unix_signal(SIGINT, kill_them_all);
As we can see in the issue of 2016 (#1397), the complaint is that the process exits with code 0 instead of 22.
Which is what the 2014 PR (f70c070) caused, but after the newer 2016 PR (65a8d67) was merged it was fixed and now exits with code 22.
BUT, the two PRs live together in the code.
And I think that they are having a race-condition over killing of the process (thats what's causing the flakiness)
Reproduce:
- Create a simple python code that throws an exception in
main.py
(thus should crash on initialization):
raise Exception()
Run it with uwsgi, using these flags:
lazy-apps = true
need-app = true
Here's my full uwsgi.ini
:
[uwsgi-app]
auto-procname = true
chdir = /home/app
module = main:wsgi_app
vacuum = true
die-on-term = true
ignore-sigpipe = true
ignore-write-errors = true
disable-write-exception = true
buffer-size = 65535
enable-threads = true
lazy-apps = true
worker-reload-mercy = 14400
need-app = true
socket = /tmp/openresty-uwsgi.sock
socket-timeout = 600
safe-pidfile = /tmp/uwsgi.pid
chmod-socket = 666
- Create a script to run the uWSGI application in a loop and print its exit code until it exits with code 0 (
test.sh
):
#!/bin/bash
MAX_RETRIES=100000
RETRY_INTERVAL=1
EXIT_CODE=0
for (( i=1; i<=$MAX_RETRIES; i++ )); do
echo "Attempt $i/$MAX_RETRIES:"
uwsgi --safe-pidfile /tmp/uwsgi.pid --ini /etc/uwsgi.ini:uwsgi-app --stats 127.0.0.1:1717 --stats-http
EXIT_CODE=$?
if [ $EXIT_CODE -eq 0 ]; then
echo "Process exited successfully"
break
else
echo "Process exited with non-zero exit code: $EXIT_CODE"
if [ $i -lt $MAX_RETRIES ]; then
echo "Retrying in $RETRY_INTERVAL seconds..."
sleep $RETRY_INTERVAL
else
echo "Max retries reached. Exiting..."
break
fi
fi
done
exit $EXIT_CODE
- Add a print to
kill_them_all()
function incore/uwsgi.c
in-order to debug the calls:
// brutally destroy
void kill_them_all(int signum) {
uwsgi_log("kill_them_all called %d \n", signum);
if (uwsgi_instance_is_dying) return;
uwsgi.status.brutally_destroying = 1;
// unsubscribe if needed
uwsgi_unsubscribe_all();
uwsgi_log("SIGINT/SIGTERM received...killing workers...\n");
int i;
for (i = 1; i <= uwsgi.numproc; i++) {
if (uwsgi.workers[i].pid > 0) {
uwsgi_curse(i, SIGINT);
}
}
for (i = 0; i < uwsgi.mules_cnt; i++) {
if (uwsgi.mules[i].pid > 0) {
uwsgi_curse_mule(i, SIGINT);
}
}
uwsgi_destroy_processes();
}
- Also add a print to the
uwsgi_init_all_apps()
function incore/uwsgi.c
to debug when this code gets executed:
// no app initialized and virtualhosting enabled
if (uwsgi_apps_cnt == 0 && uwsgi.numproc > 0 && !uwsgi.command_mode) {
if (uwsgi.need_app) {
if (!uwsgi.lazy)
uwsgi_log("*** no app loaded. GAME OVER ***\n");
if (uwsgi.lazy_apps) {
if (uwsgi.master_process) {
uwsgi_log("Trying to kill twice\n");
if (kill(uwsgi.workers[0].pid, SIGINT)) {
uwsgi_error("kill()");
}
}
}
exit(UWSGI_FAILED_APP_CODE);
}
else {
uwsgi_log("*** no app loaded. going in full dynamic mode ***\n");
}
}
-
Compile uWSGI.
-
Run the script
./test.sh
First run of the script:
Attempt 1/100000:
...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 49148)
spawned uWSGI worker 1 (pid: 49149, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on [127.0.0.1:1717](http://127.0.0.1:1717/) fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
File "/home/app/./main.py", line 1, in <module>
raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
Trying to kill twice
kill_them_all called 2
SIGINT/SIGTERM received...killing workers...
worker 1 buried after 1 seconds
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited successfully
In this run, on the first attempt this process exited with code 0, we can see that kill_them_all()
called once by receiving SIGINT (2) by the 2014 PR code.
Second run of the script:
Attempt 3/100000:
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 49341)
spawned uWSGI worker 1 (pid: 49342, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on [127.0.0.1:1717](http://127.0.0.1:1717/) fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
File "/home/app/./main.py", line 1, in <module>
raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
Trying to kill twice
kill_them_all called 2
SIGINT/SIGTERM received...killing workers...
OOPS ! failed loading app in worker 1 (pid 49342)
Tue May 7 16:05:33 2024 - need-app requested, destroying the instance...
kill_them_all called 0
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited with non-zero exit code: 22
Retrying in 1 seconds...
Attempt 4/100000:
...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 49344)
spawned uWSGI worker 1 (pid: 49345, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on 127.0.0.1:1717 fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
File "/home/app/./main.py", line 1, in <module>
raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
Trying to kill twice
kill_them_all called 2
SIGINT/SIGTERM received...killing workers...
worker 1 buried after 1 seconds
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited successfully
Here we got to the 4th attempt, we can see that in the attempt before exiting with 0, it actually exited with 22 (as expected), and in that case we can see two calls for kill_them_all(), one with 0 (2016 patch) and one with 2 (SIGINT from the 2014 patch).
So, the results show that there ’s some race condition between the two patches over killing the process.
As we can see, when we kill by the first patch (2014) we exit with code 0 which is not the desired behavior, and when we kill by the second (2016) patch we get the desired behavior, the process exit with code 22, so other process managers like Supervisord could restart it in case of failure.
Solution:
Let’s remove the old patch from 2014 (here I commented it), compile, and see what happens:
// no app initialized and virtualhosting enabled
if (uwsgi_apps_cnt == 0 && uwsgi.numproc > 0 && !uwsgi.command_mode) {
if (uwsgi.need_app) {
if (!uwsgi.lazy)
uwsgi_log("*** no app loaded. GAME OVER ***\n");
// if (uwsgi.lazy_apps) {
// if (uwsgi.master_process) {
uwsgi_log("Trying to kill twice\n");
//if (kill(uwsgi.workers[0].pid, SIGINT)) {
// uwsgi_error("kill()");
//}
// }
// }
exit(UWSGI_FAILED_APP_CODE);
}
else {
uwsgi_log("*** no app loaded. going in full dynamic mode ***\n");
}
}
I re-ran the script after compiling uWSGI with the fix, and the process exit-code stayed stable on 22 for all exits as expected.
Here we see 83965 attempts, an attempt every one sec, so for almost 24 hours of successive restats of uWSGI the exit-code stayed 22 for every exit:
Attempt 83965/100000:
...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 61423)
spawned uWSGI worker 1 (pid: 61424, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on [127.0.0.1:1717](http://127.0.0.1:1717/) fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
File "/home/app/./main.py", line 1, in <module>
raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
OOPS ! failed loading app in worker 1 (pid 61424)
Thu May 9 17:26:04 2024 - need-app requested, destroying the instance...
kill_them_all called 0
SIGINT/SIGTERM received...killing workers...
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited with non-zero exit code: 22
With which version of uwsgi have you replicated this?