unbit/uwsgi

BUG: flaky exit-code when using need-app and lazy-apps flags together(+FIX)

Closed this issue · 3 comments

Description:

I created a Python application using Flask and uWSGI and it is managed by Supervisord.
Sometimes, the application crashes on initialization, because of a disconnection from another service.
In order to let Supervisord know that the application has crashed, I use the flag need-app: true.
So, when the application crashes on initialization, I expect from uWSGI to exit with exit-code 22, and from Supervisord to restart the uWSGI application until it successfully starts.
In order to achieve that I use this flag of Supervisord:

autorestart: unexpected

(Which by default tells Supervidord to restart the application whenever the exit-code is not 0)

This should work well, and actually does work most of the time.

But, the issue is that sometimes uWSGI exits with exit-code 0, even though it crashed on exception and need-app flag is set to true.
This behavior is flaky, the application can exit with 0 imidiately, or after 10,100,200,... successive restarts.

Root Cause Analysis:

I digged a little bit into the code and I think that I found the cause for this issue:

I found two patches that address this area in the code:

From 2014:
Issue: #622
PR for fix: f70c070

This is the code that should handle the issue on core/uwsgi.c in uwsgi_init_all_apps() function:

	// no app initialized and virtualhosting enabled
	if (uwsgi_apps_cnt == 0 && uwsgi.numproc > 0 && !uwsgi.command_mode) {
		if (uwsgi.need_app) {
			if (!uwsgi.lazy)
				uwsgi_log("*** no app loaded. GAME OVER ***\n");
			if (uwsgi.lazy_apps) {
				if (uwsgi.master_process) {
					if (kill(uwsgi.workers[0].pid, SIGINT)) {
						uwsgi_error("kill()");
					}
				}
			}
			exit(UWSGI_FAILED_APP_CODE);
		}
		else {
			uwsgi_log("*** no app loaded. going in full dynamic mode ***\n");
		}
	}

From 2016:
Issue: #1397
PR for fix: 65a8d67

This is the code that should handle the issue on core/master.c in master_loop() function:

// first check failed app loading in need-app mode
if (WIFEXITED(waitpid_status) && WEXITSTATUS(waitpid_status) == UWSGI_FAILED_APP_CODE) {
	if (uwsgi.lazy_apps && uwsgi.need_app) {
		uwsgi_log("OOPS ! failed loading app in worker %d (pid %d)\n", thewid, (int) diedpid);
		uwsgi_log_verbose("need-app requested, destroying the instance...\n");
		uwsgi.status.dying_for_need_app = 1;
		kill_them_all(0);
		continue;
	}
	else {
		uwsgi_log("OOPS ! failed loading app in worker %d (pid %d) :( trying again...\n", thewid, (int) diedpid);
	}
}

We can see that both patches are killing the process, by calling kill_them_all() function. One directly calls kill_them_all() with 0, and the the other by sending SIGINT (2), which is handled in the code and calls kill_them_all() in core/master.c in master_loop() function:

uwsgi_unix_signal(SIGINT, kill_them_all);

As we can see in the issue of 2016 (#1397), the complaint is that the process exits with code 0 instead of 22.
Which is what the 2014 PR (f70c070) caused, but after the newer 2016 PR (65a8d67) was merged it was fixed and now exits with code 22.
BUT, the two PRs live together in the code.
And I think that they are having a race-condition over killing of the process (thats what's causing the flakiness)

Reproduce:

  1. Create a simple python code that throws an exception in main.py (thus should crash on initialization):
raise Exception()

Run it with uwsgi, using these flags:
lazy-apps = true
need-app = true

Here's my full uwsgi.ini:

[uwsgi-app]
auto-procname = true
chdir = /home/app
module = main:wsgi_app
vacuum  = true
die-on-term = true
ignore-sigpipe = true
ignore-write-errors  = true
disable-write-exception = true
buffer-size = 65535
enable-threads = true
lazy-apps = true
worker-reload-mercy = 14400
need-app  = true
socket = /tmp/openresty-uwsgi.sock
socket-timeout = 600
safe-pidfile = /tmp/uwsgi.pid
chmod-socket = 666
  1. Create a script to run the uWSGI application in a loop and print its exit code until it exits with code 0 ( test.sh):
#!/bin/bash
MAX_RETRIES=100000
RETRY_INTERVAL=1
EXIT_CODE=0
for (( i=1; i<=$MAX_RETRIES; i++ )); do
    echo "Attempt $i/$MAX_RETRIES:"
    uwsgi --safe-pidfile /tmp/uwsgi.pid --ini /etc/uwsgi.ini:uwsgi-app --stats 127.0.0.1:1717 --stats-http
    EXIT_CODE=$?

    if [ $EXIT_CODE -eq 0 ]; then
        echo "Process exited successfully"
        break
    else
        echo "Process exited with non-zero exit code: $EXIT_CODE"

        if [ $i -lt $MAX_RETRIES ]; then
            echo "Retrying in $RETRY_INTERVAL seconds..."
            sleep $RETRY_INTERVAL
        else
            echo "Max retries reached. Exiting..."
            break
        fi
    fi
done

exit $EXIT_CODE
  1. Add a print to kill_them_all() function in core/uwsgi.c in-order to debug the calls:
// brutally destroy
void kill_them_all(int signum) {
        uwsgi_log("kill_them_all called %d \n", signum);

        if (uwsgi_instance_is_dying) return;

        uwsgi.status.brutally_destroying = 1;

        // unsubscribe if needed

        uwsgi_unsubscribe_all();

        uwsgi_log("SIGINT/SIGTERM received...killing workers...\n");

        int i;

        for (i = 1; i <= uwsgi.numproc; i++) {
                if (uwsgi.workers[i].pid > 0) {
                        uwsgi_curse(i, SIGINT);
                }
        }

        for (i = 0; i < uwsgi.mules_cnt; i++) {
                if (uwsgi.mules[i].pid > 0) {
                        uwsgi_curse_mule(i, SIGINT);
                }
        }

        uwsgi_destroy_processes();
}
  1. Also add a print to the uwsgi_init_all_apps() function in core/uwsgi.c to debug when this code gets executed:
// no app initialized and virtualhosting enabled
if (uwsgi_apps_cnt == 0 && uwsgi.numproc > 0 && !uwsgi.command_mode) {
        if (uwsgi.need_app) {
                if (!uwsgi.lazy)
                        uwsgi_log("*** no app loaded. GAME OVER ***\n");

                if (uwsgi.lazy_apps) {
                        if (uwsgi.master_process) {
                                uwsgi_log("Trying to kill twice\n");
                                if (kill(uwsgi.workers[0].pid, SIGINT)) {
                                      uwsgi_error("kill()");
                                }
                        }
                }
                exit(UWSGI_FAILED_APP_CODE);
        }
        else {
                uwsgi_log("*** no app loaded. going in full dynamic mode ***\n");
        }
}
  1. Compile uWSGI.

  2. Run the script ./test.sh

First run of the script:

Attempt 1/100000:
...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 49148)
spawned uWSGI worker 1 (pid: 49149, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on [127.0.0.1:1717](http://127.0.0.1:1717/) fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
  File "/home/app/./main.py", line 1, in <module>
    raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
Trying to kill twice
kill_them_all called 2 
SIGINT/SIGTERM received...killing workers...
worker 1 buried after 1 seconds
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited successfully

In this run, on the first attempt this process exited with code 0, we can see that kill_them_all() called once by receiving SIGINT (2) by the 2014 PR code.

Second run of the script:

Attempt 3/100000:
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 49341)
spawned uWSGI worker 1 (pid: 49342, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on [127.0.0.1:1717](http://127.0.0.1:1717/) fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
  File "/home/app/./main.py", line 1, in <module>
    raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
Trying to kill twice
kill_them_all called 2 
SIGINT/SIGTERM received...killing workers...
OOPS ! failed loading app in worker 1 (pid 49342)
Tue May  7 16:05:33 2024 - need-app requested, destroying the instance...
kill_them_all called 0 
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited with non-zero exit code: 22
Retrying in 1 seconds...

Attempt 4/100000:
...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 49344)
spawned uWSGI worker 1 (pid: 49345, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on 127.0.0.1:1717 fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
  File "/home/app/./main.py", line 1, in <module>
    raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
Trying to kill twice
kill_them_all called 2 
SIGINT/SIGTERM received...killing workers...
worker 1 buried after 1 seconds
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited successfully

Here we got to the 4th attempt, we can see that in the attempt before exiting with 0, it actually exited with 22 (as expected), and in that case we can see two calls for kill_them_all(), one with 0 (2016 patch) and one with 2 (SIGINT from the 2014 patch).
So, the results show that there ’s some race condition between the two patches over killing the process.
As we can see, when we kill by the first patch (2014) we exit with code 0 which is not the desired behavior, and when we kill by the second (2016) patch we get the desired behavior, the process exit with code 22, so other process managers like Supervisord could restart it in case of failure.

Solution:

Let’s remove the old patch from 2014 (here I commented it), compile, and see what happens:

// no app initialized and virtualhosting enabled
if (uwsgi_apps_cnt == 0 && uwsgi.numproc > 0 && !uwsgi.command_mode) {
        if (uwsgi.need_app) {
                if (!uwsgi.lazy)
                        uwsgi_log("*** no app loaded. GAME OVER ***\n");
                // if (uwsgi.lazy_apps) {
                        // if (uwsgi.master_process) {
                                uwsgi_log("Trying to kill twice\n");
                                //if (kill(uwsgi.workers[0].pid, SIGINT)) {
                                //      uwsgi_error("kill()");
                                //}
                        // }
                // }
                exit(UWSGI_FAILED_APP_CODE);
        }
        else {
                uwsgi_log("*** no app loaded. going in full dynamic mode ***\n");
        }
}

I re-ran the script after compiling uWSGI with the fix, and the process exit-code stayed stable on 22 for all exits as expected.
Here we see 83965 attempts, an attempt every one sec, so for almost 24 hours of successive restats of uWSGI the exit-code stayed 22 for every exit:

Attempt 83965/100000:
...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 61423)
spawned uWSGI worker 1 (pid: 61424, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on [127.0.0.1:1717](http://127.0.0.1:1717/) fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
  File "/home/app/./main.py", line 1, in <module>
    raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
OOPS ! failed loading app in worker 1 (pid 61424)
Thu May  9 17:26:04 2024 - need-app requested, destroying the instance...
kill_them_all called 0
SIGINT/SIGTERM received...killing workers...
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited with non-zero exit code: 22

@xrmx can you please help push this fix?

xrmx commented

With which version of uwsgi have you replicated this?

@xrmx
I initially encountered this with version 2.0.22.
But when I compiled uWSGI locally and tested it, I pulled the latest code from the "master" branch (1-2 weeks ago).