Segfaults in Python 3.11 on UWSGI worker close
charlieevett opened this issue · 8 comments
We are using Python Pulsar producer clients in a web app that runs in UWSGI. We have our settings set to terminate and relaunch workers after a certain number of requests. We recently upgrade to Python 3.11 and the newest Pulsar client (3.3.0). At that point we started to see segfaults as the worker process was exiting. Here is the traceback that gets logged:
!!! uWSGI process 85 got Segmentation Fault !!! *** backtrace of 85 *** uwsgi(uwsgi_backtrace+0x2f) [0x556ab933fcaf] uwsgi(uwsgi_segfault+0x23) [0x556ab9340083] /lib/x86_64-linux-gnu/libc.so.6(+0x3bfd0) [0x7f845c457fd0] /usr/local/lib/python3.11/site-packages/pulsar_client.libs/libpulsar-2dce0d9f.so(+0x2b387a) [0x7f845770087a] /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so(+0xaf752) [0x7f8457f99752] /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so(+0x5399e) [0x7f8457f3d99e] /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so(+0x545ef) [0x7f8457f3e5ef] /usr/local/lib/libpython3.11.so.1.0(+0x228f8a) [0x7f845c825f8a] /usr/local/lib/libpython3.11.so.1.0(+0x1f4467) [0x7f845c7f1467] /usr/local/lib/libpython3.11.so.1.0(+0x1f4030) [0x7f845c7f1030] /usr/local/lib/libpython3.11.so.1.0(+0x29941f) [0x7f845c89641f] /usr/local/lib/libpython3.11.so.1.0(+0x2993c5) [0x7f845c8963c5] /usr/local/lib/libpython3.11.so.1.0(+0x1ed587) [0x7f845c7ea587] /usr/local/lib/libpython3.11.so.1.0(_PyModule_ClearDict+0x115) [0x7f845c82bfe5] /usr/local/lib/libpython3.11.so.1.0(+0x278bb6) [0x7f845c875bb6] /usr/local/lib/libpython3.11.so.1.0(Py_FinalizeEx+0x12b) [0x7f845c8616ab] uwsgi(uwsgi_plugins_atexit+0x71) [0x556ab933e121] /lib/x86_64-linux-gnu/libc.so.6(+0x3e4dd) [0x7f845c45a4dd] /lib/x86_64-linux-gnu/libc.so.6(+0x3e61a) [0x7f845c45a61a] uwsgi(+0x3df6f) [0x556ab92f5f6f] uwsgi(simple_goodbye_cruel_world+0x59) [0x556ab933f619] uwsgi(+0x87658) [0x556ab933f658] uwsgi(uwsgi_close_request+0x4d9) [0x556ab92f6bf9] uwsgi(simple_loop_run+0xc0) [0x556ab933bbc0] uwsgi(simple_loop+0x7b) [0x556ab933bc8b] uwsgi(uwsgi_ignition+0x279) [0x556ab93403b9] uwsgi(uwsgi_worker_run+0x25b) [0x556ab93449db] uwsgi(uwsgi_run+0x43a) [0x556ab9344f2a] uwsgi(+0x3aede) [0x556ab92f2ede] /lib/x86_64-linux-gnu/libc.so.6(+0x271ca) [0x7f845c4431ca] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f845c443285] uwsgi(_start+0x21) [0x556ab92f2f01] *** end of backtrace ***
The resulting core files are filling up the server disks. We could ulimit them but we'd prefer to avoid that.
Is there a simple (hand-to-hand) way to reproduce? Could it be reproduced against 3.2.0?
We did try 3.2.0 with the same results. We see the issue frequently, but I'm not sure how to reproduce on command, since UWSGI is initiating the process shutdown.
It worked fine with Python 3.7 and the 3.0 client.
- How do you run uwsgi? Could you share an example config?
- How do you initiate and close a pulsar client? Could you share a code example?
Sure, here is most of the uwsgi.ini file we use:
[uwsgi]
http = :$(PORT)
add-header = Connection: close
master = true
processes = $(PROCESSES)
die-on-term = true
memory-report = true
vacuum=True
log-x-forwarded-for=True
log-format = %(addr) - %(user) [%(ltime)] "%(method) %(uri) %(proto)" %(status) %(size) ("%(referer)" "%(uagent)" | %(msecs)
listen = 128
threads = true
max-requests = 750 ; Restart workers after this many requests
max-worker-lifetime = 1800 ; Restart workers after this many seconds
reload-on-rss = 256 ; Restart workers after this much resident memory
worker-reload-mercy = 60 ; How long to wait before forcefully killing workers
uwsgi is run using a very basic command in a Docker container:
$ uwsgi uwsgi.ini
The issue we're seeing is with the max-requests or max-worker-lifetime handling, when these limits are hit and uwsgi kills the worker process, it will often segfault during the exit. Relaunching process is fine, so the main issue is the file system filling up with core files.
I reproduced the bug.
It's similar to unbit/uwsgi#1969.
I see two possible fast solutions:
- Mv from
dict
state toobject
(code). Now segfaults gone:...The work of process 32 is done. Seeya! worker 2 killed successfully (pid: 32) Respawned uWSGI worker 2 (new pid: 38)
- Mv from
uwsgi
togunicorn
(code, conf). Gunicorn is fast and simple server. In this case, it gives more control over the life cycle of workers with hooks.
Thank you! We'll look into the workarounds, both seem plausible.
@gromsterus Many thanks for the investigation and suggested workarounds. Just wanted to mention the following in case it's helpful:
- this appears to be an issue starting in Python 3.9
- here's a piece from one of the uwsgi core dump files that might be helpful:
Thread 2 (Thread 0x7f479172e7c0 (LWP 55)):
#0 0x00007f4793a040c9 in _int_free (av=0x7f4793b40c60 <main_arena>, p=<optimized out>, have_lock=<optimized out>, have_lock@entry=0) at ./malloc/malloc.c:4698
#1 0x00007f4793a06d9f in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3385
#2 0x00007f477dd7d869 in ?? () from /lib/x86_64-linux-gnu/libgdal.so.32
#3 0x00007f477dd816ab in std::map<char, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<char>, std::allocator<std::pair<char const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::~map() () from /lib/x86_64-linux-gnu/libgdal.so.32
#4 0x00007f47939ac4dd in __run_exit_handlers (status=1, listp=0x7f4793b40820 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:116
#5 0x00007f47939ac61a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:146
#6 0x0000561a24866f6f in uwsgi_exit ()
#7 0x0000561a248b10b2 in uwsgi_segfault ()
#8 <signal handler called>
#9 0x0000561a289e5730 in ?? ()
#10 0x00007f478ed8c87d in pulsar::ProducerImpl::~ProducerImpl() () from /usr/local/lib/python3.11/site-packages/pulsar_client.libs/libpulsar-2dce0d9f.so
#11 0x00007f478f625752 in pybind11::class_<pulsar::Producer>::dealloc(pybind11::detail::value_and_holder&) () from /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so
#12 0x00007f478f5c999e in pybind11::detail::clear_instance(_object*) () from /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so
#13 0x00007f478f5ca5ef in pybind11_object_dealloc () from /usr/local/lib/python3.11/site-packages/_pulsar.cpython-311-x86_64-linux-gnu.so
#14 0x00007f4793d77f8a in _PyType_SUPPORTS_WEAKREFS (type=0x561a280e3bc0) at ./Include/internal/pycore_object.h:242
#15 PyObject_ClearWeakRefs (object=0x561a28244760) at Objects/weakrefobject.c:948
#16 0x00007f479400ad00 in _Py_TrueStruct () from /usr/local/lib/libpython3.11.so.1.0
#17 0x0000000000000000 in ?? ()