ZeroMQ IPC fails after a while
Closed this issue · 5 comments
This has been a problem for a while. In a development environment, ZeroMQ works perfectly fine. In production, however, ZeroMQ requests start failing silently not long after a restart. This breaks vote rewards as well as all GET endpoints used by the website, rendering it almost completely useless. Several fixes have been attempted, but none have worked so far.
This issue occurs in these lines of code:
Server:
Lines 27 to 57 in 7bf697e
Client:
Lines 30 to 50 in 7bf697e
I suspected this was caused by too many open connections, but I am not sure that is the case, and as far as I can tell I close all connections. This is the output of an lsof command when this issue occurred in production:
Because this has been a long-running problem and because it is quite important for the functionality, I am turning this into an issue to keep track of the progress.
I have also asked this Stack Overflow question in hopes of a fix.
This seems to be an issue with the API, not ZeroMQ. I can still make ZeroMQ requests internally, but the API fails. I remember it occasionally failing after a while even before I created the website. With the large number of additional requests, it seems to happen much faster, though I am not sure why. I will continue investigating.
I have changed hypercorn to use 8 workers instead of 1 a few days ago and this seems to have helped this issue. The API has been without issue for multiple days now.
Sadly, this issue is not resolved. It is definitely a hypercorn issue. Increasing the number of workers only delays the point at which the API starts timing out. I am looking into solutions.
This may now be resolved. While rewriting this API in Rust, I believe I have found the root cause of this issue with the help of @y21.
The root cause was that ZeroMQ, in its default behaviour, can block the cleanup (drop) of its objects at the end of a function. So when my make_request function ends, everything up to that point has worked as expected, but the attempt to drop the variables is blocked indefinitely.
This means no error is raised, but the code freezes at a low level, which is insanely hard to trace.
It turns out this is default zmq behaviour, but thankfully there is a method to change it. So a simple one-line change fixes this:
socket.set_linger(0)
That's it. That is what I have tried to find for 8 months. Hopefully this actually fixes it. I will keep this issue open for a bit. If I close it, that was it.
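
For anyone hitting the same symptom from Python, here is a minimal sketch of how the same option might be set with pyzmq. It is not the code from this repository: the `make_request` below is a hypothetical stand-in, and the REQ socket type, the endpoint, and the `timeout_ms` parameter are assumptions made for illustration. As far as I understand it, the default linger period is infinite, so closing a socket (or terminating its context) can wait forever on messages that were queued but never delivered. Setting `LINGER` to 0 discards them instead.

```python
from typing import Optional

import zmq


def make_request(endpoint: str, payload: bytes, timeout_ms: int = 2000) -> Optional[bytes]:
    """Send one request and return the reply, or None if no reply arrives in time.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    context = zmq.Context.instance()
    socket = context.socket(zmq.REQ)
    # LINGER = 0: on close, drop any unsent messages immediately
    # instead of waiting (by default, indefinitely) for them to be delivered.
    socket.setsockopt(zmq.LINGER, 0)
    try:
        socket.connect(endpoint)
        socket.send(payload)
        # Wait a bounded time for the reply rather than blocking in recv().
        if socket.poll(timeout_ms, zmq.POLLIN):
            return socket.recv()
        return None
    finally:
        # With LINGER = 0 this returns promptly even if the request was never sent.
        socket.close()
```

Called against an endpoint where nothing is listening, this should return `None` after the timeout instead of freezing during cleanup. The poll timeout is a separate safeguard from the linger setting: linger only affects what happens when the socket is closed.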