kevinkreiser/prime_server

server becomes unresponsive

kevinkreiser opened this issue · 15 comments

the lapse project, which makes use of prime_server as an image file server, has a button for cycling through pictures. if you highlight the button in a web browser and just hold down the enter key, you'll make a lot of requests in short succession. after a short time of this, the server stops noticing requests and stops answering them altogether.

this has also been confirmed by @slvlirnoff via issue valhalla/valhalla#798

Hi Kevin,

Is there an update on this issue? I was stress testing the Valhalla isochrone endpoint in AWS on a c5d.4xlarge; it hummed along at ~100 rps but became unresponsive at around 150 rps. We can see the server still accepting requests, but no responses are returned. Any idea what is going on? I tried to debug by logging the message queue size for each worker, but didn't have any success.

Thanks

  • V

i have been thinking about this recently but haven't had the time to look into it. the first step is to figure out at what stage of the pipeline messages stop coming through. once we know that, we can write a test using just those affected sockets, pumping simple messages directly to each other, to reproduce the behavior. the default behavior in zmq is to set the high water mark for a socket to unlimited, and off the top of my head i don't remember configuring anything other than that. anyway, the main goal here will be to make a minimal test.

I don't see the high water mark configured anywhere in the code, so I assume the default is used; according to the zmq documentation the default limit is 1000. Also from the documentation:

If this limit has been reached the socket shall enter an exceptional state and depending on the socket type, ØMQ shall take appropriate action such as blocking or dropping sent messages.

I'll try to reproduce the issue as you mentioned, but it may take some time since this is my first experience with zmq.

interesting! that changed from version 2, where the default was 0, which meant unlimited! we do set them in the code to 0 though. if i'm not mistaken, all of the various sockets in prime_server have their HWM set to 0:

client.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled));
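for reference, disabling the high water mark in both directions looks roughly like this (a sketch, not the exact source; i'm assuming the receive side is handled the same way):

int disabled = 0;
// in zmq 3.x/4.x a high water mark of 0 means no limit at all
client.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled));
client.setsockopt(ZMQ_RCVHWM, &disabled, sizeof(disabled));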

ahh yes, i see that now. Interestingly, the server becomes unresponsive only after a heavy load test; it is fine during the test itself (1,300 rps). I was able to replicate the issue only with valhalla, not with prime_serverd. I used both gatling and locust to make sure it wasn't an issue with either tool. That makes me think there might be an issue in cleaning up tcp connections.

some sys settings:

/proc/sys/net/ipv4/tcp_keepalive_time >> 60

/etc/security/limits.conf

*               soft    core            65535
*               hard    core            100000
*               hard    nofile          65535
*               soft    nofile          65535

the default for prime_serverd is to run in a mode where it makes lots of requests from a single client in a batch. in this mode it uses an ipc socket, not tcp. you can run it in tcp/http server mode, though, and then hit it with your program of choice by just passing it a single argument (the address to bind on). i was able to reproduce this quite a long time ago by writing a web page that scrolled through photos, where the webpage was served by prime_filed

with regard to cleaning up tcp connections: i remember the docs saying all you have to do to close the connection is to send an empty message, which we do here if you aren't using a keep-alive connection: https://github.com/kevinkreiser/prime_server/blob/master/src/prime_server.cpp#L266
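for reference, that close amounts to a two frame message, the peer's identity followed by an empty frame; roughly (a sketch with assumed variable names, mirroring the linked line):

// on a ZMQ_STREAM socket a connection is closed by addressing the peer with its
// identity frame and then sending a zero length frame
client.send(identity, ZMQ_SNDMORE | ZMQ_DONTWAIT);           // frame 1: which peer
client.send(static_cast<const void*>(""), 0, ZMQ_DONTWAIT);  // frame 2: empty frame closes the connection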

so one interesting question would be: does this still happen if you set your hammer program to not use keep-alive? perhaps it's the server's willingness to keep connections alive forever? do note that when the client chooses to disconnect, zmq should clean that up, so far as i'm aware.

Got around to testing the keep-alive headers during stress testing. Timeouts dropped to zero (they were around 15% before), but I still had the same issue of the server becoming unresponsive. I can't explain it at this time, but oddly enough, when placing the server behind a load balancer, the issue was resolved. It still needs more investigation on my part. Will keep you posted.

this was my experience as well. something about how the load balancer pools (or doesn't pool) connections, or performs rate limiting, might mask the issue. since i've always deployed this in production behind such systems i've never had trouble, but yeah, i would really like to find and squash this bug! thanks so much for your hard work in looking into the issue!

Hey @kevinkreiser, @val-gh, do you have any update on this issue? It still keeps causing me trouble and I had to set up a mitigation that restarts the service as soon as it becomes unresponsive.
Could you describe your load balancer setup? I still have this problem when the server is behind an nginx proxy_pass.

@slvlirnoff in production scenarios people will usually run this behind a rate limiter (to prevent abuse). In my experience that is all that is needed to avoid this issue. Sadly I haven't had time to try to reproduce it on my own. Perhaps this December I will find the time :/ in the meantime, many apologies

I've been able to reproduce this pretty reliably using a load test script and HTTP/1.0 (so no keepalive). After adding some debug statements, this is my understanding of what happens:

  • In line 260 of prime_server.cpp, the response is sent to the client.
  • Sometimes, the client manages to close the connection immediately afterwards.
  • This leads to the call in line 265 returning 0.
  • The call in line 266 succeeds again, but from ZMQ's point of view it's the wrong kind of message (not an identity frame).

Per the ZMQ documentation, the socket is now in a state where it will never send data to any client again, because it was given a message without ZMQ_SNDMORE where an identity frame was expected: "Also, please note that omitting the ZMQ_SNDMORE flag will prevent sending further data (from any client) on the same socket." (ZMQ 4.1 socket docs)
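To illustrate (a minimal sketch with assumed names, not the actual prime_server code, using the same socket wrapper calls as the patch below): every outgoing message on a ZMQ_STREAM socket is two frames, the peer's identity sent with ZMQ_SNDMORE and then the payload, so the payload must only go out if the identity frame was actually accepted:

// guarded two-frame send on a ZMQ_STREAM socket: if the identity frame cannot be
// sent (e.g. the peer has already disconnected), the payload must not be sent on
// its own; zmq would treat it as the start of a new message, it is not a valid
// identity frame, and (at least in 4.1) the socket then stops sending to every client.
bool checked_send(zmq::socket_t& client, const std::string& identity, const std::string& payload) {
  if(!client.send(identity, ZMQ_SNDMORE | ZMQ_DONTWAIT))
    return false;
  return client.send(payload, ZMQ_DONTWAIT);
}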

While I observed this using HTTP/1.0, it could also happen anytime a client closes a connection at an "unfortunate" time. To prevent this, I inserted return value checks whenever data is sent to a client (see patch below). After that, I was no longer able to reproduce the problem. Perhaps there should be more checks for errors in other places, but the most critical part is the one using a ZMQ_STREAM socket where client behavior is unpredictable.

Maybe ZMQ 4.2 includes better handling of this situation. At least the note about "[no] further data (from any client)" is gone in the 4.2 docs. However, Ubuntu bionic (18.04 LTS) still uses 4.1, and the additional error checks can't hurt (even if newer ZMQ versions might be more robust).

diff --git a/src/prime_server.cpp b/src/prime_server.cpp
index bc3ac4ca..c3f1798f 100644
--- a/src/prime_server.cpp
+++ b/src/prime_server.cpp
@@ -202,11 +202,13 @@ namespace prime_server {
       if(session != sessions.end()) {
         //proxy any whole bits onward, or if that failed (malformed or large request) close the session
         if(!enqueue(session->first, body, session->second)) {
-          client.send(session->first, ZMQ_SNDMORE | ZMQ_DONTWAIT);
-          client.send(static_cast<const void*>(""), 0, ZMQ_DONTWAIT);
-          for(auto id_time_stamp : session->second.enqueued)
-            interrupt.send(static_cast<void*>(&id_time_stamp), sizeof(id_time_stamp), ZMQ_DONTWAIT);
-          sessions.erase(session);
+          bool success = client.send(session->first, ZMQ_SNDMORE | ZMQ_DONTWAIT);
+          if(success) success = client.send(static_cast<const void*>(""), 0, ZMQ_DONTWAIT);
+          if(success) {
+            for(auto id_time_stamp : session->second.enqueued)
+              interrupt.send(static_cast<void*>(&id_time_stamp), sizeof(id_time_stamp), ZMQ_DONTWAIT);
+            sessions.erase(session);
+          }
         }
       }
       else
@@ -222,8 +224,8 @@ namespace prime_server {
       parsed_requests = request.from_stream(static_cast<const char*>(message.data()), message.size(), max_request_size);
     }//something went wrong, either in parsing or size limitation
     catch(const typename request_container_t::request_exception_t& e) {
-      client.send(requester, ZMQ_SNDMORE | ZMQ_DONTWAIT);
-      client.send(e.response, ZMQ_DONTWAIT);
+      bool success = client.send(requester, ZMQ_SNDMORE | ZMQ_DONTWAIT);
+      if(success) client.send(e.response, ZMQ_DONTWAIT);
       if(log) {
         request.log(request_id);
         e.log(request_id);
@@ -256,18 +258,20 @@ namespace prime_server {
     if(request == requests.cend())
       return false;
     //reply to the client with the response or an error
-    client.send(request->second, ZMQ_SNDMORE | ZMQ_DONTWAIT);
-    client.send(response, ZMQ_DONTWAIT);
+    bool success = client.send(request->second, ZMQ_SNDMORE | ZMQ_DONTWAIT);
+    if(success) client.send(response, ZMQ_DONTWAIT);
     if(log)
       info.log(response.size());
     //cleanup, session may or may not be keep alive
     if(!info.keep_alive()){
-      this->client.send(request->second, ZMQ_DONTWAIT | ZMQ_SNDMORE);
-      this->client.send(static_cast<const void*>(""), 0, ZMQ_DONTWAIT);
-      auto session = sessions.find(request->second);
-      for(auto id_time_stamp : session->second.enqueued)
-        interrupt.send(static_cast<void*>(&id_time_stamp), sizeof(id_time_stamp), ZMQ_DONTWAIT);
-      sessions.erase(session);
+      bool success = this->client.send(request->second, ZMQ_DONTWAIT | ZMQ_SNDMORE);
+      if(success) success = this->client.send(static_cast<const void*>(""), 0, ZMQ_DONTWAIT);
+      if(success) {
+        auto session = sessions.find(request->second);
+        for(auto id_time_stamp : session->second.enqueued)
+          interrupt.send(static_cast<void*>(&id_time_stamp), sizeof(id_time_stamp), ZMQ_DONTWAIT);
+        sessions.erase(session);
+      }
     }
     requests.erase(request);
     return true;

[comment edited to get the patch formatted properly]

@andreasjunghans amazing work! i'll give this a test when i have some time and we can get it merged in!

@andreasjunghans thank you so much, gonna try this now!

Applying your patch and linking Valhalla to it made my server timeout issues disappear, thanks a lot!