reddit/baseplate.py

Server-side timeout

Closed this issue · 4 comments

We should be able to set a maximum length of request. If the greenlet takes longer, it gets killed. This means we don't let greenlets choke up the server for too long.

Initial pass looks something like this:

--- a/baseplate/server/thrift.py
+++ b/baseplate/server/thrift.py
@@ -1,3 +1,4 @@
+import functools
 import socket
 
 from typing import Any
@@ -5,6 +6,8 @@ from typing import Dict
 from typing import Tuple
 from typing import Union
 
+import gevent
+
 from gevent.pool import Pool
 from gevent.server import StreamServer
 from thrift.protocol.THeaderProtocol import THeaderProtocolFactory
@@ -59,12 +62,14 @@ def make_server(server_config: Dict[str, str], listener: socket.socket, app: Any
         server_config,
         {
             "max_concurrency": config.Integer,
+            "response_timeout": config.Timespan,
             "stop_timeout": config.Optional(config.Integer, default=0),
         },
     )
 
     pool = Pool(size=cfg.max_concurrency)
-    server = GeventServer(processor=app, listener=listener, spawn=pool)
+    spawn = functools.partial(pool.spawn, gevent.with_timeout, cfg.response_timeout.total_seconds())
+    server = GeventServer(processor=app, listener=listener, spawn=spawn)
     server.stop_timeout = cfg.stop_timeout
 
     runtime_monitor.start(server_config, app, pool)

Problems:

  • The asymetry of the timeout option types is weird.
  • The stack trace is horrific. Maybe a custom timeout exception would help?

Related question: Once I define this and I start seeing this timeout occur, how will I be able to track down the cause? Would there be any way I can log and include pending http/thrift/cass requests?

That implementation is bad. It's timing out whole connections not individual requests. Will keep looking into this.

@curioussavage: Absolutely will make sure the telemetry is there to diagnose the timeout. At the minimum, the timeout exception is dumping stack traces for the current and other greenlets so you can see what we were waiting for. More to come.

This is the basics of a working implementation: master...spladug:request-timeout2

but there are two big questions:

  • we should be able to configure global (all endpoint) and endpoint-specific timeouts in the config file (timeouts might vary in different configurations of an app, like in different pools with different types of clients). how do we get that information into the right part of the framework?
  • this necessarily uses gevent directly where we haven't done so before outside the server code. is that ok? is there any way to avoid that?