PowerDNS/weakforced

[BUG] Unavailable TCP Siblings cause allow/report commands to wait for several minutes

neilcook opened this issue · 2 comments

Describe the bug
Currently both UDP and TCP replication run in the worker thread pool. For UDP this was not really a problem, because sending a UDP message is fast and does not delay the worker thread. For TCP, however, if a sibling is unavailable, the reconnect logic makes the worker thread wait until the connect either succeeds or times out. This happens for every replication operation triggered by the command. In the meantime the command (typically report, but it could be allow) is blocked until all the messages are sent or all the connects time out. This is unacceptable. Replication should happen in a separate thread pool so that it does not affect allow/report latency.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a wforce server with a TCP sibling that does not exist (addSibling("127.0.0.1:2929:tcp"))
  2. Enable replication on the stats DB
  3. Issue a report command over the REST API
  4. Watch the report command hang for 2 minutes while the TCP reconnect logic waits for its connects to time out.

Expected behavior

  1. A separate thread pool would mean that allow/report commands don't have to wait for replication to occur.
  2. However, even if replication happens in a separate thread pool, one (or two, or however many) invalid or dead siblings should not block replication to live siblings. Perhaps there should be a dedicated thread for each sibling, with a separate queue of replication requests feeding each thread. That way, even if a dead sibling's replication queue fills up, the live siblings still get their replication messages.
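
The per-sibling queue idea above can be sketched roughly as follows. This is only an illustration in Python, not weakforced code (which is C++) and not necessarily what the eventual fix does. The names `SiblingReplicator`, `replicate`, `max_queued` and the `send` callback are all hypothetical, and dropping messages when a queue is full is an assumed policy.

```python
import queue
import threading
import time


class SiblingReplicator:
    """One bounded queue plus one dedicated sender thread per sibling.

    Hypothetical sketch: `send` stands in for whatever actually delivers a
    replication message (and may block for a long time on a dead sibling).
    """

    def __init__(self, name, send, max_queued=1000):
        self.name = name
        self._send = send
        self._queue = queue.Queue(maxsize=max_queued)
        self.dropped = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def enqueue(self, msg):
        """Called from the worker thread; never blocks."""
        try:
            self._queue.put_nowait(msg)
            return True
        except queue.Full:
            # Assumed policy: drop the message rather than stall the caller.
            self.dropped += 1
            return False

    def drain(self):
        """Block until every queued message has been handled."""
        self._queue.join()

    def _run(self):
        while True:
            msg = self._queue.get()
            try:
                self._send(msg)  # a slow connect only stalls this sibling
            except OSError:
                pass  # failed delivery affects this sibling only
            finally:
                self._queue.task_done()


def replicate(siblings, msg):
    """Hand the message to every sibling's queue and return immediately."""
    for sibling in siblings:
        sibling.enqueue(msg)


if __name__ == "__main__":
    def dead_send(msg):
        time.sleep(0.5)  # simulate a connect that eventually times out
        raise OSError("connect timed out")

    live = SiblingReplicator("live", print)
    dead = SiblingReplicator("dead", dead_send, max_queued=2)
    for i in range(5):
        replicate([dead, live], f"report {i}")
    live.drain()
    print("dropped for dead sibling:", dead.dropped)
```

With this shape the allow/report path only pays for a non-blocking enqueue, and a full queue for a dead sibling costs dropped messages for that sibling alone.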

Fixed in #285

Fixed in 2.2