PowerDNS/weakforced

parseTCPReplication error after restart of a SyncHost [BUG]

mgehlmann opened this issue · 9 comments

Describe the bug
We have a 2-host cluster, where the two nodes are configured as each other's Sibling+SyncHost.
Weakforced version: 2.4.0, built with builder.sh

The Sibling replication seems to work flawlessly. After restarting one of the nodes, however, in about half of the cases the synchronization fails after only a portion of the entries has been transferred, with this error message:
parseTCPReplication: error reading size, got 1 bytes, but was expecting 2 bytes

Restarting the same node again right afterwards sometimes leads to a full synchronization and sometimes to the same error (see logs below).
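The error message suggests the TCP replication stream is framed as a 2-byte size header followed by the message body, and that the reader saw a short read on that header. A minimal sketch of consuming such framing, in Lua with luasocket (only the 2-byte size field is implied by the error message; the big-endian byte order and the rest of the layout are assumptions, not taken from the wforce source):

```lua
-- Minimal sketch of reading one length-prefixed replication message.
-- Assumes Lua 5.3+ (string.unpack), luasocket, and a 2-byte
-- big-endian size header; the byte order is a guess.
local function read_message(conn)
  -- receive(n) loops internally until exactly n bytes arrive or the
  -- connection fails; on failure the bytes read so far come back as
  -- the third result, which is how "got 1 bytes, but was expecting
  -- 2 bytes" surfaces.
  local hdr, err, partial = conn:receive(2)
  if not hdr then
    return nil, ("error reading size, got %d bytes"):format(#(partial or ""))
  end
  local size = string.unpack(">I2", hdr)
  return conn:receive(size)
end
```

If the sending side resets the connection mid-header, only one of the two size bytes arrives, which would match the pairing of HOST_A's "Connection reset by peer" with HOST_B's parse error in the logs below.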

To Reproduce
Steps to reproduce the behavior:

  1. Configure a 2-node cluster with HOST_A as the Sibling and SyncHost of HOST_B, and vice versa (see the config sketch after this list).
  2. POST reports to either of the nodes.
  3. Restart one of the nodes.
  4. If the error does not occur, restart the same host again; the error occurs quite frequently.
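For reference, a wforce.conf sketch of what step 1 might look like on HOST_A (Lua; the addresses and passwords are placeholders, and the addSyncHost argument list in particular should be verified against the wforce.conf documentation for your release):

```lua
-- Hypothetical wforce.conf fragment for HOST_A; addresses and
-- passwords are placeholders, not taken from the report.
setKey("...")                          -- shared key for sibling replication (elided)
controlSocket("127.0.0.1:4004")
webserver("0.0.0.0:8085", "webpassword")
siblingListener("IP_HOST_A:4001")
addSibling("IP_HOST_B:4001")           -- replicate individual reports to HOST_B
-- Request a full DB sync from HOST_B at startup; the sync host posts to
-- the callback URL (HOST_A's own webserver) when the sync completes.
-- Exact signature per the wforce.conf docs; verify for your release.
addSyncHost("IP_HOST_B:8085", "webpassword", "http://IP_HOST_A:8085", "webpassword")
```

HOST_B would mirror this with the addresses swapped.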

Expected behavior
Data is fully synchronized every time.

OS (please complete the following information):

  • OS: Debian
  • Version: stretch

Additional context
Log output of HOST_A and HOST_B after restarting HOST_B.
HOST_A logs attempt 1 (no error):
Synchronizing DBs to: IP_HOST_B:4001, will notify on callback url: http://IP_HOST_B:8085/?command=syncDone
Synchronizing DBs to: IP_HOST_B:4001 was completed. Synced 3785 entries.
Synchronizing DBs callback to: http://IP_HOST_B:8085/?command=syncDone was successful

HOST_B logs attempt 1 (no error):
WforceWebserver launched on 0.0.0.0:8085
Accepting control connections on 127.0.0.1:4004
Launched UDP sibling replication listener on IP_HOST_B:4001
Launched TCP sibling replication listener on IP_HOST_B:4001
checkSyncHosts: uptime: 55935 returned from sync host: http://IP_HOST_A:8085/?command=stats
checkSyncHosts: Successful request to synchronize DBs with sync host: IP_HOST_A:8085
New TCP Replication connection from IP_HOST_A
Received 3785 Replication entries from IP_HOST_A

Log output of HOST_A and HOST_B; the second restart produces the error:

HOST_A logs attempt 2 (error occurs):
Synchronizing DBs to: IP_HOST_B:4001, will notify on callback url: http://IP_HOST_B:8085/?command=syncDone
Synchronizing DBs to: IP_HOST_B:4001 did not complete. [Network Error: Writing to a socket: Connection reset by peer]
Synchronizing DBs callback to: http://IP_HOST_B:8085/?command=syncDone was successful

HOST_B logs attempt 2 (error occurs):
Launched UDP sibling replication listener on IP_HOST_B:4001
Launched TCP sibling replication listener on IP_HOST_B:4001
checkSyncHosts: uptime: 56246 returned from sync host: http://IP_HOST_A:8085/?command=stats
checkSyncHosts: Successful request to synchronize DBs with sync host: IP_HOST_A:8085
New TCP Replication connection from IP_HOST_A
parseTCPReplication: error reading size, got 1 bytes, but was expecting 2 bytes
Received 1017 Replication entries from IP_HOST_A

A third restart worked again without error; the output is the same as for attempt 1.

Thanks for the report, I will look into it. I might need to ask for more information; I'll let you know.

So can I ask, does this issue only happen when there are active report/allow commands against host A, or can you also reproduce it when there are no active report/allow commands against host A? This will help me track down where the potential issue is.

Once #343 passes CI, can you try building from that and see if it fixes your issue? I haven't reproduced it, but looking at the code and your logs, I think it will.

Great, thanks! I'll try and let you know.

You can ignore the fact that the Docker image CI currently fails - that's not relevant.

I tested it and the syncing works perfectly now! I'll close the issue.
Thanks for the quick fix!

JFYI this fix will be in 2.4.1 (due imminently).

Also, a piece of advice: depending on how many entries you have in your stats DBs, you might want to use sharded stats DBs. Not only does that reduce lock contention when reading/writing, but it also reduces the amount of time that each DB is locked when syncing, which can prevent updates from being dropped. If you don't have very big stats DBs (in your example I saw 56K, which is pretty small), then don't worry about it.
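For illustration, sharding is chosen when the stats DB is created in wforce.conf. A hedged Lua sketch follows; the newShardedStringStatsDB name and argument order are from memory of the wforce docs and should be double-checked, and the field map is just an example:

```lua
-- Hedged sketch: a sharded stats DB with a 1-hour window split into
-- six 10-minute buckets and 20 shards (the shard count mentioned in
-- the reply below). Verify the function name and argument order
-- against the wforce.conf docs for your release.
local field_map = { countLogins = "int", diffPasswords = "hll" }
newShardedStringStatsDB("OneHourDB", 600, 6, field_map, 20)
```

The idea, per the advice above, is that each shard is locked independently, so both normal read/write traffic and the sync hold any one lock for less time.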

Great, thanks! 56K was actually only the number of entries until the error occurred; there were a few million in total. (At the moment I'm just simulating requests to test it.) Currently I'm at 20 shards. Besides the increased memory usage, would you expect issues when increasing the number of shards much further?