lni/dragonboat

How do we debug why "timeout" issues are happening

namanchikara opened this issue · 1 comment

Hi @lni, first of all, thanks for contributing Dragonboat to the OSS world! We're currently using Dragonboat in an application that needs to be highly scalable. We have 5 nodes (4-core CPU, 16 GB RAM each), and we're using BadgerDB for the state machine.

To make the system fail-safe and recoverable, we're testing it under different failure scenarios. In one scenario, we bring one of the 5 nodes down while transactions are flowing in (around 2k TPS), wait for a minute or so, and then bring that node back up. If my understanding is correct, the node is supposed to recover by identifying who the leader is and how far behind the leader it is. If no snapshot has been created so far, the leader will send the logs to this node; otherwise it will send the snapshot.
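To make that expectation concrete, here is a minimal sketch of the catch-up decision in a generic Raft-style leader. It is a toy model under stated assumptions, not Dragonboat's actual code: `leaderLog`, `catchUpAction`, and the index values are all hypothetical. The numbers assume one log entry per transaction, so a minute of downtime at roughly 2k TPS leaves the node about 120,000 entries behind.

```go
package main

import "fmt"

// leaderLog is a toy model of the leader's log after compaction (hypothetical type).
// firstIndex is the oldest entry still kept in the log; lastIndex is the newest.
type leaderLog struct {
	firstIndex uint64
	lastIndex  uint64
}

// catchUpAction reports how a leader in this simplified model would bring a
// follower up to date, given the next index the follower needs.
func catchUpAction(l leaderLog, followerNext uint64) string {
	switch {
	case followerNext > l.lastIndex:
		// The follower already has everything the leader has.
		return "up-to-date"
	case followerNext < l.firstIndex:
		// The entries the follower needs were compacted away after a snapshot.
		return "snapshot"
	default:
		// The needed entries are still in the log and can be replicated directly.
		return "log-entries"
	}
}

func main() {
	// Assumed values: the leader kept entries 100000..220000 after compaction.
	l := leaderLog{firstIndex: 100000, lastIndex: 220000}
	fmt.Println(catchUpAction(l, 150000)) // behind, entries still available
	fmt.Println(catchUpAction(l, 50000))  // needed entries already compacted
	fmt.Println(catchUpAction(l, 220001)) // nothing to send
}
```

In this model, which path the restarted node takes depends on whether the entries it missed are still in the leader's log, so the snapshot and compaction settings in use are worth noting when reporting the problem.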

In this scenario, the application logs of the node we brought back up show this message:

error timeout: shard is not ready

What's more interesting is if we:

  1. Start the run (2k TPS)
  2. Bring down one node
  3. Stop the transactions after a while
  4. Bring the node back up
  5. Start the transactions after the node is back up

Then it's able to recover and process the transactions.

The only difference is step 3: if we don't stop the transactions before bringing the node back up (which would be the production use case), we see the timeout issues. I hope this conveys what we're trying to do, and we'd appreciate any pointers on how to debug it further.

From the logs, the error also seems to come from the SyncPropose method. We have info-level logging enabled for Dragonboat and only error-level logging for our application and BadgerDB. Please let me know if you need any more info from our end.
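One way to narrow this down on the application side is to count how many proposal attempts fail with the not-ready error before one succeeds. The sketch below is only an illustration under assumptions: `errNotReady` is a hypothetical stand-in for the "shard is not ready" error, and `proposeFunc` is a hypothetical wrapper around whatever call submits the proposal (for example, one that calls SyncPropose with the given context). It uses only the Go standard library.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical stand-in for the "shard is not ready" error.
var errNotReady = errors.New("shard is not ready")

// proposeFunc is a hypothetical wrapper around the call that submits a proposal.
type proposeFunc func(ctx context.Context) error

// proposeWithRetry retries propose while it reports errNotReady. Each attempt
// gets its own timeout, and the pause between attempts doubles every time.
// It returns the number of attempts made and the final error.
func proposeWithRetry(parent context.Context, propose proposeFunc,
	maxAttempts int, perAttempt, backoff time.Duration) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(parent, perAttempt)
		err = propose(ctx)
		cancel()
		// Stop on success or on any error other than "not ready".
		if err == nil || !errors.Is(err, errNotReady) {
			return attempt, err
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-parent.Done():
			return attempt, parent.Err()
		}
	}
	return maxAttempts, err
}

// demo simulates a node that rejects the first two proposals while recovering.
func demo() (int, error) {
	calls := 0
	propose := func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	}
	return proposeWithRetry(context.Background(), propose, 5,
		100*time.Millisecond, time.Millisecond)
}

func main() {
	attempts, err := demo()
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}
```

Logging the attempt count and elapsed time per request would show whether the error is transient (the node becomes usable after a few retries) or persistent for as long as traffic keeps flowing.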

lni commented

If my understanding is correct, this node is supposed to recover by identifying who the leader is and how far behind it is from the leader, if no snapshot(s) has/have been created so far then the leader will send the logs to this node otherwise it will send the snapshot.

You are correct.

error timeout: shard is not ready

Did that message go away after a while? My understanding is that you were trying to use the recovered node while it was still in the process of being recovered (not ready yet).

You may want to check why recovery is taking longer than expected. Probably slow state machine recovery?
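A simple way to check this is to time the recovery steps of the state machine and log the result at startup. The sketch below assumes nothing about the actual state machine: `timeStep` is a hypothetical helper, and the sleep stands in for whatever code opens BadgerDB or applies a snapshot.

```go
package main

import (
	"fmt"
	"time"
)

// timeStep runs fn, prints how long it took, and returns the elapsed time
// together with fn's error. It is a hypothetical helper for illustration.
func timeStep(name string, fn func() error) (time.Duration, error) {
	start := time.Now()
	err := fn()
	elapsed := time.Since(start)
	fmt.Printf("%s took %s (err=%v)\n", name, elapsed, err)
	return elapsed, err
}

// demoRecovery times a simulated recovery step. The sleep is a placeholder
// for real work such as opening the on-disk store or restoring a snapshot.
func demoRecovery() bool {
	elapsed, err := timeStep("state machine recovery", func() error {
		time.Sleep(20 * time.Millisecond)
		return nil
	})
	return err == nil && elapsed >= 20*time.Millisecond
}

func main() {
	demoRecovery()
}
```

Comparing these timings between the run where transactions are paused and the run where they keep flowing should show whether the state machine itself is the slow part.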