PikaLabs/floyd

Down or slow nodes cause poor availability

Closed this issue · 2 comments

Down or slow nodes cause huge drops in performance, and the problem becomes more serious as time goes by.

The PeerThread for an abnormal node needs more and more time to get entries, because the number of entries it has to fetch grows over time.

Meanwhile, it holds the global context mutex, which prevents any other PeerThread from advancing the commit index.

So performance gets worse and worse from the user's perspective, and the system finally becomes unavailable.

There are two choices here:

  1. Add a sleep after the PeerThread fails to send.
  2. Limit the number of entries that the PeerThread gets from the log each time. This choice may hurt performance, though, since the larger the batch, the better floyd performs. Right now the default entries size is 10M, and we shouldn't reduce it.

So I suggest adding a sleep after each failed AppendEntry operation.
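
A minimal sketch of what such a sleep could look like, not taken from floyd's code. `BackoffDelay` and `SendWithBackoff` are hypothetical names, and the capped exponential growth of the delay is my own assumption; the suggestion above only asks for some sleep after a failed send.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of floyd: how long a peer thread should pause
// after `consecutive_failures` failed sends. The delay doubles with each
// failure and is capped at `max_delay`.
std::chrono::milliseconds BackoffDelay(int consecutive_failures,
                                       std::chrono::milliseconds base,
                                       std::chrono::milliseconds max_delay) {
  if (consecutive_failures <= 0) return std::chrono::milliseconds(0);
  // Bound the shift so the multiplication cannot overflow.
  int shift = std::min(consecutive_failures - 1, 20);
  std::chrono::milliseconds delay(base.count() * (1LL << shift));
  return std::min(delay, max_delay);
}

// Hypothetical retry loop: calls `send_once` (a stand-in for one AppendEntry
// RPC) until it succeeds or `max_attempts` is reached, sleeping after every
// failed attempt. Returns the number of attempts made.
int SendWithBackoff(const std::function<bool()>& send_once, int max_attempts,
                    std::chrono::milliseconds base,
                    std::chrono::milliseconds max_delay) {
  int failures = 0;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (send_once()) return attempt;
    ++failures;
    std::this_thread::sleep_for(BackoffDelay(failures, base, max_delay));
  }
  return max_attempts;
}
```

The sleep only reduces how often the failing PeerThread competes for the global mutex; it does not shorten the time the mutex is held while entries are fetched.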

How about just moving the GetEntry loop out of the protection of the global context mutex, as shown in 01a6979 on the issue29 branch?

That way the poor performance is only a problem for the slow node and has nothing to do with the availability of the whole system.
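
The idea can be sketched roughly as follows. This is not the code from that commit: `Context`, `Log`, `Batch`, and `BuildAppendBatch` are hypothetical stand-ins, and the sketch assumes the log has its own lock so that reading entries does not need the global mutex.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the shared state guarded by the global mutex.
struct Context {
  std::mutex global_mu;
  uint64_t current_term = 0;
  uint64_t commit_index = 0;
};

// Hypothetical log with its own lock; entries[i] holds log index i + 1.
struct Log {
  std::mutex mu;
  std::vector<std::string> entries;

  bool GetEntry(uint64_t index, std::string* out) {
    std::lock_guard<std::mutex> guard(mu);
    if (index == 0 || index > entries.size()) return false;
    *out = entries[index - 1];
    return true;
  }
};

struct Batch {
  uint64_t term = 0;
  uint64_t commit_index = 0;
  std::vector<std::string> entries;
};

// Builds one AppendEntry batch starting at `next_index`. The global mutex is
// held only long enough to snapshot the shared fields; the potentially slow
// GetEntry loop runs without it.
Batch BuildAppendBatch(Context* ctx, Log* log, uint64_t next_index,
                       std::size_t max_bytes) {
  Batch batch;
  {
    std::lock_guard<std::mutex> guard(ctx->global_mu);
    batch.term = ctx->current_term;
    batch.commit_index = ctx->commit_index;
  }  // global mutex released here

  std::size_t bytes = 0;
  std::string entry;
  for (uint64_t i = next_index; log->GetEntry(i, &entry); ++i) {
    // Always take at least one entry, then stop at the size limit.
    if (!batch.entries.empty() && bytes + entry.size() > max_bytes) break;
    bytes += entry.size();
    batch.entries.push_back(entry);
  }
  return batch;
}
```

One consequence of this design is that the snapshot may be stale by the time the batch is sent, so a caller would presumably need to re-check the term under the mutex before acting on the reply.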