etcd-io/raft

Questions about log probe

kikimo opened this issue · 3 comments

kikimo commented

Look at the following piece of code. The highlighted check makes raft send up to maxMsgSize bytes of entries even when it is probing a follower, i.e. when pr.State == tracker.StateProbe:

raft/raft.go

Lines 565 to 585 in 3e6cb62

func (r *raft) maybeSendAppend(to uint64, sendIfEmpty bool) bool {
	pr := r.prs.Progress[to]
	if pr.IsPaused() {
		return false
	}
	lastIndex, nextIndex := pr.Next-1, pr.Next
	lastTerm, errt := r.raftLog.term(lastIndex)
	var ents []pb.Entry
	var erre error
	// In a throttled StateReplicate only send empty MsgApp, to ensure progress.
	// Otherwise, if we had a full Inflights and all inflight messages were in
	// fact dropped, replication to that follower would stall. Instead, an empty
	// MsgApp will eventually reach the follower (heartbeats responses prompt the
	// leader to send an append), allowing it to be acked or rejected, both of
	// which will clear out Inflights.
	if pr.State != tracker.StateReplicate || !pr.Inflights.Full() {
		ents, erre = r.raftLog.entries(nextIndex, r.maxMsgSize)
	}

I have two questions:

  1. Is this the expected behaviour (sending up to maxMsgSize bytes of entries while raft is in the probe state)?
  2. If it is expected, why don't we send just one entry, or an empty message, when pr.State == tracker.StateProbe? Since in the probe state an append message is very likely to be rejected, sending just one entry or none might speed up the probing process.
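For illustration, the alternative suggested in question 2 could look roughly like the following standalone sketch. Everything here (`StateType`, `entryLimit`, the constants) is a hypothetical simplification for this discussion, not the actual etcd-io/raft API:

```go
package main

import "fmt"

// StateType is a simplified stand-in for the tracker state constants
// (hypothetical, not the real tracker package).
type StateType int

const (
	StateProbe StateType = iota
	StateReplicate
	StateSnapshot
)

// entryLimit returns how many entries the leader would attach to a
// MsgApp under the proposed strategy: at most one entry while probing,
// the usual full batch while replicating.
func entryLimit(state StateType, maxEntries int) int {
	if state == StateProbe {
		// A probe is likely to be rejected, so sending a single
		// entry wastes less bandwidth per roundtrip.
		return 1
	}
	return maxEntries
}

func main() {
	fmt.Println(entryLimit(StateProbe, 128))     // proposed: probe with one entry
	fmt.Println(entryLimit(StateReplicate, 128)) // steady state: full batch
}
```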

@ahrtr @pavelkalinnikov

pav-kv commented
  1. Is this the expected behaviour (sending up to maxMsgSize bytes of entries while raft is in the probe state)?

Yes. This behaviour is "correct" either way. But there are options to save some bandwidth, as you point out, depending on assumptions. I think the current strategy optimistically assumes that the first probe will succeed, and in this case we will save one roundtrip of latency.

I can think of 2 cases when this probing happens:

  1. There is a stale follower who went offline for a while, and since then a few leadership changes and log suffix overrides happened. In this case it is likely that the first append message in the probing state won't succeed, and there will be a few roundtrips before the appends stabilize.
  2. During a normal leadership run there was a network hiccup, and one append message got lost. The leader will eventually get a reject, but it will probably recover the flow of appends with a single probing message.
  2. If it is expected, why don't we send just one entry, or an empty message, when pr.State == tracker.StateProbe? Since in the probe state an append message is very likely to be rejected, sending just one entry or none might speed up the probing process.

What you're suggesting would be best for case (1). The current strategy is better for case (2) in terms of replication latency. Neither option is always better; it's a trade-off.

It's hard (but maybe not impossible) to distinguish between cases (1) and (2) on the leader's end, which is what it would take to make this decision dynamic. So we stick with the optimistic approach, I guess. I don't have data, though, to support the argument that the optimistic approach is best on average; I think it largely depends on the deployment.
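Purely as a sketch of what a dynamic decision might look like (the `probePolicy` type and its methods are hypothetical, not existing raft API): start optimistic with a full batch, and fall back to single-entry probes after a reject, so case (2) still recovers in one roundtrip while case (1) degrades to cheap probes.

```go
package main

import "fmt"

// probePolicy is a hypothetical adaptive strategy: the first probe
// optimistically sends a full batch (good for a transient network
// hiccup), and after a reject it falls back to single-entry probes
// (good for a stale follower that needs many roundtrips to converge).
type probePolicy struct {
	rejects int // consecutive rejected probes for this follower
}

func (p *probePolicy) limit(maxEntries int) int {
	if p.rejects > 0 {
		return 1 // pessimistic: the follower's log likely diverged far back
	}
	return maxEntries // optimistic: assume the first probe lands
}

func (p *probePolicy) onReject() { p.rejects++ }
func (p *probePolicy) onAccept() { p.rejects = 0 }

func main() {
	p := &probePolicy{}
	fmt.Println(p.limit(128)) // first probe: full batch
	p.onReject()
	fmt.Println(p.limit(128)) // after a reject: single entry
	p.onAccept()
	fmt.Println(p.limit(128)) // back in sync: full batch again
}
```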

pav-kv commented

Btw, see the related broader-scope issue #64: this and similar user/workload/deployment-dependent flow-control aspects could be delegated to the upper layer rather than hardcoded in the raft package.

kikimo commented

Thanks for your reply, closing this issue.