How do I know that FSM.Apply has completed on a majority of nodes?
bootjp opened this issue · 4 comments
Thank you for the wonderful library.
I really like hashicorp's OSS because it's simple and robust.
By the way, I want to know whether FSM.Apply has completed on a majority of nodes after FSM.Apply has been called.
Is there a way to do this?
A specific case where this might be necessary:
Assume that immediately after FSM.Apply writes data, the leader node crashes and leadership is transferred to another node.
In this case, to avoid losing the data written by FSM.Apply, we might want to wait for FSM.Apply to be called on a majority of nodes.
> Thank you for the wonderful library.
> I really like hashicorp's OSS because it's simple and robust.

Thank you!

> I want to know whether FSM.Apply has completed on a majority of nodes after FSM.Apply has been called.
> Is there a way to do this?
There is no way to do exactly this in the library... but read on!
> Assume that immediately after FSM.Apply writes data, the leader node crashes and leadership is transferred to another node.
> In this case, to avoid losing the data written by FSM.Apply, we might want to wait for FSM.Apply to be called on a majority of nodes.
This is true - good catch. The way we deal with this, though, is not by reading from a quorum to check they have all applied. Instead the library provides two things:

1. Thanks to the Raft specification, a newly elected leader must have the most up-to-date log - i.e. it already has every committed write in the cluster in its log (though not necessarily applied to its FSM).
2. We have a method called `Barrier` which is designed to be called by the leader (see lines 822 to 846 in 9174562).
Barrier effectively pushes a no-op write through Raft right through to the FSM apply. Since that write can't be applied until all writes committed before it have been applied, it guarantees that the FSM is up to date after a leadership change. Most users of this library should call it as soon as a new leader is established in the cluster, before that leader starts accepting new writes (or consistent reads), to ensure that not only does the leader have all the committed logs, but that they have actually been applied already. You can see this in Consul right here at the start of our leader loop: https://github.com/hashicorp/consul/blob/a02e9abcc17ec4cea09e8ee4dec300e36bd8ac7b/agent/consul/leader.go#L158-L165
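As a rough sketch of that pattern (the `Barrier(timeout)` and `Future.Error()` signatures match hashicorp/raft, but the small interfaces and fake node here exist only so the example runs without the library - real code would call these methods on `*raft.Raft` directly):

```go
package main

import (
	"fmt"
	"time"
)

// future mirrors raft.Future: Error() blocks until the barrier entry has
// been committed by a quorum and applied to the local FSM.
type future interface {
	Error() error
}

// barrierer is the one method of *raft.Raft this pattern needs.
type barrierer interface {
	Barrier(timeout time.Duration) future
}

// establishLeadership is what a new leader should run before serving any
// writes or consistent reads: once Barrier's future resolves without error,
// every entry committed by any previous leader has been applied locally.
func establishLeadership(r barrierer) error {
	if err := r.Barrier(time.Minute).Error(); err != nil {
		return fmt.Errorf("barrier failed, not safe to serve yet: %w", err)
	}
	return nil
}

// fakeRaft stands in for a real raft node so the sketch is runnable.
type fakeRaft struct{}

type okFuture struct{}

func (okFuture) Error() error { return nil }

func (fakeRaft) Barrier(timeout time.Duration) future { return okFuture{} }

func main() {
	if err := establishLeadership(fakeRaft{}); err != nil {
		fmt.Println("stepping down:", err)
		return
	}
	fmt.Println("FSM is up to date; safe to serve writes and consistent reads")
}
```

This mirrors the Consul leader-loop code linked above: gain leadership, issue the barrier, and only start serving once it completes.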
Does that help?
I see.
Please allow me to continue asking questions, since there are still some points I don't understand.
1. New leaders always have up-to-date logs.
2. The Raft log is stored in stable storage when Barrier is called.

Does this mean that whenever Barrier is called, thanks to mechanisms 1 and 2, the Raft log is stored in the followers' stable storage, the newly elected leader restores data from the latest Raft log, and no data is lost?
> Does this mean that whenever Barrier is called, thanks to mechanisms 1 and 2, the Raft log is stored in the followers' stable storage, the newly elected leader restores data from the latest Raft log, and no data is lost?
More or less, yes. Specifically, because of point 1 (the Raft spec), in order for the new leader to have been elected by a quorum at all, it must already have the most up-to-date log, containing at least every committed log entry that might have been acknowledged by any previous leader. So its log on disk is already "complete".
The `Barrier` call is effectively just a no-op write operation which is written to the leader's log, replicated to followers in the normal way, and marked as committed by the leader only once a quorum of the cluster has accepted it. Only after it is committed to a quorum of logs in the cluster will it be applied to the new leader's FSM, and the `Barrier` call will block until that apply completes.
The Barrier apply itself is a no-op, but because committed log entries are only ever applied in order, waiting until this new "write" has been applied to the FSM ensures that every write ever committed by a previous leader is also now applied to the new leader's FSM, so it may proceed to accept new requests with fully consistent local state.
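To make the in-order argument concrete, here is a small stdlib-only simulation (not hashicorp/raft code - the types are invented for illustration): an FSM applies committed entries strictly in index order, so the moment the barrier's no-op entry applies, every earlier committed entry must already have applied.

```go
package main

import "fmt"

// entry is a committed log entry; index increases monotonically.
type entry struct {
	index int
	noop  bool // true for the barrier's no-op entry
}

// fsm applies entries strictly in index order, like a Raft FSM.
type fsm struct {
	lastApplied int
}

func (f *fsm) apply(e entry) {
	if e.index != f.lastApplied+1 {
		panic("committed entries must apply in order")
	}
	f.lastApplied = e.index
}

func main() {
	f := &fsm{}

	// Entries 1..3 were committed by a previous leader; entry 4 is the
	// new leader's barrier no-op, appended after it wins the election.
	log := []entry{{1, false}, {2, false}, {3, false}, {4, true}}

	for _, e := range log {
		f.apply(e)
		if e.noop {
			// The barrier applied, so in-order apply guarantees
			// every earlier committed entry applied before it.
			fmt.Printf("barrier applied at index %d; FSM is up to date\n", e.index)
		}
	}
}
```

The barrier carries no data of its own; its only job is to be the last entry in this ordered sequence, which is exactly the guarantee described above.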
I see.
Thank you for your kind explanation.