hashicorp/raft

not storing logs at leader

magnoyu opened this issue · 8 comments

Hi, I have a quick question. Let's say I have a 5-node cluster, and I don't store logs at the leader but I do store them at all 4 followers. What is the implication during recovery? As an example, let's say we have these persisted logs:

Leader Node 1 log: empty
Node 2 log: index 21
Node 3 log: index 32
Node 4 log: index 45
Node 5 log: index 55

Let's say the leader has received up to index 55 and never stores any logs.
In terms of commit, the cluster has committed and processed up to index 45.
Let's say all the cluster nodes are terminated and now we want to recover.
Is there a way to recover from index 45, even though that's only the 2nd latest (out of 5) persisted log, instead of from index 32, which I think is what the cluster would normally do?
The hypothesis is that follower nodes cannot have received any log without the leader first receiving it. So I could copy node 5's logs to node 1 and start the recovery... but I wonder if this is something configurable, so that I don't need to do the manual copy.

@banks could you help out?

banks commented

Hey @magnoyu, I wrote a long reply about this but realised I don't think I understood your question.

Are you asking specifically about the use of RecoverCluster? When you say "I don't store logs at the leader", what do you mean? Our implementation never replicates logs until it has fsynced them to disk, so this can't happen normally, but it could happen if, say, the old leader's disk failed and was replaced with an empty one - is that the case you are asking about?

I think the comments on RecoverCluster cover this - let us know which parts you aren't sure how to apply to your situation!

raft/api.go

Lines 272 to 300 in 9174562

// RecoverCluster is used to manually force a new configuration in order to
// recover from a loss of quorum where the current configuration cannot be
// restored, such as when several servers die at the same time. This works by
// reading all the current state for this server, creating a snapshot with the
// supplied configuration, and then truncating the Raft log. This is the only
// safe way to force a given configuration without actually altering the log to
// insert any new entries, which could cause conflicts with other servers with
// different state.
//
// WARNING! This operation implicitly commits all entries in the Raft log, so
// in general this is an extremely unsafe operation. If you've lost your other
// servers and are performing a manual recovery, then you've also lost the
// commit information, so this is likely the best you can do, but you should be
// aware that calling this can cause Raft log entries that were in the process
// of being replicated but not yet committed to become committed.
//
// Note the FSM passed here is used for the snapshot operations and will be
// left in a state that should not be used by the application. Be sure to
// discard this FSM and any associated state and provide a fresh one when
// calling NewRaft later.
//
// A typical way to recover the cluster is to shut down all servers and then
// run RecoverCluster on every server using an identical configuration. When
// the cluster is then restarted, an election should occur and then Raft will
// resume normal operation. If it's desired to make a particular server the
// leader, this can be used to inject a new configuration with that server as
// the sole voter, and then join up other new clean-state peer servers using
// the usual APIs in order to bring the cluster back into a known state.
func RecoverCluster(conf *Config, fsm FSM, logs LogStore, stable StableStore,
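
For concreteness, here is a minimal sketch (not part of this thread) of what running RecoverCluster on one server could look like, assuming raft-boltdb backed stores and a TCP transport; the paths, server IDs and addresses are placeholders, and the identical Configuration would need to be supplied on every server being recovered:

// Minimal recovery sketch; placeholder paths, IDs and addresses.
package recovery

import (
	"net"
	"os"
	"time"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

func recoverNode(fsm raft.FSM) error {
	conf := raft.DefaultConfig()
	conf.LocalID = raft.ServerID("node1") // placeholder ID

	// BoltStore implements both LogStore and StableStore.
	store, err := raftboltdb.NewBoltStore("/var/lib/myapp/raft.db") // placeholder path
	if err != nil {
		return err
	}
	snaps, err := raft.NewFileSnapshotStore("/var/lib/myapp/snapshots", 2, os.Stderr)
	if err != nil {
		return err
	}
	addr, err := net.ResolveTCPAddr("tcp", "10.0.0.1:7000") // placeholder address
	if err != nil {
		return err
	}
	trans, err := raft.NewTCPTransport("10.0.0.1:7000", addr, 3, 10*time.Second, os.Stderr)
	if err != nil {
		return err
	}

	// The same Configuration must be used on every server being recovered.
	configuration := raft.Configuration{
		Servers: []raft.Server{
			{Suffrage: raft.Voter, ID: "node1", Address: "10.0.0.1:7000"},
			{Suffrage: raft.Voter, ID: "node5", Address: "10.0.0.5:7000"},
			// ... the remaining servers
		},
	}

	return raft.RecoverCluster(conf, fsm, store, store, snaps, trans, configuration)
}

As the doc comment above warns, the FSM passed to RecoverCluster should then be discarded and a fresh one provided when calling NewRaft to restart the node.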

magnoyu commented

Thanks for your reply. Actually, let me describe what I have in mind.

  1. Let's say, as an example, I have a 5-node cluster.
  2. The leader calls StoreLogs asynchronously (through a new channel), as opposed to the current behaviour where the leader's StoreLogs is called synchronously.
  3. When the leader's StoreLogs returns, the leader then calls commitment.match.
  4. In the meantime, the leader notifies followers of the new log.

The reason I'm asking is performance: currently the leader's StoreLogs is on the hot path. When fsync is randomly slow in a few cases (for reasons I don't really understand), all processing is delayed. However, if we allow the leader to call StoreLogs asynchronously, and that call is slow/blocked on IO while we already have 3 followers responding OK, shouldn't that be sufficient to proceed? (A rough sketch of the idea is below.)

If at any point we need to recover the cluster, there should be at minimum 3 nodes that have the committed logs (even though it's possible all 3 of those nodes were followers before - but Raft doesn't need to know who the previous leader was when it recovers?).
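
To visualize the proposal above, here is a rough, purely illustrative sketch of handing the leader's StoreLogs call to a background goroutine over a channel; none of these types exist in hashicorp/raft, and this is not how the library is structured internally:

// Illustrative only: an async wrapper around a raft.LogStore.
package sketch

import "github.com/hashicorp/raft"

// storeRequest is a hypothetical unit of work for the background store loop.
type storeRequest struct {
	logs []*raft.Log
	done chan error // receives the result once the logs hit the LogStore
}

// asyncStore wraps a raft.LogStore and serializes appends on one goroutine,
// keeping the (possibly slow) fsync off the leader's dispatch hot path.
type asyncStore struct {
	inner raft.LogStore
	reqs  chan storeRequest
}

func newAsyncStore(inner raft.LogStore) *asyncStore {
	a := &asyncStore{inner: inner, reqs: make(chan storeRequest, 64)}
	go a.loop()
	return a
}

func (a *asyncStore) loop() {
	for req := range a.reqs {
		req.done <- a.inner.StoreLogs(req.logs) // the fsync happens here
	}
}

// StoreLogsAsync returns immediately; the idea in step 3 above is that the
// leader would only advance its own match index (commitment.match) once the
// returned channel reports success.
func (a *asyncStore) StoreLogsAsync(logs []*raft.Log) <-chan error {
	done := make(chan error, 1)
	a.reqs <- storeRequest{logs: logs, done: done}
	return done
}

The replies below discuss the safety implications of counting the leader toward the quorum before such a result has arrived, or when the leader skips the fsync entirely.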

banks commented

Ah I see.

Yeah, every commit goes to the leader's log first before being replicated in our implementation. It's technically possible in Raft to write the leader's log in parallel with replication - this is an optimization detailed in Diego's full Raft thesis (section 10.2.1) - but implementing it would mean major changes to this library. While I'd personally be interested in doing it, I don't think it's enough of a pain point for our internal use-cases to warrant the large engineering effort it would take to test and be confident in such a major change to our raft lib!

Etcd's raft library IIRC can make use of that optimization, so that might be a consideration.

All that said, if you don't fsync every log write then the correctness of Raft is much harder to reason about. All of the formal proofs of raft's correctness and all of the recovery procedures are based on the assumption that disk writes once confirmed are never lost.

Depending on how important your data is, you might be OK with occasional data loss.

In other words, there is no correct (in the sense of linearizable under all situations) way to use this library without an fsync on every single LogStore append. Choosing to skip it anyway for performance reasons may be suitable for you, but it goes against all the design assumptions in the protocol and library, so you will have to do a lot of reasoning yourself to understand the risks and failure modes involved!

There are notable users of Raft (though not this lib, I don't think) that choose not to fsync and instead rely on replication to mask failures. Nats.io's JetStream does this, as far as I can tell. But I've never seen a rigorous analysis of the correctness of doing so - it's certainly outside of what the Raft thesis assumes!

Another option to look at: TiKV implemented a feature in their Raft implementation (in Rust) that allows all servers to delay fsyncing for some amount of time to amortize the cost. From a quick look it seems like they modified the protocol so the leader tracks which entries are actually synced, which seems like a big departure from Raft to me (though I didn't look in detail). I've considered a similar optimization in this library before, where the LogStore interface instead returns a future which only completes after the next fsync: replication can then continue in parallel with the fsync, but we can still wait and be sure about the fsync when deciding that an entry is committed, preserving safety. But as mentioned, it would be a complex change with lots of risk due to subtly changing the assumptions throughout the system!
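
For illustration, the future-returning LogStore idea could look something like this; the names are hypothetical and not part of the library's API:

// Hypothetical sketch of a deferred-durability LogStore.
package sketch

import "github.com/hashicorp/raft"

// AppendFuture would only complete after the batched fsync covering the
// appended entries has finished.
type AppendFuture interface {
	// Error blocks until the entries are durably on disk (or storage failed).
	Error() error
}

// DeferredLogStore is a hypothetical variant of raft.LogStore: StoreLogsDeferred
// returns as soon as the entries are staged, so replication can start right
// away, but the commitment logic would still wait on the future before counting
// this server toward the quorum, preserving the fsync-before-commit guarantee.
type DeferredLogStore interface {
	raft.LogStore
	StoreLogsDeferred(logs []*raft.Log) AppendFuture
}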

magnoyu commented

@banks thanks for the detailed explanation. We were seeing some notable performance improvement with the changes we made, but we're unsure about the risk introduced. (To clarify: we changed it so the leader doesn't do an fsync; only the followers do.) It brings about a 10%-15% latency gain.

My question is: what is the best way for us to evaluate the risk introduced? Is there some automated test/process we could run to check whether we will see any data loss with this change?

banks commented

what is the best way for us to evaluate the risk introduced?

That's quite hard to answer! As far as I know, Raft has only been proven correct assuming every log append is actually persisted on disk. It is only specified for a model where a node that has crashed and restarted comes back with either all the state it previously acknowledged or none at all (in which case it is the same as adding a brand-new node to the cluster).

To evaluate the possible outcomes of changing one of those fundamental assumptions requires very careful analysis and testing. Depending on the level of confidence you need, you might want to re-model in TLA+ to explore the full space of outcomes that might arise. You may wish to do Jepsen-style "chaos" testing where you run simulated workloads with lots of node failures that lose disks, etc.

Or you can choose to say "seems fine, the chance of a crash that loses unsynced data is low enough that we'll just wait and see, and recover from a backup if we really need to". It really depends on the criticality of your system 😄!

FWIW, not fsyncing logs is less obviously bad than not fsyncing the StableStore where term/vote information is stored. If that is still calling fsync, then at least you don't have trivial split-brain issues where multiple nodes can be elected leader in the same term, etc.

But it is more risky with our Raft implementation than with others, because we assume that once the leader has dispatched logs they are "safe", and we automatically count the leader as one of the quorum when we decide whether a log is committed. If an entry actually hasn't been synced and is lost, it's possible that that write will be lost forever even though we might have "acked" it back to the client that wrote it, violating linearizability. If that's not a big deal in your case then fine, but it would certainly fail a Jepsen-style test for linearizability guarantees.
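
To make that concrete, the commit decision boils down to taking a quorum median over every server's highest stored index, with the leader's own (possibly un-fsynced) last dispatched index included. An illustrative sketch, not the library's actual code:

// Illustrative quorum-median commit calculation.
package sketch

import "sort"

// commitIndex returns the highest index stored on a majority of servers. The
// leader's entry in matchIndexes is its last *dispatched* index, whether or
// not that entry has actually been fsynced.
func commitIndex(matchIndexes map[string]uint64) uint64 {
	matched := make([]uint64, 0, len(matchIndexes))
	for _, idx := range matchIndexes {
		matched = append(matched, idx)
	}
	if len(matched) == 0 {
		return 0
	}
	sort.Slice(matched, func(i, j int) bool { return matched[i] < matched[j] })
	// With N servers, the value at position (N-1)/2 of the ascending sort is
	// present on at least a majority of them.
	return matched[(len(matched)-1)/2]
}

For example, with match indexes {A: 13, B: 13, C: 13, D: 10, E: 10}, the quorum median is 13, so entry 13 is treated as committed even if leader A never actually persisted it.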

Actually observing a violation in real life might be quite hard, but it is possible. Off the top of my head, something like this could occur:

Assume there is a 5 node cluster of servers A, B, C, D, E. Quorum size is 3, current leader is A.

  1. Nodes D and E get partitioned away. The last index in their logs is 10. They can't do anything as they are in a minority partition.
  2. Nodes A, B, C continue to commit entries 11, 12, 13, but the leader A is not syncing them to disk.
  3. The datacenter power goes out and all servers stop
  4. Servers come back on line slowly as different racks are powered up
  5. Servers A, D and E come back first - the partition is healed as the network switch has reset
  6. Server A recovers only the logs that were fsynced by the OS which was only up to entry 9
  7. D and E have logs up to 10
  8. The three servers are enough to form a Quorum so they do and elect D as the new leader since it has the most recent logs.
  9. Server A the old leader is now a follower and replicates 10 again from the new leader
  10. The cluster is up and running just fine with no errors in logs... but it lost logs 11, 12, 13 which we already acked!
  11. Eventually B and C come back, but their 11, 12, 13 entries are from an older term than the new leader's, so they will be discarded - Raft assumes those entries can't possibly have been committed without being persisted on a quorum of disks, which would guarantee that the new leader also has them. This is the fundamental guarantee you have broken by not fsyncing leader logs.

That's one example off the top of my head. It might be a risk you're willing to take, but it shows why this isn't "correct" in all possible cases! Remember too that "network partitions" can really mean any sort of transient network failure, so even if this power-outage sequence seems far-fetched, it's hard to reason about every possible type of failure that could lead to an equivalent outcome - that's what model checking these algorithms and formal proofs are good for!

Does this help?

banks commented

FWIW, I think a Raft implementation that writes the leader's logs in parallel with replication would perform closer to your no-fsync benchmark, because, assuming disks are only occasionally slow to fsync, on those occasions the commit latency is bounded by the slowest member of whichever quorum responds first - so if the leader's disk is slower than another follower's, you don't have to wait for the leader.

That's not to mention that you'd automatically get a major latency improvement from that optimisation anyway, because a commit then costs only 1 RTT plus 1 fsync delay instead of 1 RTT plus 2 fsync delays.
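
For a rough illustration with made-up numbers: if an fsync takes 5ms and a one-way network hop 0.25ms, the current sequential path costs roughly 5 + (0.25 + 5 + 0.25) = 10.5ms per commit (leader fsync, then an AppendEntries round trip that includes the follower's fsync), whereas the parallel version costs roughly max(5, 0.25 + 5 + 0.25) = 5.5ms - and when the leader's disk stalls, only the fastest quorum of followers matters.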

But, as mentioned, that would come close to a full rewrite of the library so is unlikely to be something that is prioritized soon! If you are interested, you may want to consider alternative implementations!

If you do decide to go ahead without fsync, you might consider dropping fsync on all logs, not just the leader's. You've already given up linearizability/strict consistency anyway, so it's not likely to be much worse if none of the servers fsync logs, and you'd get a much bigger performance increase!

A final thought if you are still considering this: a 15% latency improvement in the common case is not that big an improvement to trade off against the correctness of the system, IMO. If you decide that correctness is not important enough that 15% lower latency is better, it might be worth reviewing why you are using an "expensive" consensus protocol in the first place! Granted, I don't know of many off-the-shelf Go libraries for primary-backup replication or something else that is more efficient but without the consistency guarantees, but if performance is that critical to your application it might be worth considering. Or even just using a SQL DB or something that does it for you!

banks commented

I'm going to close this as I think the original question was answered. The (correct) optimization discussed here - committing the leader's logs in parallel with replication - is a topic I'm still interested in and have had some new ideas about, but I will continue with those in a new issue as it's not the same topic that was raised here.