hashicorp/raft

Ability for peers to denote others/themselves as too laggy to accept requests

Closed this issue · 8 comments

There's an issue I've personally experienced with both Nomad and Consul, and that has popped up as issues in each product separately, where an old peer that accidentally rejoins a cluster will immediately begin receiving requests wherever clients allow staleness. I would love for the raft library to provide some method for a cluster to say 'wow, this peer is very old. do not send any requests (including votes) their way.' It would be ideal if a peer could also go the other direction: 'wow, I am very old. I should forward any rpc sent towards me, as my data is so egregiously out of date my information will be incorrect.'

To be clear -- this is not necessarily an issue with raft itself and its default behavior. This is mostly an issue for operational teams, where some peers can unexpectedly rejoin a cluster after a long period of time. The most common case for me is when a very old host thought to be decommissioned is brought back online for whatever reason.

An example of how this manifests is hashicorp/nomad#18267. There are similar issues within the Consul repository.

The reason I believe this should live within the raft project itself is that Vault is also susceptible now that there's internal storage, and I'd rather not have something like this happen in a security-critical bit of software, or try to open an issue within the Vault project that may or may not ever be accepted.

As a bit of design, I'm imagining something akin to what consul-template provides with its max_stale parameter, but at a much lower level. Imagine a bit of HCL like...

    server {
      raft {
        max_lag = 250000
      }
    }

This would call into some function somewhere during peer membership where a peer could not only denote itself as behind, it could denote itself as so far behind that it's in an unknown state (I'm choosing this word as it's already a defined state). Then, a wrapper could be placed around any of the upstream libraries like go-immutable-radix to reject any queries into the datastore, as the peer would be marked as being in an unknown state, marking all current content invalid.
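
To make that a little more concrete, here is a minimal sketch (not a proposed raft API) of where such a gate could sit at the application layer, assuming a hypothetical max_lag setting like the HCL above. Only AppliedIndex and LastContact are real hashicorp/raft calls; every other name here, including where leaderLastIndex would come from, is made up for illustration.

    // Illustrative sketch only, not a proposed raft API: a place an
    // application could hang a "too laggy to serve reads" check, assuming a
    // hypothetical max_lag setting like the HCL above. Only AppliedIndex and
    // LastContact come from hashicorp/raft; every other name is made up.
    package lagcheck

    import (
        "errors"
        "time"

        "github.com/hashicorp/raft"
    )

    var ErrTooStale = errors.New("peer too far behind; refusing local reads")

    type StaleReadGate struct {
        r             *raft.Raft
        maxLag        uint64        // e.g. 250000, per the hypothetical config
        maxContactAge time.Duration // how long without leader contact is too long
    }

    // CheckRead refuses a local read when this peer looks too far behind.
    // leaderLastIndex has to come from the application layer (for example,
    // piggybacked on application RPCs); the raft library does not hand a
    // follower the leader's index directly.
    func (g *StaleReadGate) CheckRead(leaderLastIndex uint64) error {
        if time.Since(g.r.LastContact()) > g.maxContactAge {
            return ErrTooStale
        }
        if applied := g.r.AppliedIndex(); leaderLastIndex > applied &&
            leaderLastIndex-applied > g.maxLag {
            return ErrTooStale
        }
        return nil
    }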

Is something like this feasible? Is it worth looking into at all? I'm poking through the code a bit and finding that staleness is recorded in lots of areas; however, it doesn't seem to be used much at the raft level itself.

This would probably save the entire HashiCorp user community a significant amount of headache; otherwise I would never even bring this up.

What happens to quorum if a node demotes itself?

Disclosure: I am not an expert here, simply someone who's been in the weeds a lot.

To answer your question straight up:

What happens to quorum if a node demotes itself?

I'd expect it to still exist as part of the raft peers, simply be unable to participate in any future votes (until healthy). Let's say we still have 4 out of 5 healthy servers; I wouldn't expect anything to happen to the quorum.

For some details: to the best of my knowledge, there are states within raft (at least within Consul and Nomad's implementations) where a server is simultaneously too old to be a voter, so it is unable to correctly participate in quorum, and happily serving stale content to consumers.

Here's an example output of Nomad's operator autopilot health API (https://developer.hashicorp.com/nomad/api-docs/operator/autopilot#read-health), anonymized, from a real cluster:

    {
      "LastContact": "38.914708ms",
      "ID": "cafebeef",
      "Name": "funky-fresh-node",
      "Address": "10.0.0.5:4647",
      "SerfStatus": "alive",
      "Version": "1.50.0",
      "Leader": false,
      "LastTerm": 90,
      "LastIndex": 5000000,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2023-10-12T22:13:18Z"
    }

This cluster is currently healthy, so it's a counter-example for illustrative purposes.

However, if this cluster were under a lot of stress, this same node might report "Healthy": false and "Voter": false (the voter flag goes false after the node has been unhealthy for long enough, and does not go back to true until it has been healthy for long enough) while still serving stale reads. This is normally OK and we don't see any issues 99.999% of the time. That last 0.001% is a doozy, though.

When this exact situation happens and the node is really far behind (think a billion+ log entries behind), it leads to some very interesting situations, and that's why I opened this issue. The node in question is going to see voter terms that are so far out of whack that it's not going to be eligible to vote anyway. In fact, it may end up causing election storms (I have not seen this from this specific scenario, however it is possible).
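
To make "the data is there" concrete, here is a rough sketch that polls the autopilot health endpoint shown above and flags servers whose LastIndex trails the most advanced server. The structs only model the fields in the example output; the local Nomad address and the 250000 threshold are arbitrary assumptions.

    // Rough sketch: poll the autopilot health endpoint above and flag servers
    // whose LastIndex trails the most advanced server by more than an arbitrary
    // threshold. The structs only model the fields shown in the example output;
    // the Nomad address and the threshold are assumptions.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    type serverHealth struct {
        Name      string
        Healthy   bool
        Voter     bool
        LastIndex uint64
    }

    type autopilotHealth struct {
        Servers []serverHealth
    }

    func main() {
        resp, err := http.Get("http://127.0.0.1:4646/v1/operator/autopilot/health")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var health autopilotHealth
        if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
            panic(err)
        }

        // Treat the most advanced LastIndex as a stand-in for the leader's index.
        var highest uint64
        for _, s := range health.Servers {
            if s.LastIndex > highest {
                highest = s.LastIndex
            }
        }
        const maxLag = 250000 // arbitrary; mirrors the hypothetical max_lag above
        for _, s := range health.Servers {
            if highest-s.LastIndex > maxLag {
                fmt.Printf("%s is %d entries behind (Healthy=%v Voter=%v)\n",
                    s.Name, highest-s.LastIndex, s.Healthy, s.Voter)
            }
        }
    }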

I am not familiar with Nomad, to be clear.

I'd expect it to still exist as part of the raft peers, simply be unable to participate in any future votes (until healthy). Let's say we still have 4 out of 5 healthy servers; I wouldn't expect anything to happen to the quorum.

Yes, that is true. I just wanted to check that you understand the implications of removing nodes from a Raft system. What you want sounds like "pruning nodes", which Consul and rqlite implement:

https://rqlite.io/docs/clustering/general-guidelines/#automatically-removing-failed-nodes

The key thing to note is that this type of pruning is not done by Raft itself. The Hashicorp Raft implementation simply exposes the information, and it's up to systems built on Raft to decide if and when to prune.
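
For illustration, a pruning loop at the application layer might look roughly like the sketch below. GetConfiguration and RemoveServer are real hashicorp/raft calls; how a server is judged "failed" (the failedSince map here) is entirely up to the application.

    // Rough sketch of application-level pruning: remove servers the application
    // has considered failed for longer than some deadline. GetConfiguration and
    // RemoveServer are real hashicorp/raft calls; how "failedSince" gets
    // populated (Serf, health checks, ...) is entirely up to the application.
    package prune

    import (
        "time"

        "github.com/hashicorp/raft"
    )

    func pruneFailedServers(r *raft.Raft, failedSince map[raft.ServerID]time.Time, deadline time.Duration) error {
        future := r.GetConfiguration()
        if err := future.Error(); err != nil {
            return err
        }
        for _, srv := range future.Configuration().Servers {
            t, failed := failedSince[srv.ID]
            if !failed || time.Since(t) < deadline {
                continue
            }
            // Must be called on the leader; prevIndex=0 skips the
            // configuration-index check.
            if err := r.RemoveServer(srv.ID, 0, 0).Error(); err != nil {
                return err
            }
        }
        return nil
    }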

banks commented

Hey @chuckyz, thanks for the detailed request.

This is for sure a problem we've discussed before; however, I don't think it's possible to solve it at the raft library layer, unfortunately:

'wow, I am very old. I should forward any rpc sent towards me, as my data is so egregiously out of date my information will be incorrect.'

The crux of the problem here is that the raft library itself knows nothing about the application-level RPC forwarding that you are describing. Raft is the internal library and does some internal RPCs, but the "write something to raft" and "read something from the FSM" operations are left entirely up to the application to implement.

Exactly what those RPCs look like, and the rules we use for forwarding writes and either forwarding reads or allowing stale/inconsistent reads, is all application logic and not something this library knows anything about, so we can't really solve it just in this one place either.
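
To illustrate where that application logic lives, a follower read path typically ends up looking something like the sketch below. Query, Result, queryFSM, and forwardToLeader are stand-ins for application concepts; only State and LeaderWithID come from this library.

    // Sketch of the application-side read path being described. Query, Result,
    // queryFSM, and forwardToLeader are stand-ins for application concepts;
    // raft only answers "am I the leader?" and "where is the leader?".
    package readpath

    import "github.com/hashicorp/raft"

    type Query struct{ Key string }
    type Result struct{ Value string }

    func queryFSM(q Query) (Result, error) {
        // Read whatever the local FSM currently holds.
        return Result{}, nil
    }

    func forwardToLeader(addr raft.ServerAddress, q Query) (Result, error) {
        // Hypothetical application RPC to the leader.
        return Result{}, nil
    }

    func handleRead(r *raft.Raft, allowStale bool, q Query) (Result, error) {
        if r.State() == raft.Leader {
            // A linearizable read would also use VerifyLeader/Barrier here.
            return queryFSM(q)
        }
        if allowStale {
            // A follower serves whatever its local FSM holds, however old --
            // which is exactly the gap this issue is about.
            return queryFSM(q)
        }
        addr, _ := r.LeaderWithID()
        return forwardToLeader(addr, q)
    }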

It's a real problem, but I don't think there is a single fix for this here, since it's a property of the application using this library and its own consistency semantics that leads to the kinds of issues you are talking about.

Vault makes different decisions from either Consul or Nomad here, although it's not immune. Any time reads from followers are used as a scaling/performance improvement, this issue can arise, but that's not a core part of raft - in fact it is an explicit violation of Raft's consistency model in order to get better performance.

One difference between Vault and Consul is that Vault doesn't let any arbitrary read be made in a "non-consistent" way. It makes principled decisions around read paths that can tolerate local-only (and possibly stale) reads. It also has mechanisms to ensure that tokens being used in an "inconsistent" read can detect when the local node does not yet have the right data to enforce them, and wait out the raft inconsistency (or time out) rather than make an inconsistent policy decision.

If anyone has any specific raft-level features that would make this easier to implement in applications I'd be interested, but so far I've not thought of any short of massively increasing the scope of raft to include an opinionated framework for the whole application RPC and FSM usage too... but that would require rewriting all our products to adopt it anyway!

The library already exposes all the information needed for applications to implement the features you described, so I'm not seeing an easy way to lower the burden of just implementing those in each application 🤔

banks commented

There is a second issue you brought up, about an out-of-date peer not being able to vote. That is something we could potentially do here, but it's a departure from the Raft thesis and I've not yet seen how it would help.

Voting in raft isn't impacted by the staleness of data in a meaningful way. An old node that starts a vote is likely way behind in term, so it will just be ignored already. (There is a known case where a partitioned (but not stale/offline) server can disrupt a healthy leader needlessly; that is being solved elsewhere by PreVote from the thesis, but it's not really what you are thinking of here.)

In addition, raft already guarantees that every peer will reject a vote request from a candidate whose log is much older than its own, so I don't see a way that a "stale" node demoting itself helps, and it would certainly open up many new and complex situations that no one has studied before for the correctness of Raft!
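
For reference, the guarantee in question is the Raft paper's election restriction; roughly the following, as a paraphrase and not hashicorp/raft's actual code.

    // Paraphrase of the Raft election restriction (section 5.4.1 of the paper),
    // not hashicorp/raft's actual code: a voter only grants its vote to a
    // candidate whose log is at least as up to date as its own, so a badly
    // lagging rejoined node cannot win an election.
    package election

    func candidateLogUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
        if candLastTerm != myLastTerm {
            return candLastTerm > myLastTerm
        }
        return candLastIndex >= myLastIndex
    }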

too old to be a voter, so it is unable to correctly participate in quorum, and happily serving stale content to consumers.

This is referring to Autopilot functionality (a separate library and layer of automation in our products). You're right that autopilot will refuse to promote a new server to be a voter until it has caught up "enough", and that we currently do not have the equivalent logic for serving stale reads. But the logic around serving stale reads is encoded in the applications, not in either the raft or autopilot libraries.

So like you said - all the data is there, it's a question of how our applications use it.

It's also sadly not super simple - we've known about this (and chatted with you about it!) for a while, but each time we've tried to design a principled and robust solution, it's grown hairy super fast. I hope we'll still resolve it across all our products (although it's most obvious with Consul and Nomad) at some point.

One example of why it's hard: we have had other features that do a similar thing before, like the max stale constraints in Consul. These seem great - give me fast stale data unless it's older than X, then get it fresh. The problem is that in almost every real-world case where we saw this functionality being used, it caused or exacerbated outages 😱! In those cases, the reason the follower was stale was that the servers were overloaded! So what it actually achieved was just making the overload worse, because now all the cheap stale queries were being forwarded to the leader instead! In one memorable case it took a degraded cluster completely offline after exactly the number of minutes they had set as their "max stale" 😢.
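
For context, that "max stale" pattern looks roughly like the sketch below when written against the Consul Go client: do a cheap stale read, then only if the serving follower's data is older than the limit, re-issue the query as a default (leader-forwarded) read. AllowStale and QueryMeta.LastContact are real consul/api fields; the service lookup and threshold are illustrative, and the fallback at the end is exactly the step that turned cheap reads into leader load.

    // Sketch of the "max stale" pattern: do a cheap stale read, then only if the
    // serving follower's data is older than maxStale, re-issue the query as a
    // default (leader-forwarded) read. AllowStale and QueryMeta.LastContact are
    // real consul/api fields; the service lookup and threshold are illustrative.
    package maxstale

    import (
        "time"

        consul "github.com/hashicorp/consul/api"
    )

    func healthyInstances(c *consul.Client, service string, maxStale time.Duration) ([]*consul.ServiceEntry, error) {
        entries, meta, err := c.Health().Service(service, "", true, &consul.QueryOptions{AllowStale: true})
        if err != nil {
            return nil, err
        }
        if meta.LastContact <= maxStale {
            return entries, nil // the stale answer is fresh enough
        }
        // Too stale: fall back to a leader read. Under server overload this is
        // the step that turns cheap queries into extra leader load.
        entries, _, err = c.Health().Service(service, "", true, nil)
        return entries, err
    }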

There are other cases too where this logic has undesirable side-effects. For example, if the cluster has lost its leader, Consul stale queries can still be served. If a large portion of your load is service discovery requests, this kind of "fail static" is desirable, but if you have some mechanism that forwards requests or requires results to be only a certain age, then you fail closed instead and just return nothing until a leader is back. Again, this is a solvable problem in theory - it's the equivalent of HTTP Cache-Control stale-if-error, which Consul supports for agent caching - but it's an example of the details that make it harder than just "always forward stuff if we are old".

I still hope we can solve this in our products though. We do have all the data needed to make better decisions than we currently do, though it's hard to balance "how stale is OK?" against "how much is it worth making cheap requests expensive to ensure freshness?". Even if we had a way for operators to express that, would operators even be able to reason about that tradeoff?

banks commented

I should also add that in Consul, one much simpler feature we did add in response to some issues similar to yours was server_rejoin_max_age, which ensures that servers that have been offline for a while (a week by default) simply refuse to rejoin the cluster at all, at any level (i.e. not even joining Serf), without explicit operator intervention.

That much more directly addresses the specific concern about old nodes coming back and disrupting clients with super old data and is much easier to reason about than the above complications of stale semantics!

Hi again @banks!

I'd say that's more than enough reasoning and context to say that this is the wrong place to solve it. Do you know where else I could open an issue like this, other than directly in the raft code base? The radix DB used for watches is probably also going to be incorrect for the same reasons, as I believe it's unaware of things like raft state; that awareness lives in the application layer.

I'll close this issue.