RAFT leadership transfers and health check failures [v2.10.22]
Observed behavior
We've observed frequent RAFT leadership transfers of the $MQTT_PUBREL consumers and health check failures, even in a steady state. Occasionally, these issues escalate, causing sharp spikes in leadership transfers and health check failures, which lead to cluster downtime.
During these intense spikes, metrics from NATS Surveyor show an enormous surge in system messages, with counts reaching billions of messages per minute (metric name: nats_core_account_msgs_recv).
System details
- Peak load of 5k MQTT clients, each with 2 QoS 2 subscriptions, totaling 10k subscriptions across 10k MQTT topics.
- Messages produced at ~10 RPS
- A single NATS queue group subscription is used to consume MQTT-published messages on one topic.
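For context, the consuming side in the last bullet is effectively a plain core NATS queue subscription on the subject that the MQTT topic maps to (MQTT '/' becomes NATS '.'). A minimal sketch, assuming nats.go; the subject "telemetry.ingest" and queue name "workers" are placeholders, not the real names:

```go
// Minimal sketch of a single queue group subscription draining
// MQTT-published messages on one topic (placeholder names).
package main

import (
	"log"
	"os"
	"os/signal"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// The MQTT topic "telemetry/ingest" is delivered on the NATS subject
	// "telemetry.ingest"; the queue group spreads load across instances.
	_, err = nc.QueueSubscribe("telemetry.ingest", "workers", func(m *nats.Msg) {
		log.Printf("received %d bytes on %s", len(m.Data), m.Subject)
	})
	if err != nil {
		log.Fatal(err)
	}

	// Block until interrupted; a real consumer would do graceful shutdown.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
}
```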
Additional details
- Cluster of 3 nodes
- max_outstanding_catchup: 128MB
Associated logs:
- RAFT [cnrtt3eg - C-R3F-yMOeq7kb] Stepping down due to leadership transfer
- Falling behind in health check, commit 3202757 != applied 3202742
- Healthcheck failed: "JetStream is not current with the meta leader"
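For reference, a rough sketch of how the cluster and catchup settings above would appear in each server's configuration file. Only the 128MB value comes from this issue; server names, ports, paths, and the route list are placeholders:

```
server_name: nats-0
listen: 0.0.0.0:4222

jetstream {
  store_dir: /data/jetstream
  # Value reported in this issue.
  max_outstanding_catchup: 128MB
}

# MQTT listener (required for the MQTT clients described above).
mqtt {
  port: 1883
}

cluster {
  name: nats
  port: 6222
  routes: [
    nats://nats-0.nats:6222
    nats://nats-1.nats:6222
    nats://nats-2.nats:6222
  ]
}
```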
nats traffic
In steady state (taken minutes after starting the pods):
nats-traffic-of-sys-account.txt
Expected behavior
No leadership transfers of consumers & no health check failures in steady state.
Server and client version
NATS Server version 2.10.22
Host environment
Kubernetes v1.25
Steps to reproduce
Set up a 3-node NATS cluster, start 5k MQTT connections with 10k QoS 2 subscriptions (2 per client), and publish QoS 2 messages at 10 RPS.
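A scaled-down load-generator sketch for the MQTT side, assuming the Eclipse Paho Go client and placeholder broker URL, client IDs, and topics (this is not the original test harness):

```go
// Connects numClients MQTT clients to the NATS MQTT listener, gives each
// client two QoS 2 subscriptions, then publishes QoS 2 messages at ~10 msg/s.
// Scale numClients towards 5000 to approximate the reported load.
package main

import (
	"fmt"
	"log"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

const (
	broker     = "tcp://127.0.0.1:1883" // placeholder NATS MQTT endpoint
	numClients = 100                    // scaled down from the 5k in this report
)

func main() {
	clients := make([]mqtt.Client, 0, numClients)

	for i := 0; i < numClients; i++ {
		opts := mqtt.NewClientOptions().
			AddBroker(broker).
			SetClientID(fmt.Sprintf("repro-%d", i)).
			SetCleanSession(false) // persistent session, as QoS 2 implies state
		c := mqtt.NewClient(opts)
		if tok := c.Connect(); tok.Wait() && tok.Error() != nil {
			log.Fatalf("connect client %d: %v", i, tok.Error())
		}
		// Two QoS 2 subscriptions per client, each on its own topic.
		for s := 0; s < 2; s++ {
			topic := fmt.Sprintf("repro/%d/%d", i, s)
			if tok := c.Subscribe(topic, 2, nil); tok.Wait() && tok.Error() != nil {
				log.Fatalf("subscribe %s: %v", topic, tok.Error())
			}
		}
		clients = append(clients, c)
	}

	// Publish QoS 2 messages at roughly 10 per second, rotating over topics.
	pub := clients[0]
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; ; i++ {
		<-ticker.C
		topic := fmt.Sprintf("repro/%d/%d", i%numClients, i%2)
		if tok := pub.Publish(topic, 2, false, []byte("payload")); tok.Wait() && tok.Error() != nil {
			log.Printf("publish %s: %v", topic, tok.Error())
		}
	}
}
```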
Can you please provide more complete logs from around the times of the problem, as well as server configs?
Do you have account limits and/or max_file/max_mem set?
Normally the only things that should be causing leader transfers on streams in normal operation are a) if you ask for one by issuing a step-down, or b) if you've hit up against the configured JetStream system limits.
@neilalexander We do not have any account level limits. max_file_store is 50GB and max_memory_store is at 10GB.
Have shared the config file and complete logs over email. Let me know if you want any additional details.
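For readers following along, the two values above are server-level JetStream store limits; they live in the jetstream block, whereas the account-level limits asked about earlier would be declared per account. A sketch (the sizes are the ones quoted in this comment; the account name and its limits are purely illustrative):

```
jetstream {
  max_file_store: 50GB
  max_memory_store: 10GB
}

# Account-level JetStream limits (none are set in this deployment) would
# look roughly like this:
# accounts {
#   APP {
#     jetstream {
#       max_mem: 1GB
#       max_file: 20GB
#     }
#   }
# }
```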
I've taken a look at the logs you sent through but it appears as though the system is already unstable by the start of the logs? Was there a network-level event leading up to this, or any nodes that restarted unexpectedly?
@neilalexander We didn't observe any network-level events. The nodes did restart due to health check failures. I've sent you another email containing additional logs from an hour before the instability occurred. Let me know if that helps or if you have any additional queries.
I am going to try reproducing this from the MQTT side. The QoS 2-on-JetStream implementation is quite resource-intensive (per sub and per message); this kind of volume might have introduced failures, ultimately blocking the IO (readloop) while waiting for JS responses before acknowledging back to the MQTT clients, as required by the protocol.
@levb have shared the config file with Neil. Let me know if you need any additional inputs in reproducing this. Can jump on a call as well if required.
@levb @neilalexander My hunch is that the huge amount of RAFT sync required for R3 consumers might be causing the instability in the system. Even in a steady-state scenario we see 2 million system messages per minute. Let me know your thoughts on this.
@derekcollison Do we have any plans to support R3 file streams with R1 memory consumers?
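For concreteness, a sketch of the combination being asked about here, using the nats.go JetStream API with placeholder stream/consumer names: an R3 file-backed stream with an R1 memory consumer. Per the reply further down, the server currently forces consumer peer sets to match the parent stream for retention-based (interest/workqueue) streams such as the MQTT ones, so this override may not take effect there:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// R3, file-backed stream.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "EVENTS",
		Subjects: []string{"events.>"},
		Storage:  nats.FileStorage,
		Replicas: 3,
	}); err != nil {
		log.Fatal(err)
	}

	// Desired: an R1, memory-backed durable consumer on that stream.
	if _, err := js.AddConsumer("EVENTS", &nats.ConsumerConfig{
		Durable:       "worker",
		AckPolicy:     nats.AckExplicitPolicy,
		Replicas:      1,
		MemoryStorage: true,
	}); err != nil {
		log.Fatal(err)
	}
}
```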
@derekcollison
I believe the consumer_replicas setting under the MQTT config is currently not in use (the server ignores this config, see this), and that the consumer replicas are instead aligned with the parent stream's replicas for interest or workqueue streams (source).
Additionally, we have already set consumer_replicas to 1 in our production cluster, and I can see that the consumers still have a RAFT leader, which wouldn't be the case if this consumer replica override were functional.
Do we have plans to re-introduce this consumer replica override capability?
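For reference, the setting being discussed is the one under the mqtt block of the server configuration; a sketch (per this thread, the server currently ignores it and aligns consumer replicas with the parent stream):

```
mqtt {
  port: 1883
  # Present in the config, but reportedly not honored at the moment:
  # consumer replicas follow the parent (interest/workqueue) stream.
  consumer_replicas: 1
}
```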
It will work, but yes, if there are retention-based streams backing the MQTT stuff the system will override and force the peer sets to be the same.
Is this QoS 2?
@derekcollison this ticket is, but @slice-arpitkhatri said they got into this state with QoS 1 as well.
Yes, we have faced the issue with both QoS 1 and QoS 2.