Periodic refresh of a DOWN node causes routing to downed node

Question

Periodic refresh of a DOWN node causes routing to downed node

Closed this issue a year ago · 9 comments

Currently, when a node is down and we do a periodic topology refresh, we consider the node as UP and perform a healtcheck, and before we conclude the healthcheck, the new connection pool for the downed node is already being routed some queries, which causes noise in the logs due to connection failures. We should be waiting for the healtcheck to be completed before we route queries. See #301

Answer 1 · 2023-04-20T10:45:21.000Z

@harunzengin I'm not sure how we can do that, because it's a chicken and egg problem, no? In order to perform the healthcheck, we need to start the pool, so that connections register in the registry. But if the node is down, should we start the pool at all? Maybe we can start the pool and just not route queries to it... Thoughts?

Answer 2 · 2023-04-20T11:56:34.000Z

Yeah, it was our idea as well to maybe not route anything before the healthcheck is completed. And since the healthcheck is basically to wait a fixed amount of time n before looking at the Registry whether the process registered itself, we could maybe not accept any queries in the load balancing strategy for t>n?

Answer 3 · 2023-04-20T14:33:08.000Z

Maybe we should just mark pools as ready or not in the cluster, and then only pick pools that are up? Here.

Otherwise, we can think of marking the host as "up" in the load-balancing plan only once it passes the health check?

Answer 4 · 2023-04-21T09:41:31.000Z

So in addition to :up and :down in the state, having a :ready, :not_ready state you mean?

Shall we introduce a new message {:healtcheck, :pass, host} and {:healtcheck, :fail, host} from control_connection to cluster then?

Answer 5 · 2023-04-21T13:24:42.000Z

@harunzengin I don't think we need a new state. After all, a host is "up" when we can route queries to it, right? Could we consider the host as added and "down" until we get the healthcheck message?

Answer 6 · 2023-04-24T12:27:06.000Z

Currently, the control connection sends a :host_up message to the cluster whenever it thinks that a node is up, this can be from cassandra gossip, or from the system peers table. In either case, the cluster process starts the connection and sends the control connection a {:healthcheck, host} message. If the control connection thinks the node is down, it sends a :node_down message to the cluster.

Now, we need to change this so that the control connection sends sth. like a :host_init message instead of :host_up. The cluster initializes the connection, then sends a {:healthcheck, host} message to the control connection. The control connection sends a :host_up or :host_down message afterwards. Thoughts? @whatyouhide

Answer 7 · 2023-04-24T14:41:41.000Z

Yes exactly, that's what I was thinking of: using a new :host_reported_up (I like this name in particular) message. This would tell us that a host is in the peers table or reported as up, but then we'd only send :host_up or :host_down based on the healthcheck. Does this make sense?

Answer 8 · 2023-04-24T14:42:44.000Z

Yeap, that makes sense. I can start working on this then 👍

Answer 9 · 2023-04-24T14:43:23.000Z

Awesome, thanks @harunzengin! 🙏