whatyouhide/xandra

`Xandra.Cluster` connects, but doesn't think so

Closed this issue · 9 comments

I'm still digging to figure out why we cannot connect reliably with Xandra. Any pointers appreciated!

We are now running 0.16.0 with the peers patch from the draft PR.

I deployed a copy of our service that I could manually start in the container with `start_iex`, and ran tcpdump to capture the network traffic. What I see is that the first connection that goes out connects successfully and stays connected the whole time. Many other connections are made, all of which seem to establish a TCP session. I cannot see anything different at the TCP level between the runs where Xandra succeeds in starting the cluster and the runs where it fails. Because the traffic is over SSL (and I can't change that), I cannot see what is going on at the Cassandra protocol level; I assume that if there were an actual protocol issue, I would see Cassandra sending RSTs or FINs on the connection, which I don't.

Attached is a screenshot of Wireshark displaying the first connection. You can see by the packet count that it stays connected until I exit the application.

I built a little startup-watcher GenServer as a child of our Application that prevents our processing pipelines from starting until `Xandra.Cluster` is up. It logs how long it waited and how many tries it took. There seems to be no pattern in the amount of time it takes.
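Roughly, the watcher looks like this. This is a simplified sketch, not our real code: the module name, probe query, and retry interval are placeholders, and it assumes `Xandra.Cluster.execute/2` returns an error tuple while the pool has no usable connection.

```elixir
defmodule MyApp.StartupWatcher do
  use GenServer
  require Logger

  # Placeholder probe query and retry interval.
  @probe "SELECT release_version FROM system.local"
  @retry_ms 500

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Pipelines call this and block until the cluster has answered a query.
  def await(timeout \\ :infinity), do: GenServer.call(__MODULE__, :await, timeout)

  @impl true
  def init(opts) do
    send(self(), :probe)

    {:ok,
     %{
       cluster: Keyword.fetch!(opts, :cluster),
       started_at: System.monotonic_time(:millisecond),
       tries: 0,
       up?: false,
       waiters: []
     }}
  end

  @impl true
  def handle_info(:probe, state) do
    case Xandra.Cluster.execute(state.cluster, @probe) do
      {:ok, _page} ->
        waited_ms = System.monotonic_time(:millisecond) - state.started_at
        Logger.info("Xandra.Cluster up after #{waited_ms}ms and #{state.tries + 1} tries")
        # Release everyone who was blocked in await/1.
        Enum.each(state.waiters, &GenServer.reply(&1, :ok))
        {:noreply, %{state | up?: true, waiters: []}}

      {:error, _reason} ->
        Process.send_after(self(), :probe, @retry_ms)
        {:noreply, %{state | tries: state.tries + 1}}
    end
  end

  @impl true
  def handle_call(:await, _from, %{up?: true} = state), do: {:reply, :ok, state}
  def handle_call(:await, from, state), do: {:noreply, %{state | waiters: [from | state.waiters]}}
end
```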

I do not believe this is server-side, because we run golang-migrate on every app startup and it connects to Cassandra without any trouble just a few seconds before Xandra starts up.

Things I noticed:

  • I set the topology_refresh_interval to 30 seconds as a debugging experiment; that may be what sometimes kicks it into connecting when it doesn't connect on the first try. Anecdotally, the timestamps correlate (see the config sketch after this list).
  • When I set the CPU limit very low, it almost always connects right away.
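For context, this is roughly where that interval is set, in the supervision tree where the cluster is started. This is an illustrative sketch only: the node list is made up, and the refresh option name follows my spelling above and may differ across Xandra versions.

```elixir
# Illustrative supervision-tree entry, not our real config. The refresh
# option name here is an assumption and may not match your Xandra version.
children = [
  {Xandra.Cluster,
   name: Store,
   nodes: ["10.4.0.9:11172"],
   topology_refresh_interval: :timer.seconds(30)}
]

Supervisor.start_link(children, strategy: :one_for_one)
```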

My observations may point to some kind of timing issue. But I am not sure where to look. Something with internal messaging between the BEAM processes?

(Screenshot: Wireshark capture of the first connection, taken 2023-07-31 at 2:30:31 PM)

I realized that some telemetry was not being logged, so I added a logger for the telemetry events around connections.
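The handler is along these lines. A simplified sketch: the handler ID and log format are placeholders, and the event names are inferred from Xandra's telemetry conventions rather than copied verbatim from our code.

```elixir
defmodule MyApp.XandraTelemetryLogger do
  require Logger

  # Connection-related events; names inferred from the log output below.
  @events [
    [:xandra, :connected],
    [:xandra, :cluster, :change_event],
    [:xandra, :cluster, :control_connection, :connected]
  ]

  def attach do
    :telemetry.attach_many("xandra-connection-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event([:xandra | rest], _measurements, metadata, _config) do
    Logger.info("Xandra #{inspect(rest)}: #{inspect(metadata)}")
  end
end
```

Calling `MyApp.XandraTelemetryLogger.attach()` early in application startup is what produces the log lines below.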

The telemetry reveals that it is not connecting to more than one host in the cluster. Additionally, every time the topology refresh runs, it swaps the two hosts other than the one it opened connections to in and out of the cluster. This appears to have no effect on creating or stopping connections, however.

Here is the startup of the application in a data center with 3 nodes.

[…] [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.618 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.618 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 42}, data_center: "redacted", host_id: "ea819914-e415-4a70-8726-9d6c91653ffa", port: 11172, rack: "usw2-az2", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.619 [info] Xandra Cluster: [:control_connection, :connected]: %{cluster_name: Store, cluster_pid: #PID<0.5486.0>, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:03.130 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_down, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:03.145 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_down, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:03.535 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.589 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.706 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.723 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.753 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.790 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.883 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.906 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.907 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.909 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.927 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.934 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.944 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.950 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}
11:27:03.951 [info] Xandra: [:connected]: %{address: {10, 4, 0, 42}, connection_name: nil, port: 11172}

It establishes the control connection to 10.4.0.9, then adds 10.4.0.86 and 10.4.0.42. It then marks 10.4.0.9 down even though the control connection to it is still open, and marks 10.4.0.86 down as well. The only host left is 10.4.0.42, so it opens all of its pool connections to that host. Meanwhile the control connection to 10.4.0.9 remains open, and the other two hosts are actually up.

From then on, when the topology refresh happens, it oscillates the other two hosts into and out of the cluster like this:

11:47:03.166 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_down, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:47:03.678 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_down, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:52:02.658 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_up, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:52:02.659 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_up, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}

Something very odd is going on with the cluster discovery.

Hey Karl! Sorry for the late response, I've been a busy bee with other stuff. So, it looks like this is good up to:

11:27:02.618 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 86}, data_center: "redacted", host_id: "425bee92-19ab-4be1-99c9-392ce820670a", port: 11172, rack: "usw2-az3", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.618 [info] Xandra Cluster: [:change_event]: %{changed: true, cluster_name: Store, cluster_pid: #PID<0.5486.0>, event_type: :host_added, host: %Xandra.Cluster.Host{address: {10, 4, 0, 42}, data_center: "redacted", host_id: "ea819914-e415-4a70-8726-9d6c91653ffa", port: 11172, rack: "usw2-az2", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}
11:27:02.619 [info] Xandra Cluster: [:control_connection, :connected]: %{cluster_name: Store, cluster_pid: #PID<0.5486.0>, host: %Xandra.Cluster.Host{address: {10, 4, 0, 9}, data_center: "redacted", host_id: "473a0351-8c81-43c4-8a79-9755134a17d4", port: 11172, rack: "usw2-az1", release_version: "4.0.4", schema_version: "724335b0-8709-3d1d-8b94-233dfe8fbce5"}

but then it goes haywire. I think I know what's going on: we introduced a "health check" for connected hosts, since Cassandra's internal change events are basically garbage. What I think is happening is that it's not quick enough to connect to the nodes, so by the time the next health check runs, most nodes report as "down", which causes the :host_down events. If you reduce the CPU, it's probably not fast enough to run the health check either, so it doesn't disconnect anything.
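To make the suspected race concrete, here's a toy model (this is not Xandra's actual code, just an illustration of the failure mode):

```elixir
# Toy model of the suspected race, not Xandra's real implementation.
# Hosts whose pool connection hasn't finished its TLS handshake and CQL
# startup by the time the health check fires get reported as down.
defmodule HealthCheckRace do
  def check(hosts, connected_addresses) do
    Enum.map(hosts, fn host ->
      if host.address in connected_addresses do
        {:host_up, host}
      else
        # Under CPU pressure the connect window widens, so live hosts
        # end up wrongly marked down here.
        {:host_down, host}
      end
    end)
  end
end
```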

I'll take a look at merging #309 soon, and then try to fix this. I'll let you know when you'll have something to try out!

@relistan, one question: if you query 10.4.0.9 (the control-connection node) for `system.peers`, you see the other peers, right? You should have a `[:xandra, :cluster, :discovered_peers]` Telemetry event for that.
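Something like this one-off check would confirm it (a sketch: the single-node connection options and column list here are illustrative, while `system.peers` is the standard Cassandra system table):

```elixir
# Hypothetical manual check against the control-connection node.
{:ok, conn} = Xandra.start_link(nodes: ["10.4.0.9:11172"], encryption: true)

{:ok, page} =
  Xandra.execute(conn, "SELECT peer, data_center, rack, host_id FROM system.peers")

IO.inspect(Enum.to_list(page), label: "peers")
```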

Hey @relistan! I've been hard at work rewriting Xandra's clustering functionality. Wooph 😓

It should be more reliable now, but it's painful as hell to test this locally with small synthetic clusters. I'm working with folks at Veeps (where I work) to be able to test this better as well, but in the meantime I wanted to let you know that you can give `main` a try and see how it goes 😉
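If it helps, pointing the dependency at the GitHub `main` branch in `mix.exs` is enough to try it:

```elixir
# In mix.exs: pull Xandra from the main branch instead of Hex.
defp deps do
  [
    {:xandra, github: "whatyouhide/xandra", branch: "main"}
  ]
end
```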

Thanks @whatyouhide! I'm out until the middle of next week but maybe @britto or @jacquelineIO can take a look before that if they have time.

That would be fantastic, as it would also help me figure out if something is wrong and start working on fixes. @britto and @jacquelineIO, please reach out if I can help in any way!

Xandra has changed so much in the past couple of months that I’m gonna go ahead and say that this is fixed 😄 It might not be, but I’m pretty sure any error would look different at this point!