Metaswitch/clearwater-docker

Cassandra node failing

My Cassandra node appears to be failing and repeatedly restarting, which in turn leads to other components failing. I'm a little puzzled by this, because I think I once installed it without seeing this behaviour, but it now consistently gets into this state every time.

The cluster manager logs in /var/log/clearwater-cluster-manager end with something like the following:

15-11-2016 16:15:55.063 UTC ERROR client.py:817 (thread CassandraPlugin): Request to server http://10.0.1.14:4000 failed: MaxRetryError('HTTPConnectionPool(host=\'10.0.1.14\', port=4000): Max retries exceeded with url: /v2/keys/clearwater/node_type_cassandra/clustering/cassandra?waitIndex=19&recursive=false&wait=true (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'10.0.1.14\', port=4000): Read timed out. (read timeout=5)",))',)
15-11-2016 16:16:15.076 UTC ERROR client.py:817 (thread CassandraPlugin): Request to server http://10.0.1.14:4000 failed: MaxRetryError('HTTPConnectionPool(host=\'10.0.1.14\', port=4000): Max retries exceeded with url: /v2/keys/clearwater/node_type_cassandra/clustering/cassandra?waitIndex=19&recursive=false&wait=true (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'10.0.1.14\', port=4000): Read timed out. (read timeout=5)",))',)
15-11-2016 16:16:35.094 UTC ERROR client.py:817 (thread CassandraPlugin): Request to server http://10.0.1.14:4000 failed: MaxRetryError('HTTPConnectionPool(host=\'10.0.1.14\', port=4000): Max retries exceeded with url: /v2/keys/clearwater/node_type_cassandra/clustering/cassandra?waitIndex=19&recursive=false&wait=true (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'10.0.1.14\', port=4000): Read timed out. (read timeout=5)",))',)

However, I cannot see why it is repeatedly failing to read that, since reading the value manually gives a result:

root@cassandra-884159819-emude:/var/log# etcdctl -C http://10.0.1.14:4000 ls --recursive
/clearwater
/clearwater/node_type_cassandra
/clearwater/node_type_cassandra/clustering
/clearwater/node_type_cassandra/clustering/cassandra
/clearwater/site1
/clearwater/site1/node_type_memcached
/clearwater/site1/node_type_memcached/clustering
/clearwater/site1/node_type_memcached/clustering/memcached
/clearwater/site1/node_type_chronos
/clearwater/site1/node_type_chronos/clustering
/clearwater/site1/node_type_chronos/clustering/chronos
root@cassandra-884159819-emude:/var/log# etcdctl -C http://10.0.1.14:4000 get /clearwater/node_type_cassandra/clustering/cassandra
{"10.0.1.14": "normal"}

The cluster manager logs are just log spam; those errors are the etcd long-poll watch (wait=true) timing out when nothing has changed, and don't actually show a problem (https://github.com/Metaswitch/clearwater-etcd/issues/324).

What do the Cassandra logs say?
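
(With the standard Debian packaging they normally end up in /var/log/cassandra/system.log; I'm assuming the image keeps the default layout.)

# Assuming the default Debian/Cassandra log location inside the container:
tail -n 100 /var/log/cassandra/system.log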

Cassandra is repeatedly being terminated with exit status 232, whatever that might mean. Lots of this:

2016-11-21T17:11:06.203049344Z 2016-11-21 17:11:06,202 INFO exited: cassandra (exit status 232; not expected)
2016-11-21T17:11:07.205551822Z 2016-11-21 17:11:07,205 INFO spawned: 'cassandra' with pid 7296
2016-11-21T17:11:08.207248024Z 2016-11-21 17:11:08,207 INFO success: cassandra entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-11-21T17:11:09.212715058Z 2016-11-21 17:11:09,212 INFO exited: cassandra (exit status 232; not expected)
2016-11-21T17:11:10.214685490Z 2016-11-21 17:11:10,214 INFO spawned: 'cassandra' with pid 7299
2016-11-21T17:11:11.216579461Z 2016-11-21 17:11:11,216 INFO success: cassandra entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

Nothing in /var/log/cassandra. Let me know if there is somewhere else I can look.

Mm. Is there a /etc/cassandra/cassandra.yaml file?
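
A couple of quick checks from inside the container might also help; I'm assuming supervisorctl is available and that the supervised program is called 'cassandra', as in your supervisor logs above:

# Does the generated config exist at all?
ls -l /etc/cassandra/

# Run Cassandra in the foreground so the startup error lands on the terminal,
# stopping the supervised copy first so the two don't fight over ports.
supervisorctl stop cassandra
cassandra -f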

Well, I wonder if something is preventing IP addresses from being set somewhere?

root@cassandra-2869798415-68885:/var/log/cassandra# cat /etc/cassandra/cassandra-topology.yaml 
# YAML topology configuration file for Cassandra,
# to be used with YamlFileNetworkTopologySnitch.

# The topology, as a list of data centers.
topology:
    # Each data center has a name and a list of racks.
    - dc_name: DC1
      racks:
          # Each rack has a name and a list of nodes.
          - rack_name: c1
            nodes:
                # Each node has a broadcast address (required)
                # and a data-center-local address (optional).
                # If dc_local_address is specified, its peers
                # in the same data center will attempt to
                # reconnect over that address instead.
                - broadcast_address: 1.2.3.4
                  dc_local_address: 5.6.7.8

# Default data center name for unknown nodes; defaults to "UNKNOWN"
# default_dc_name: UNKNOWN

# Default rack name for unknown nodes; defaults to "UNKNOWN"
# default_rack_name: UNKNOWN

That looks OK I think; we only touch the cassandra.yaml and the /etc/cassandra/cassandra-rackdc.properties files. What do these look like?

Neither of them exists.

Looks like there is code to change those files in /usr/share/clearwater/clearwater-cluster-manager/plugins/cassandra_plugin.py that presumably is not being run. I'm not sure when it should be run, or what might be causing it to fail. I do see the following in the cluster manager logs.

root@cassandra-2869798415-68885:/var/log/clearwater-cluster-manager# head -50 cluster-manager_20161121T150000Z.txt 
21-11-2016 15:19:16.825 UTC INFO utils.py:441 (thread MainThread): Acquired exclusive lock on /var/run/clearwater-cluster-manager.pid.lockfile
21-11-2016 15:19:16.825 UTC INFO plugin_loader.py:51 (thread MainThread): Inspecting cassandra_plugin.py
21-11-2016 15:19:16.895 UTC INFO plugin_loader.py:59 (thread MainThread): Loading cassandra_plugin.py
21-11-2016 15:19:16.895 UTC INFO plugin_loader.py:61 (thread MainThread): Loaded cassandra_plugin.py successfully
21-11-2016 15:19:16.896 UTC INFO main.py:215 (thread MainThread): Loaded plugin <cassandra_plugin.CassandraPlugin object at 0x7f54f881fb10>
21-11-2016 15:19:16.901 UTC INFO common_etcd_synchronizer.py:194 (thread CassandraPlugin): Read value {"10.0.2.8": "joining, acknowledged change", "10.0.0.12": "normal"} from etcd, comparing to last value None
21-11-2016 15:19:16.901 UTC INFO etcd_synchronizer.py:72 (thread CassandraPlugin): Got new state {"10.0.2.8": "joining, acknowledged change", "10.0.0.12": "normal"} from etcd
21-11-2016 15:19:16.901 UTC INFO synchronization_fsm.py:131 (thread CassandraPlugin): Entered state machine for 10.0.1.11 with local state None, cluster state started joining and cluster view {u'10.0.0.12': u'normal', u'10.0.2.8': u'joining, acknowledged change'}
21-11-2016 15:19:16.901 UTC INFO synchronization_fsm.py:49 (thread CassandraPlugin): Calling plugin method CassandraPlugin.on_startup
21-11-2016 15:19:16.903 UTC INFO common_etcd_synchronizer.py:194 (thread CassandraPlugin): Read value {"10.0.2.8": "joining, acknowledged change", "10.0.0.12": "normal"} from etcd, comparing to last value {"10.0.2.8": "joining, acknowledged change", "10.0.0.12": "normal"}
21-11-2016 15:19:16.903 UTC INFO common_etcd_synchronizer.py:197 (thread CassandraPlugin): Watching for changes with 22

OK, what are the various nodes in your deployment (10.0.2.8, 10.0.0.12, 10.0.1.14, 10.0.1.11)?

(I ask because the first logs show the local IP being 10.0.1.14, with a single entry in the Cassandra cluster, also 10.0.1.14. The second logs show the local IP being 10.0.1.11, with two other Cassandra nodes, 10.0.2.8 and 10.0.0.12.)

Very odd. There was just a single node, 10.0.1.14. I don't know what those other two nodes are. It's quite possible that 10.0.2.8 is a virtual IP that resolves to the same node through a load balancer (which would be a misconfiguration). It's also not impossible that I have fouled up and mixed up logs from two instantiations. Easiest thing to do is for me to nuke and rebuild the deployment and see what it looks like then.

I have a theory about this. I can repro all of the symptoms by starting the cassandra node without etcd running. That might well happen in production (etcd could come up after cassandra).
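
A quick way to check that theory next time it happens would be to query etcd from inside the Cassandra container, using whichever etcd address the container is configured with (it was http://10.0.1.14:4000 in the first logs, http://etcd:4001 later):

# Is etcd up and reachable from inside the Cassandra container?
etcdctl -C http://etcd:4001 cluster-health

# And does the clustering key the cluster manager watches exist yet?
etcdctl -C http://etcd:4001 get /clearwater/node_type_cassandra/clustering/cassandra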

However, the behaviour when etcd does come up is not consistent with what I saw before. On the original occasion, restarting etcd left the cluster manager broken, and so the Cassandra node never recovered. Today, everything just recovered within a minute or two (maybe faster; I wasn't timing carefully).

My workaround will be just to detect when Cassandra fails to start and nuke the Cassandra container when it does, i.e. add further liveness checks. I'm not sure where that then leaves the bug. There's clearly some edge condition where the cluster manager can fail to recover cleanly from an error, but it isn't consistent (I reproduced it all day yesterday, but not today, and last week I had a good day and a bad day).
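
As a sketch of the kind of liveness check I mean (assuming nodetool is installed and on the PATH in the container): nodetool exits non-zero when it can't reach the local Cassandra process, so the container would get restarted rather than supervisord respawning cassandra forever.

# Candidate liveness command for the Cassandra container; nodetool info
# exits non-zero if the local Cassandra is not up and reachable over JMX.
nodetool info > /dev/null 2>&1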

I have now reproduced this again, in a way that seems not to let it recover. Cassandra failed and restarted. After doing so, my Cassandra container has IP address 10.0.0.11. However:

root@cassandra-4023849174-vxcbo:/var/log/clearwater-cluster-manager# etcdctl -C http://etcd:4001 get /clearwater/node_type_cassandra/clustering/cassandra
{"10.0.0.9": "joining, acknowledged change"}

That suggests that there is a bad value in etcd for the cassandra IP. It could well be that the first time the Cassandra node came up, it wrote a value to etcd before it went down, and the second time it failed to overwrite that value. Is that plausible? Is there somewhere that documents these interactions between components?

When your Cassandra node restarts, does it get a different IP address? If it does, then from the cluster manager's POV it's an entirely different node, and it can't overwrite the existing value. You'll need to remove the old IP address before the cluster manager will continue (e.g. http://clearwater.readthedocs.io/en/latest/Handling_Failed_Nodes.html#removing-a-failed-node).

OK - thanks. I think that makes sense; sometimes when you recreate a pod you get the same IP, but there are no guarantees. I can work round that.
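
The workaround I have in mind for the IP change, assuming it is safe to start from scratch on a single-node deployment with no data worth preserving, is to clear the stale clustering entry when rebuilding, so the cluster manager sees an empty value rather than a dead IP:

# Remove the stale Cassandra clustering entry before the new pod starts.
# Only sensible when the old node is gone for good and there is nothing to keep;
# otherwise follow the failed-node procedure linked above.
etcdctl -C http://etcd:4001 rm /clearwater/node_type_cassandra/clustering/cassandra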

One last stupid question. Why do we have a cassandra image in the first place? Why don't we just deploy a standard off the shelf cassandra image? Is it purely that the first Cassandra to come up has to do some provisioning of data into the Cassandra database?

I'm not sure this issue has any purpose any more. The net is that the behaviour is a bit complex and I'm still a little confused, but keeping an issue open doesn't really help with that, and there's no good reason to think there is a bug. Closing.