cassandra start failed
Opened this issue · 14 comments
When I deploy cassandra deployment, it gives the error as follows:
ERROR [main] 2017-10-20 02:43:26,773 DatabaseDescriptor.java:141 - Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Expecting URI in variable: [cassandra.config]. Please prefix the file with file:/// for local files or file://<server>/ for remote files. Aborting. If you are executing this from an external tool, it needs to set Config.setClientMode(true) to avoid loading configuration.
at org.apache.cassandra.config.YamlConfigurationLoader.getStorageConfigURL(YamlConfigurationLoader.java:73) ~[apache-ca ssandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(YamlConfigurationLoader.java:84) ~[apache-cassandra-2 .1.15.jar:2.1.15]
at org.apache.cassandra.config.DatabaseDescriptor.loadConfig(DatabaseDescriptor.java:161) ~[apache-cassandra-2.1.15.jar :2.1.15]
at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:136) ~[apache-cassandra-2.1.15.jar:2 .1.15]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:170) [apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:566) [apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:655) [apache-cassandra-2.1.15.jar:2.1.15]
Then the pod logs are as follow:
2017-10-20 10:45:36,661 CRIT reaped unknown pid 5815)
2017-10-20 10:45:36,997 CRIT reaped unknown pid 5816)
2017-10-20 10:45:37,848 INFO success: cassandra entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-20 10:45:37,849 CRIT reaped unknown pid 5860)
2017-10-20 10:45:38,668 INFO exited: cassandra (exit status 232; not expected)
2017-10-20 10:45:39,671 INFO spawned: 'cassandra' with pid 5876
2017-10-20 10:45:39,701 CRIT reaped unknown pid 5892)
2017-10-20 10:45:40,023 CRIT reaped unknown pid 5893)
2017-10-20 10:45:40,675 INFO success: cassandra entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-20 10:45:40,676 CRIT reaped unknown pid 5942)
2017-10-20 10:45:41,711 INFO exited: cassandra (exit status 232; not expected)
2017-10-20 10:45:42,715 INFO spawned: 'cassandra' with pid 5958
2017-10-20 10:45:42,744 CRIT reaped unknown pid 5974)
2017-10-20 10:45:43,084 CRIT reaped unknown pid 5975)
2017-10-20 10:45:43,989 INFO success: cassandra entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-20 10:45:43,989 CRIT reaped unknown pid 6019)
2017-10-20 10:45:44,754 INFO exited: cassandra (exit status 232; not expected)
2017-10-20 10:45:45,759 INFO spawned: 'cassandra' with pid 6035
2017-10-20 10:45:45,804 CRIT reaped unknown pid 6051)
At the same time there is no file : /etc/cassandra/cassandra.yaml
Symptoms
So how can I solve the problem, thanks!
Impact
The kubernetes gives the following error:
Readiness probe failed: Reporting config ADDRESS=192.168.9.16 PORTS=7000 9042 9160 TIMEOUT=5 Testing port 7000
Release and environment
Steps to reproduce
Hi there @fogongzi .
I've had a bit of a look at this issue, and at #80. I think the two are likely symptoms of the same underlying error, and i think it's probably that you've got an issue with your etcd pods. Just a heads up: we have a mailinglist for Project Clearwater where you can reach out to the whole community. You may see a quicker response over there, and its good to keep the community up; they may well have hit the same issues before.
Can you confirm that you've got your etcd cluster up and running healthily? Are you following our set up steps from the README? If not, can you give us some more detail on the steps you're following, and how you're deploying all of this?
If your etcd cluster is up and healthy, we can try to work out why your other pods are unable to reach it. If you're on one of your cassandra pods, try taking a look in /var/log/clearwater-cluster-manager/ to see if that has any more informative logs. Similarly in the other issue you've raised, it looks like the configuration files are not being correctly generated, so look in the same logs.
As background here, we have a number of configuration files that are auto-generated based on information that is stored in the etcd cluster. The clearwater cluster manager is the process responsible for getting this information from etcd, and correctly building the configuration up. If your etcd cluster is unhealthy, or unreachable, the other processes wont be able to get the configuration they need to boot up successfully, as you're seeing here.
@johadalin , thank you very much for your reply.
The /var/log/clearwater-cluster-manager/cluster-manager.output.log gives no logs. The /var/log/clearwater-cluster-manager/cluster-manager_20171027T080000Z.txt gives logs as follows:
27-10-2017 00:30:36.406 UTC INFO utils.py:310 (thread MainThread): Acquired exclusive lock on /var/run/clearwater-cluster-manager.pid.lockfile
27-10-2017 00:30:36.406 UTC INFO plugin_loader.py:26 (thread MainThread): Inspecting cassandra_plugin.py
27-10-2017 00:30:36.466 UTC INFO plugin_loader.py:34 (thread MainThread): Loading cassandra_plugin.py
27-10-2017 00:30:36.466 UTC INFO plugin_loader.py:36 (thread MainThread): Loaded cassandra_plugin.py successfully
27-10-2017 00:30:36.467 UTC INFO main.py:180 (thread MainThread): Loaded plugin <cassandra_plugin.CassandraPlugin object at 0x7f0b74d585d0>
27-10-2017 00:30:36.469 UTC INFO main.py:189 (thread MainThread): Started thread for plugin <cassandra_plugin.CassandraPlugin object at 0x7f0b74d585d0>
27-10-2017 00:30:36.473 UTC INFO common_etcd_synchronizer.py:301 (thread CassandraPlugin): Read value {"192.168.10.21": "joining, acknowledged change", "192.168.0.9": "joining, acknowledged change", "192.168.4.15": "joining, acknowledged change"} from etcd, comparing to last value None
27-10-2017 00:30:36.473 UTC INFO etcd_synchronizer.py:47 (thread CassandraPlugin): Got new state {"192.168.10.21": "joining, acknowledged change", "192.168.0.9": "joining, acknowledged change", "192.168.4.15": "joining, acknowledged change"} from etcd
27-10-2017 00:30:36.473 UTC INFO synchronization_fsm.py:108 (thread CassandraPlugin): Entered state machine for 192.168.0.26 with local state None, cluster state joining, config changing and cluster view {u'192.168.10.21': u'joining, acknowledged change', u'192.168.0.9': u'joining, acknowledged change', u'192.168.4.15': u'joining, acknowledged change'}
27-10-2017 00:30:36.473 UTC INFO synchronization_fsm.py:24 (thread CassandraPlugin): Calling plugin method CassandraPlugin.on_startup
27-10-2017 00:30:36.481 UTC INFO common_etcd_synchronizer.py:301 (thread CassandraPlugin): Read value {"192.168.10.21": "joining, acknowledged change", "192.168.0.9": "joining, acknowledged change", "192.168.4.15": "joining, acknowledged change"} from etcd, comparing to last value {"192.168.10.21": "joining, acknowledged change", "192.168.0.9": "joining, acknowledged change", "192.168.4.15": "joining, acknowledged change"}
27-10-2017 00:30:36.481 UTC INFO common_etcd_synchronizer.py:304 (thread CassandraPlugin): Watching for changes with 32
At the same time the etcd service seems to be good and the logs are as follows:
2017-10-27 04:42:09.186533 I | etcdserver: saved snapshot at index 650065
2017-10-27 04:42:09.192831 I | etcdserver: compacted raft log at 645065
2017-10-27 04:42:35.581166 I | fileutil: purged file etcd-6dc95895-hqchb.etcd/member/snap/0000000000000002-00000000000927fc.snap successfully
2017-10-27 06:05:13.680968 I | etcdserver: start to snapshot (applied: 660066, lastsnap: 650065)
2017-10-27 06:05:13.683617 I | etcdserver: saved snapshot at index 660066
2017-10-27 06:05:13.686896 I | etcdserver: compacted raft log at 655066
2017-10-27 06:05:35.663023 I | fileutil: purged file etcd-6dc95895-hqchb.etcd/member/snap/0000000000000002-0000000000094f0d.snap successfully
2017-10-27 07:28:18.180113 I | etcdserver: start to snapshot (applied: 670067, lastsnap: 660066)
2017-10-27 07:28:18.182526 I | etcdserver: saved snapshot at index 670067
2017-10-27 07:28:18.182834 I | etcdserver: compacted raft log at 665067
2017-10-27 07:28:35.748722 I | fileutil: purged file etcd-6dc95895-hqchb.etcd/member/snap/0000000000000002-000000000009761e.snap
So what is the problem, I am stucked for few days and can't go on. Thank you very much. I am forwarding to your reply!
@fogongzi
Is that the end of the cluster manager log?
Can you give us a bit more detail on how you're deploying this, and what version of Kubernetes you're deploying on?
Have you made any modifications to the deployment files at all?
I've tried a number of things on my side, and can't get these symptoms to replicate. It might be worth trying a fresh re-deploy, making sure you're using the latest code, and the latest images (rebuilding with the no cached flag set).
If you attach as much in the way of logs and diagnostics as you can, we should hopefully get to the bottom of it as soon as possible
@johadalin ,yes, It is the end of the cluster log. My environment is as follows:
kubernetes: 1.8.0
os: Ubuntu 17.04
Docker: 17.05.0-ce
clearwater-docker: clearwater-docker-release-128
All the deployment and service are located in namespace: clearwater.
Firstly I create configmap:
kubectl create configmap env-vars --from-literal=ZONE=clearwater.svc.cluster.local -n clearwater
Then the astaire-depl.yaml is as follows:
kind: Deployment
metadata:
name: astaire
namespace: clearwater
spec:
replicas: 3
template:
metadata:
labels:
service: astaire
spec:
containers:
- image: "10.71.156.242/clearwater/astaire:latest"
name: astaire
ports:
- containerPort: 22
- containerPort: 11211
- containerPort: 11311
envFrom:
- configMapRef:
name: env-vars
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
livenessProbe:
tcpSocket:
port: 11311
initialDelaySeconds: 30
readinessProbe:
tcpSocket:
port: 11311
volumeMounts:
- name: astairelogs
mountPath: /var/log/astaire
- name: tz-config
mountPath: /etc/localtime
- image: 10.71.156.242/clearwater/busybox:latest
name: tailer
command: [ "tail", "-F", "/var/log/astaire/astaire_current.txt" ]
volumeMounts:
- name: astairelogs
mountPath: /var/log/astaire
- name: tz-config
mountPath: /etc/localtime
volumes:
- name: astairelogs
emptyDir: {}
- name: tz-config
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
restartPolicy: Always
The cassandra-depl.yaml is as follows:
kind: Deployment
metadata:
name: cassandra
namespace: clearwater
spec:
replicas: 3
template:
metadata:
labels:
service: cassandra
spec:
containers:
- image: "10.71.156.242/clearwater/cassandra:0.2"
name: cassandra
ports:
- containerPort: 22
- containerPort: 7001
- containerPort: 9042
- containerPort: 9160
envFrom:
- configMapRef:
name: env-vars
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
livenessProbe:
exec:
command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", "7000 9042 9160"]
# Cassandra can take a very, very long time to start up
initialDelaySeconds: 600
readinessProbe:
exec:
command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", ""] #7000 9042 9160
volumeMounts:
- name: tz-config
mountPath: /etc/localtime
volumes:
- name: tz-config
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
restartPolicy: Always
The etcd-depl.yaml is as follows:
kind: Deployment
metadata:
name: etcd
namespace: clearwater
spec:
replicas: 1
template:
metadata:
creationTimestamp: null
labels:
instance-type: etcd-pod
spec:
containers:
- args:
- --name
- $(MY_POD_NAME)
- --advertise-client-urls
- http://$(MY_POD_IP):2379,http://$(MY_POD_IP):4001
- --listen-client-urls
- http://0.0.0.0:2379,http://0.0.0.0:4001
- --initial-advertise-peer-urls
- http://$(MY_POD_IP):2380
- --listen-peer-urls
- http://0.0.0.0:2380
# By default use a single pod cluster
- --initial-cluster
- $(MY_POD_NAME)=http://$(MY_POD_IP):2380
# Alternatively multi-pod clusters can be supported Using central discvovery. Run e.g.
# curl https://discovery.etcd.io/new?size=3 | sed s/https/http/
# to get a discovery URL for a 3 pod cluster, substitute the returned value below, and
# set replicas: 3 above.
#- --discovery
#- <URL returned by command above>
- --initial-cluster-state
- new
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
image: 10.71.156.242/clearwater/etcd:v2.2.5
name: etcd
ports:
- containerPort: 2379
- containerPort: 2380
- containerPort: 4001
livenessProbe:
tcpSocket:
port: 4001
initialDelaySeconds: 300
readinessProbe:
tcpSocket:
port: 4001
volumeMounts:
- name: tz-config
mountPath: /etc/localtime
volumes:
- name: tz-config
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
restartPolicy: Always
The homestead-prov-depl.yaml is as follows:
kind: Deployment
metadata:
name: homestead-prov
namespace: clearwater
spec:
replicas: 1
selector:
matchLabels:
service: homestead-prov
template:
metadata:
labels:
service: homestead-prov
snmp: enabled
spec:
containers:
- image: "10.71.156.242/clearwater/homestead-prov:latest"
name: homestead-prov
ports:
- containerPort: 22
- containerPort: 8889
envFrom:
- configMapRef:
name: env-vars
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
livenessProbe:
exec:
command: ["/bin/bash", "/usr/share/clearwater/bin/poll_homestead-prov.sh"]
initialDelaySeconds: 60
readinessProbe:
exec:
command: ["/bin/bash", "/usr/share/clearwater/bin/poll_homestead-prov.sh"]
volumeMounts:
- name: tz-config
mountPath: /etc/localtime
volumes:
- name: tz-config
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
restartPolicy: Always
The homestead-depl.yaml is as follows:
kind: Deployment
metadata:
name: homestead
namespace: clearwater
spec:
replicas: 1
selector:
matchLabels:
service: homestead
template:
metadata:
labels:
service: homestead
snmp: enabled
spec:
containers:
- image: "10.71.156.242/clearwater/homestead:latest"
name: homestead
ports:
- containerPort: 22
- containerPort: 8888
envFrom:
- configMapRef:
name: env-vars
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
livenessProbe:
exec:
command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", "8888"]
initialDelaySeconds: 60
readinessProbe:
exec:
command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", "8888"]
volumeMounts:
- name: homesteadlogs
mountPath: /var/log/homestead
- name: tz-config
mountPath: /etc/localtime
- image: 10.71.156.242/clearwater/busybox:latest
name: tailer
command: [ "tail", "-F", "/var/log/homestead/homestead_current.txt" ]
volumeMounts:
- name: homesteadlogs
mountPath: /var/log/homestead
- name: tz-config
mountPath: /etc/localtime
volumes:
- name: homesteadlogs
emptyDir: {}
- name: tz-config
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
restartPolicy: Always
When I build the docker images, it gives a lot of errors such as:
* Restarting clearwater-infrastructure clearwater-infrastructure
mv: cannot move ‘/tmp/hosts.3013’ to ‘/etc/hosts’: Device or resource busy
* Restarting DNS forwarder and DHCP server dnsmasq
dnsmasq: setting capabilities failed: Operation not permitted
...fail!
hostname: you must be root to change the host name
mv: cannot move ‘/tmp/hosts.3046’ to ‘/etc/hosts’: Device or resource busy
ERROR: unable to move /tmp/hosts.3046 to /etc/hosts
...done.
Setting up homestead-cassandra (1.0-170911.101748) ...
* Restarting clearwater-infrastructure clearwater-infrastructure
mv: cannot move ‘/tmp/hosts.3069’ to ‘/etc/hosts’: Device or resource busy
* Restarting DNS forwarder and DHCP server dnsmasq
dnsmasq: setting capabilities failed: Operation not permitted
...fail!
hostname: you must be root to change the host name
mv: cannot move ‘/tmp/hosts.3102’ to ‘/etc/hosts’: Device or resource busy
ERROR: unable to move /tmp/hosts.3102 to /etc/hosts
...done.
Setting up homestead-prov-cassandra (1.0-170911.105444) ...
* Restarting clearwater-infrastructure clearwater-infrastructure
mv: cannot move ‘/tmp/hosts.3125’ to ‘/etc/hosts’: Device or resource busy
* Restarting DNS forwarder and DHCP server dnsmasq
dnsmasq: setting capabilities failed: Operation not permitted
...fail!
hostname: you must be root to change the host name
mv: cannot move ‘/tmp/hosts.3158’ to ‘/etc/hosts’: Device or resource busy
ERROR: unable to move /tmp/hosts.3158 to /etc/hosts
...done.
But it can build succesfully, I can not make sure it is the root cause. Thanks very much, I am forward to your reply!
@fogongzi thanks for providing all that information. Everything looks mostly as i expect it to. I don't believe the errors during the image build are an issue; we should re-vamp the install process to be more container friendly, but i have seen similar and my images are working when deployed.
The only significant difference i can see to what i've run is that you're running Kubernetes 1.80, where my rig is 1.7. I've had a look at the changelog and can't see anything i think would cause the issues you're seeing though.
I think a reasonable next step here would be to re-deploy with 1 of each of Astaire and Cassandra, rather than 3, to see if there is something inherently wrong with the container deployment, or if it's more down to clustering issues.
If you can try the following, and let us know if you have any more success:
- copy your yaml files off to a backup location
- clean your checkout of clearwater-docker and re-generate the yaml files, using the
k8s-gencfg
script - change the number of replicas for astaire and cassandra down to 1
- re-deploy using the
kubectl apply -f kubernetes/
command
Hopefully the pods will all come up, and reach a healthy state, but if not, we should take a look in the same logs as above. What we would want to find is some reason why the nodes are not progressing to a fully joined state in the cluster, at which point we would create the configuration files for the actual processes to start up with.
@johadalin thank you for your reply. I change the number of replicas for astaire and cassandra down to 1 and the cassandra gives the following error logs:
cassandra pod gives error logs: Readiness probe failed: Reporting config ADDRESS=192.168.9.43 PORTS=7000 9042 9160 TIMEOUT=5 Testing port 7000
The cassandra's real error logs are as follows (the error logs seems to generate periodically):
INFO [main] 2017-11-03 08:29:12,320 StorageService.java:974 - Using saved tokens [-1002911311179903270, -1239551314919162425, -1396818904255654555, -1417491954110581071, -1542839607472176293, -1610323995369370126, -1612687612080317804, -1627458093511093157, -1628155921005580752, -1707762623631794992, -1718692680057530928, -1859984255872059717, -1895430511075987888, -203340042280131667, -2076927207440279742, -2109935103591351908, -2132001493581103466, -2160685078639584252, -2170759777099907938, -2183660516190750443, -2213658775407507524, -2423388363331041241, -2478841917295130471, -2485336344640396931, -2508572118671197085, -2535638465691150229, -2613120280597413176, -2629929443739442465, -2634080364452656214, -2676322277960435401, -2690000933565104954, -269208829073768877, -2706621544282571939, -2721973837787782384, -2791132693031042987, -2822295096383768937, -2847659533597577636, -2866883482785265870, -2921049962562903950, -3014186123279831556, -3034507308967578716, -305523296939239758, -3251612140250166982, -3533258969904091785, -3552636323995173565, -3598266835606402966, -3609857190624555873, -363597706630921723, -3769950090272629908, -3784308204697015644, -3972888869395517255, -3975766817178930694, -4016901592732589236, -4098596438972835726, -4099608180375073585, -4125086501429938002, -4199942863838017639, -4219779984816459503, -4273636457686598905, -4328378568307846686, -4340905938891227789, -4403468943675707345, -4422758874374067808, -4432660525160703432, -4499166876418269663, -4537066685417185724, -4615988357639509833, -466948510156377065, -4683021051278446562, -4784055117485603818, -4787201785662169025, -4953732349541074868, -4956326965134434028, -5011284076400703508, -5032916713653567939, -5096642761759838270, -51534640691121022, -5193433577721336730, -5327388188286466470, -5348493802979382490, -5424611097417779520, -5518535020434321226, -5545826293674745680, -5565395759734286901, -56571577194227796, -5812396733947428826, -5861679617741867872, -5864206798218477757, -5896945651941884525, -6256878053245852074, -6343665889522855970, -6359020927275779337, -6377154647784798838, -6428417484642262621, -6459936780844056781, -6485727026158445151, -65182003029243325, -6560725974396524988, -657776629542333610, -6700250026856631948, -6752428622574810653, -6761980090851345102, -6778943527574650426, -6977544012587241801, -7083377094087347425, -7169938561248967154, -7190830651726583696, -7209961479162328381, -7319616175063187634, -7530117145182770909, -7559392950860586726, -7659622417469445474, -7683952674701911748, -7691562084975445233, -7700326306038256941, -7706101555087830917, -7706315209602325363, -7860805356102382134, -7930697606760385969, -8009056631490472022, -8056639770277017381, -8092899444193601941, -8130040016946505459, -8250304849411610165, -8294569071325869291, -8378900993252796135, -8445881912836541826, -8493986330344581663, -8523761666788099367, -856150730904680761, -8641789529340216384, -8704078546832479970, -8729233576828407117, -8769343336205397456, -895514321405823665, -8965500698812846805, -9040797486144184292, -9125618245651914552, 1099671871962369123, 1159908670622073955, 1216767999119144151, 130067841805710918, 1308113952862478205, 1528731926570102875, 1577210426255407499, 1676008009932807124, 1689081752434220642, 1708635819981425373, 173523211397974546, 1739228855593782636, 1750078950364632070, 1819482699220102787, 1830147386273346461, 1974029411930926603, 2067232554112743508, 2104584763390478741, 2149949821351440436, 2166556824623624430, 2200783797699035058, 2404598106087632783, 245329773479374452, 2485171968492730620, 2591725859339795329, 2624714445819574529, 2822376956082748393, 293223226254836328, 2936478810222152952, 2991313597715514042, 3019095453529649965, 3041184265586620649, 3096599500137672701, 3153579508197940011, 3183068343704717713, 3348613576129576547, 3354791859463959187, 3367398607159211855, 3432913789572020546, 354936215299144726, 3558252878926543813, 3583619352757835745, 3599755697034808757, 3635251184905118070, 3678339180557554807, 3882028965999522994, 4023596519256514758, 4025872366558763959, 4129590726930147095, 4198097724359937884, 4326815123766586229, 4492444748648606740, 4768020125774890264, 4863523444340797971, 4941244055870701594, 5001817881332512763, 5031876722432162161, 5233719238185127308, 527404878789184423, 5392045649173700950, 5500362976233979370, 558164566830338551, 559069776993290759, 5696092500116769643, 577550101984314049, 581560445821839105, 5827422807675282965, 5863591965934423069, 5920891670689921114, 5923967540301502530, 5985061482159679566, 5989392561349717150, 5998060358345508537, 6083195680206276195, 6094962951747988985, 6102237344156522852, 6306895849291797553, 63215307337809276, 6335294786065322196, 6338087246076349737, 6461651850333637882, 6487679815124626191, 6711519485175039694, 6843347088743121391, 6883649379784861371, 7144731802063268256, 7145934895626887259, 7155819950222278210, 7190493049717319507, 7227475459149055373, 7297820006705720933, 7304906664669670502, 7332554399647218215, 7390875897149873276, 7464464462229898536, 7590542999344700048, 7725551583691827817, 7785714960381949918, 7889412068912076527, 7892714622904469126, 8009431018921565078, 8034640053816718961, 8054341190174618967, 8106743032491696373, 8114264874237763765, 8137718763394820169, 8185117072930234466, 8246530688398275866, 8453037989138682649, 8481218657289412448, 8634731572102959069, 8687134969523773682, 8698257334978983642, 879195382667033452, 8859110321938069158, 9034077468166818883, 9056187891440121322, 986180542946552204]
INFO [main] 2017-11-03 08:29:12,321 ColumnFamilyStore.java:905 - Enqueuing flush of local: 578 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:2] 2017-11-03 08:29:12,322 Memtable.java:347 - Writing Memtable-local@487054422(0.081KiB serialized bytes, 4 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:2] 2017-11-03 08:29:12,323 Memtable.java:382 - Completed flushing /var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-tmp-ka-72-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1509697749456, position=97792)
INFO [CompactionExecutor:1] 2017-11-03 08:29:12,375 CompactionTask.java:141 - Compacting [SSTableReader(path='/var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-72-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-69-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-71-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-70-Data.db')]
INFO [main] 2017-11-03 08:29:12,403 ColumnFamilyStore.java:905 - Enqueuing flush of local: 51487 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1] 2017-11-03 08:29:12,405 Memtable.java:347 - Writing Memtable-local@1435676097(8.373KiB serialized bytes, 259 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1] 2017-11-03 08:29:12,414 Memtable.java:382 - Completed flushing /var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-tmp-ka-74-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1509697749456, position=109878)
INFO [CompactionExecutor:1] 2017-11-03 08:29:12,479 CompactionTask.java:274 - Compacted 4 sstables to [/var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-73,]. 6,303 bytes to 5,712 (~90% of original) in 102ms = 0.053406MB/s. 4 total partitions merged to 1. Partition merge counts were {4:1, }
INFO [main] 2017-11-03 08:29:12,523 StorageService.java:1715 - Node /192.168.9.43 state jump to NORMAL
INFO [main] 2017-11-03 08:29:12,543 CassandraDaemon.java:613 - Waiting for gossip to settle before accepting client requests...
INFO [main] 2017-11-03 08:29:20,544 CassandraDaemon.java:645 - No gossip backlog; proceeding
INFO [main] 2017-11-03 08:29:20,625 Server.java:155 - Netty using native Epoll event loop
INFO [main] 2017-11-03 08:29:20,668 Server.java:193 - Using Netty Version: [netty-buffer=netty-buffer-4.0.23.Final.208198c, netty-codec=netty-codec-4.0.23.Final.208198c, netty-codec-http=netty-codec-http-4.0.23.Final.208198c, netty-codec-socks=netty-codec-socks-4.0.23.Final.208198c, netty-common=netty-common-4.0.23.Final.208198c, netty-handler=netty-handler-4.0.23.Final.208198c, netty-transport=netty-transport-4.0.23.Final.208198c, netty-transport-rxtx=netty-transport-rxtx-4.0.23.Final.208198c, netty-transport-sctp=netty-transport-sctp-4.0.23.Final.208198c, netty-transport-udt=netty-transport-udt-4.0.23.Final.208198c]
INFO [main] 2017-11-03 08:29:20,668 Server.java:194 - Starting listening for CQL clients on /0:0:0:0:0:0:0:0:9042...
INFO [main] 2017-11-03 08:29:20,746 ThriftServer.java:119 - Binding thrift service to /0:0:0:0:0:0:0:0:9160
ERROR [main] 2017-11-03 08:29:20,756 CassandraDaemon.java:583 - Exception encountered during startup
java.lang.RuntimeException: Unable to create thrift socket to /0:0:0:0:0:0:0:0:9160
at org.apache.cassandra.thrift.CustomTThreadPoolServer$Factory.buildTServer(CustomTThreadPoolServer.java:270) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.thrift.TServerCustomFactory.buildTServer(TServerCustomFactory.java:46) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.<init>(ThriftServer.java:131) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.thrift.ThriftServer.start(ThriftServer.java:58) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:502) [apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:579) [apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:655) [apache-cassandra-2.1.15.jar:2.1.15]
Caused by: org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address /0:0:0:0:0:0:0:0:9160.
at org.apache.cassandra.thrift.TCustomServerSocket.<init>(TCustomServerSocket.java:75) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.thrift.CustomTThreadPoolServer$Factory.buildTServer(CustomTThreadPoolServer.java:265) ~[apache-cassandra-2.1.15.jar:2.1.15]
... 6 common frames omitted
INFO [StorageServiceShutdownHook] 2017-11-03 08:29:20,772 Server.java:225 - Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2017-11-03 08:29:20,772 Gossiper.java:1454 - Announcing shutdown
INFO [StorageServiceShutdownHook] 2017-11-03 08:29:20,774 StorageService.java:1715 - Node /192.168.9.43 state jump to shutdown
The only good thing is that there exists file: cassandra.yaml and cassandra-rackdc.properties in /etc/cassandra/, however it still gives the logs.
The etcd logs seems to good and the logs are as follows:
2017-11-03 16:18:57.074761 I | etcdserver: snapshot count = 10000
2017-11-03 16:18:57.074777 I | etcdserver: advertise client URLs = http://192.168.9.42:2379,http://192.168.9.42:4001
2017-11-03 16:18:57.074813 I | etcdserver: initial advertise peer URLs = http://192.168.9.42:2380
2017-11-03 16:18:57.074834 I | etcdserver: initial cluster = etcd-6dc95895-9s9n4=http://192.168.9.42:2380
2017-11-03 16:18:57.095371 I | etcdserver: starting member 7a47d56d3db18ad0 in cluster ccca1ed2d279b3fe
2017-11-03 16:18:57.095658 I | raft: 7a47d56d3db18ad0 became follower at term 0
2017-11-03 16:18:57.095691 I | raft: newRaft 7a47d56d3db18ad0 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2017-11-03 16:18:57.095722 I | raft: 7a47d56d3db18ad0 became follower at term 1
2017-11-03 16:18:57.096123 I | etcdserver: starting server... [version: 2.2.5, cluster version: to_be_decided]
2017-11-03 16:18:57.097688 N | etcdserver: added local member 7a47d56d3db18ad0 [http://192.168.9.42:2380] to cluster ccca1ed2d279b3fe
2017-11-03 16:18:57.796256 I | raft: 7a47d56d3db18ad0 is starting a new election at term 1
2017-11-03 16:18:57.796428 I | raft: 7a47d56d3db18ad0 became candidate at term 2
2017-11-03 16:18:57.796438 I | raft: 7a47d56d3db18ad0 received vote from 7a47d56d3db18ad0 at term 2
2017-11-03 16:18:57.796533 I | raft: 7a47d56d3db18ad0 became leader at term 2
2017-11-03 16:18:57.796560 I | raft: raft.node: 7a47d56d3db18ad0 elected leader 7a47d56d3db18ad0 at term 2
2017-11-03 16:18:57.797533 I | etcdserver: published {Name:etcd-6dc95895-9s9n4 ClientURLs:[http://192.168.9.42:2379 http://192.168.9.42:4001]} to cluster ccca1ed2d279b3fe
2017-11-03 16:18:57.797594 I | etcdserver: setting up the initial cluster version to 2.2
2017-11-03 16:18:57.799617 N | etcdserver: set the initial cluster version to 2.2
The Astaire pod seems good and the logs are as follows:
03-11-2017 08:33:35.151 UTC Status proxy_server.cpp:171: Accepted socket from [::ffff:192.168.10.1]:43176
03-11-2017 08:33:35.151 UTC Status proxy_server.cpp:208: Starting connection thread for [::ffff:192.168.10.1]:43176
03-11-2017 08:33:35.152 UTC Status proxy_server.cpp:289: Client [::ffff:192.168.10.1]:43176 has disconnected
03-11-2017 08:33:35.642 UTC Status proxy_server.cpp:171: Accepted socket from [::ffff:192.168.10.1]:44712
03-11-2017 08:33:35.642 UTC Status proxy_server.cpp:208: Starting connection thread for [::ffff:192.168.10.1]:44712
03-11-2017 08:33:35.642 UTC Status proxy_server.cpp:289: Client [::ffff:192.168.10.1]:44712 has disconnected
03-11-2017 08:33:45.151 UTC Status proxy_server.cpp:171: Accepted socket from [::ffff:192.168.10.1]:49832
03-11-2017 08:33:45.152 UTC Status proxy_server.cpp:208: Starting connection thread for [::ffff:192.168.10.1]:49832
03-11-2017 08:33:45.152 UTC Status proxy_server.cpp:289: Client [::ffff:192.168.10.1]:49832 has disconnected
03-11-2017 08:33:45.642 UTC Status proxy_server.cpp:171: Accepted socket from [::ffff:192.168.10.1]:51368
03-11-2017 08:33:45.642 UTC Status proxy_server.cpp:208: Starting connection thread for [::ffff:192.168.10.1]:51368
03-11-2017 08:33:45.642 UTC Status proxy_server.cpp:289: Client [::ffff:192.168.10.1]:51368 has disconnected
03-11-2017 08:33:55.151 UTC Status proxy_server.cpp:171: Accepted socket from [::ffff:192.168.10.1]:54952
03-11-2017 08:33:55.152 UTC Status proxy_server.cpp:208: Starting connection thread for [::ffff:192.168.10.1]:54952
03-11-2017 08:33:55.152 UTC Status proxy_server.cpp:289: Client [::ffff:192.168.10.1]:54952 has disconnected
03-11-2017 08:33:55.642 UTC Status proxy_server.cpp:171: Accepted socket from [::ffff:192.168.10.1]:56488
03-11-2017 08:33:55.642 UTC Status proxy_server.cpp:208: Starting connection thread for [::ffff:192.168.10.1]:56488
03-11-2017 08:33:55.642 UTC Status proxy_server.cpp:289: Client [::ffff:192.168.10.1]:56488 has disconnected
The deployment yaml files are as follows:
Then the astaire-depl.yaml is as follows:
kind: Deployment
metadata:
name: astaire
namespace: clearwater
spec:
replicas: 1
template:
metadata:
labels:
service: astaire
spec:
containers:
- image: "10.71.156.242/clearwater/astaire:latest"
name: astaire
ports:
- containerPort: 22
- containerPort: 11211
- containerPort: 11311
envFrom:
- configMapRef:
name: env-vars
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
livenessProbe:
tcpSocket:
port: 11311
initialDelaySeconds: 30
readinessProbe:
tcpSocket:
port: 11311
volumeMounts:
- name: astairelogs
mountPath: /var/log/astaire
- name: tz-config
mountPath: /etc/localtime
- image: 10.71.156.242/clearwater/busybox:latest
name: tailer
command: [ "tail", "-F", "/var/log/astaire/astaire_current.txt" ]
volumeMounts:
- name: astairelogs
mountPath: /var/log/astaire
- name: tz-config
mountPath: /etc/localtime
volumes:
- name: astairelogs
emptyDir: {}
- name: tz-config
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
restartPolicy: Always
The cassandra-depl.yaml is as follows:
kind: Deployment
metadata:
name: cassandra
namespace: clearwater
spec:
replicas: 1
template:
metadata:
labels:
service: cassandra
spec:
containers:
- image: "10.71.156.242/clearwater/cassandra:0.2"
name: cassandra
ports:
- containerPort: 22
- containerPort: 7001
- containerPort: 9042
- containerPort: 9160
envFrom:
- configMapRef:
name: env-vars
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
livenessProbe:
exec:
command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", "7000 9042 9160"]
initialDelaySeconds: 600
readinessProbe:
exec:
command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", ""] #7000 9042 9160
volumeMounts:
- name: tz-config
mountPath: /etc/localtime
volumes:
- name: tz-config
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
restartPolicy: Always
The etcd-depl.yaml is as follows:
kind: Deployment
metadata:
name: etcd
namespace: clearwater
spec:
replicas: 1
template:
metadata:
creationTimestamp: null
labels:
instance-type: etcd-pod
spec:
containers:
- args:
- --name
- $(MY_POD_NAME)
- --advertise-client-urls
- http://$(MY_POD_IP):2379,http://$(MY_POD_IP):4001
- --listen-client-urls
- http://0.0.0.0:2379,http://0.0.0.0:4001
- --initial-advertise-peer-urls
- http://$(MY_POD_IP):2380
- --listen-peer-urls
- http://0.0.0.0:2380
# By default use a single pod cluster
- --initial-cluster
- $(MY_POD_NAME)=http://$(MY_POD_IP):2380
## Alternatively multi-pod clusters can be supported Using central discvovery. Run e.g.
## curl https://discovery.etcd.io/new?size=3 | sed s/https/http/
## to get a discovery URL for a 3 pod cluster, substitute the returned value below, and
## set replicas: 3 above.
##- --discovery
##-
- --initial-cluster-state
- new
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
image: 10.71.156.242/clearwater/etcd:v2.2.5
name: etcd
ports:
- containerPort: 2379
- containerPort: 2380
- containerPort: 4001
livenessProbe:
tcpSocket:
port: 4001
initialDelaySeconds: 300
readinessProbe:
tcpSocket:
port: 4001
volumeMounts:
- name: tz-config
mountPath: /etc/localtime
volumes:
- name: tz-config
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
restartPolicy: Always
Therefore, what is the root cause and how can I do. Thanks very much, I am forward to your reply!
I've had another look over all the diags you've provided, and think i have seen something new regarding the first issue you saw. (I've also added markdown backticks ` around the code/log sections so that it's a bit clearer. Github was parsing the # comments in your depl files as heading tags)
First though, I think the current issue you're seeing with cassandra is down to an issue we've seen with network configuration on the Kubernetes hosts; specifically, cassandra is attempting to bind to the ipv6 loopback interface in the container, but i am guessing it isn't there. The line java.lang.RuntimeException: Unable to create thrift socket to /0:0:0:0:0:0:0:0:9160
shows this.
I'd be interested to see what the interface within the container looks like. If you can run kubectl exec <cassandra pod name> -it /bin/bash
, and then grab the output of ifconfig
. For comparison, on one of my cassandra pods, i get:
eth0 Link encap:Ethernet HWaddr 0a:58:0a:e6:11:ec
inet addr:10.230.17.236 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::dc9c:c3ff:fec5:947a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:96 errors:0 dropped:0 overruns:0 frame:0
TX packets:65 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:10925 (10.9 KB) TX bytes:5607 (5.6 KB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:31 errors:0 dropped:0 overruns:0 frame:0
TX packets:31 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:3911 (3.9 KB) TX bytes:3911 (3.9 KB)
We believe this may be resolvable by changing some system config on the host. Particularly, we think ipv6 needs to be enabled, with the sysctl option net.ipv6.conf.all.disable_ipv6
being set to 0. Check your /etc/sysctl.conf
file, and make sure this is the case.
Regarding the issue you were having with 3 replicas of each node, the telling line in the cluster manager logs from your cassandra node is: 27-10-2017 00:30:36.473 UTC INFO synchronization_fsm.py:108 (thread CassandraPlugin): Entered state machine for 192.168.0.26 with local state None, cluster state joining, config changing and cluster view {u'192.168.10.21': u'joining, acknowledged change', u'192.168.0.9': u'joining, acknowledged change', u'192.168.4.15': u'joining, acknowledged change'}
Your depl files specified three replicas for cassandra, but from this log we can see that there are four IPs involved in the clustering. I imagine what has happened is the one of the cassandra pods died and was restarted during the clustering process, and the new pod you took the logs from came up in its place. However, because the clustering state machine was mid-way through (all nodes in a joining, acknowledged change
state), we are unable to add anything new to the cluster; and because one of the joining nodes no longer exists, the current clustering operation can never complete.
To resolve this, i'd like to check a couple of things:
- Can you make sure you are completely removing all pods completely before creating a new deployment. Running
kubectl delete -f kubernetes/
in clearwater docker should work, but make sure all pods are gone before creating new ones - What command are you using to create a new deployment? If you are hitting clustering issues when using
kubectl apply -f kubernetes/
, it may be a race condition specific to the conditions on your kubernetes cluster, where nodes come up too slowly or something like that.
Once you have resolved the ipv6 issue, you can go back to attempting to deploy 3 of each of Cassandra and Astaire. If you still hit the clustering issue, i would recommend creating the etcd cluster first, with the etcd depl and svc files, and then once that is up and healthy, creating the cassandra and astaire clusters one at a time. Then you can apply the rest of the depl and svc files to get the full deployment up.
@johadalin thank you very much for your reply. The cassandra pod ifconfig command gives :
eth0 Link encap:Ethernet HWaddr 0a:58:c0:a8:09:2b
inet addr:192.168.9.43 Bcast:0.0.0.0 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:52833 errors:0 dropped:0 overruns:0 frame:0
TX packets:69152 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:6104499 (6.1 MB) TX bytes:5446477 (5.4 MB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:5772011 errors:0 dropped:0 overruns:0 frame:0
TX packets:5772011 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:868048054 (868.0 MB) TX bytes:868048054 (868.0 MB)
It seems to be the same as yours.
Next I add net.ipv6.conf.all.disable_ipv6=o to all the kubernetes worker nodes in /etc/sysctl.conf and then execute sysctl -p. I delete all the services and deployment in the clearwater namespace. Then firstly I create deployment etcd and it runs well. Secondly I create deployment astaire and it runs well. Thirdly I runs cassandra, it also gives the following error:
, 811732695232594116, 8503097101771197545, 8513530876362087059, 863552059906684172, 8684728804467211930, 8836300250236889749, 8872983070166642449, 9021528787917249437, 9038930397295201331, 9057969618895260764, 914248114040531751, 945138989064379977, 959995504895334597]
INFO [main] 2017-11-06 02:01:27,035 ColumnFamilyStore.java:905 - Enqueuing flush of local: 578 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:2] 2017-11-06 02:01:27,036 Memtable.java:347 - Writing Memtable-local@1272396707(0.081KiB serialized bytes, 4 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:2] 2017-11-06 02:01:27,037 Memtable.java:382 - Completed flushing /var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-tmp-ka-78-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1509933684694, position=94139)
INFO [main] 2017-11-06 02:01:27,079 ColumnFamilyStore.java:905 - Enqueuing flush of local: 51458 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1] 2017-11-06 02:01:27,080 Memtable.java:347 - Writing Memtable-local@342594865(8.345KiB serialized bytes, 259 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1] 2017-11-06 02:01:27,086 Memtable.java:382 - Completed flushing /var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-tmp-ka-79-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1509933684694, position=106108)
INFO [main] 2017-11-06 02:01:27,179 StorageService.java:1715 - Node /192.168.9.46 state jump to NORMAL
INFO [main] 2017-11-06 02:01:27,228 CassandraDaemon.java:613 - Waiting for gossip to settle before accepting client requests...
INFO [ScheduledTasks:1] 2017-11-06 02:01:27,445 TokenMetadata.java:433 - Updating topology for all endpoints that have changed
INFO [main] 2017-11-06 02:01:35,229 CassandraDaemon.java:645 - No gossip backlog; proceeding
INFO [main] 2017-11-06 02:01:35,364 Server.java:155 - Netty using native Epoll event loop
INFO [main] 2017-11-06 02:01:35,414 Server.java:193 - Using Netty Version: [netty-buffer=netty-buffer-4.0.23.Final.208198c, netty-codec=netty-codec-4.0.23.Final.208198c, netty-codec-http=netty-codec-http-4.0.23.Final.208198c, netty-codec-socks=netty-codec-socks-4.0.23.Final.208198c, netty-common=netty-common-4.0.23.Final.208198c, netty-handler=netty-handler-4.0.23.Final.208198c, netty-transport=netty-transport-4.0.23.Final.208198c, netty-transport-rxtx=netty-transport-rxtx-4.0.23.Final.208198c, netty-transport-sctp=netty-transport-sctp-4.0.23.Final.208198c, netty-transport-udt=netty-transport-udt-4.0.23.Final.208198c]
INFO [main] 2017-11-06 02:01:35,415 Server.java:194 - Starting listening for CQL clients on /0:0:0:0:0:0:0:0:9042...
INFO [main] 2017-11-06 02:01:35,487 ThriftServer.java:119 - Binding thrift service to /0:0:0:0:0:0:0:0:9160
ERROR [main] 2017-11-06 02:01:35,499 CassandraDaemon.java:583 - Exception encountered during startup
java.lang.RuntimeException: Unable to create thrift socket to /0:0:0:0:0:0:0:0:9160
at org.apache.cassandra.thrift.CustomTThreadPoolServer$Factory.buildTServer(CustomTThreadPoolServer.java:270) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.thrift.TServerCustomFactory.buildTServer(TServerCustomFactory.java:46) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.(ThriftServer.java:131) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.thrift.ThriftServer.start(ThriftServer.java:58) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:502) [apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:579) [apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:655) [apache-cassandra-2.1.15.jar:2.1.15]
Caused by: org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address /0:0:0:0:0:0:0:0:9160.
at org.apache.cassandra.thrift.TCustomServerSocket.(TCustomServerSocket.java:75) ~[apache-cassandra-2.1.15.jar:2.1.15]
at org.apache.cassandra.thrift.CustomTThreadPoolServer$Factory.buildTServer(CustomTThreadPoolServer.java:265) ~[apache-cassandra-2.1.15.jar:2.1.15]
... 6 common frames omitted
INFO [StorageServiceShutdownHook] 2017-11-06 02:01:35,531 Server.java:225 - Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2017-11-06 02:01:35,532 Gossiper.java:1454 - Announcing shutdown
INFO [StorageServiceShutdownHook] 2017-11-06 02:01:35,534 StorageService.java:1715 - Node /192.168.9.46 state jump to shutdown
INFO [RMI TCP Connection(12)-127.0.0.1] 2017-11-06 02:01:35,842 ThriftServer.java:119 - Binding thrift service to /0:0:0:0:0:0:0:0:9160
INFO [StorageServiceShutdownHook] 2017-11-06 02:01:37,554 MessagingService.java:734 - Waiting for messaging service to quiesce
INFO [ACCEPT-/192.168.9.46] 2017-11-06 02:01:37,555 MessagingService.java:1020 - MessagingService has terminated the accept() thread
It seems to generate these errors periodically.
The /var/log/clearwater-cluster-manager/cluster-manager.output.log gives no logs. The /var/log/clearwater-cluster-manager/cluster-manager_20171106T100000Z.txt gives the following logs:
06-11-2017 02:07:13.763 UTC INFO utils.py:310 (thread MainThread): Acquired exclusive lock on /var/run/clearwater-cluster-manager.pid.lockfile
06-11-2017 02:07:13.764 UTC INFO plugin_loader.py:26 (thread MainThread): Inspecting cassandra_plugin.py
06-11-2017 02:07:13.810 UTC INFO plugin_loader.py:34 (thread MainThread): Loading cassandra_plugin.py
06-11-2017 02:07:13.810 UTC INFO plugin_loader.py:36 (thread MainThread): Loaded cassandra_plugin.py successfully
06-11-2017 02:07:13.810 UTC INFO main.py:180 (thread MainThread): Loaded plugin <cassandra_plugin.CassandraPlugin object at 0x7fd647d5ec50>
06-11-2017 02:07:13.812 UTC INFO main.py:189 (thread MainThread): Started thread for plugin <cassandra_plugin.CassandraPlugin object at 0x7fd647d5ec50>
06-11-2017 02:07:13.815 UTC INFO common_etcd_synchronizer.py:301 (thread CassandraPlugin): Read value {"192.168.9.46": "joining, acknowledged change"} from etcd, comparing to last value None
06-11-2017 02:07:13.830 UTC INFO etcd_synchronizer.py:47 (thread CassandraPlugin): Got new state {"192.168.9.46": "joining, acknowledged change"} from etcd
06-11-2017 02:07:13.830 UTC INFO synchronization_fsm.py:108 (thread CassandraPlugin): Entered state machine for 192.168.9.46 with local state joining, acknowledged change, cluster state joining, config changing and cluster view {u'192.168.9.46': u'joining, acknowledged change'}
06-11-2017 02:07:13.830 UTC INFO synchronization_fsm.py:24 (thread CassandraPlugin): Calling plugin method CassandraPlugin.on_startup
06-11-2017 02:07:13.831 UTC INFO synchronization_fsm.py:24 (thread CassandraPlugin): Calling plugin method CassandraPlugin.on_joining_cluster
06-11-2017 02:07:13.846 UTC INFO alarms.py:312 (thread CassandraPlugin): Imported /usr/share/clearwater/bin/alarms.py
06-11-2017 02:07:15.951 UTC INFO cassandra_plugin.py:95 (thread CassandraPlugin): Cassandra seeds list is 192.168.9.46
06-11-2017 02:07:16.022 UTC INFO cassandra_plugin.py:120 (thread CassandraPlugin): Restarting Cassandra
06-11-2017 02:07:16.115 UTC ERROR plugin_utils.py:44 (thread CassandraPlugin): Command start-stop-daemon -K -p /var/run/cassandra/cassandra.pid -R TERM/30/KILL/5 failed with return code 1, stdout "No process in pidfile '/var/run/cassandra/cassandra.pid' found running; none killed.\n", and stderr 'start-stop-daemon: warning: failed to kill 238: No such process\n'
06-11-2017 02:07:16.115 UTC INFO cassandra_plugin.py:139 (thread CassandraPlugin): Stopped Cassandra while changing config files
06-11-2017 02:07:16.115 UTC WARNING cassandra_plugin.py:147 (thread CassandraPlugin): Deleting /var/lib/cassandra - this is normal on initial clustering
06-11-2017 02:07:16.388 UTC INFO cassandra_plugin.py:233 (thread CassandraPlugin): Waiting for Cassandra to come up...
06-11-2017 02:07:43.024 UTC INFO cassandra_plugin.py:241 (thread CassandraPlugin): Still waiting for Cassandra to come up...
06-11-2017 02:08:08.696 UTC INFO cassandra_plugin.py:241 (thread CassandraPlugin): Still waiting for Cassandra to come up...
06-11-2017 02:08:35.269 UTC INFO cassandra_plugin.py:241 (thread CassandraPlugin): Still waiting for Cassandra to come up...
Thanks, I am forward to your reply!
It looks like your pod is still not getting any ipv6 configuration.
You said that the output of ifconfig
looked the same as mine, but there's one crucial part missing, which is the ipv6 addresses.
You'll note that the ifconfig
output from my cassandra pod contained the following line for eth0:
inet addr:10.230.17.236 Bcast:0.0.0.0 Mask:255.255.255.0
and for the lo inteface:
inet6 addr: ::1/128 Scope:Host
Without any ipv6 configuration on the interfaces, Cassandra is unable to bind the thrift service to 0::0:9160
, which is why you're still seeing the log Exception encountered during startup java.lang.RuntimeException: Unable to create thrift socket to /0:0:0:0:0:0:0:0:9160
.
Can you confirm that the ipv6 sysctl setting has succeeded? You can check the current sysctl settings for it by running sudo sysctl -a | grep "net.ipv6.conf.all"
. We believe the disable_ipv6 flag must be set to zero. Also, just to check: In your reply above, you have written that you set "net.ipv6.conf.all.disable_ipv6=o", which is using the letter o
, not the digit zero 0
. I assume this is a typo, but if not, that may be the source of the problem.
@johadalin ,thank you very much for your reply. I can make sure I net.ipv6.conf.all.disable_ipv6=0, and the arbitrarily worker nodes command "sysctl -a | grep "net.ipv6.conf.all" gives:
sysctl: reading key "net.ipv6.conf.all.stable_secret"
net.ipv6.conf.all.accept_dad = 1
net.ipv6.conf.all.accept_ra = 1
net.ipv6.conf.all.accept_ra_defrtr = 1
net.ipv6.conf.all.accept_ra_from_local = 0
net.ipv6.conf.all.accept_ra_min_hop_limit = 1
net.ipv6.conf.all.accept_ra_mtu = 1
net.ipv6.conf.all.accept_ra_pinfo = 1
net.ipv6.conf.all.accept_ra_rt_info_max_plen = 0
net.ipv6.conf.all.accept_ra_rtr_pref = 1
net.ipv6.conf.all.accept_redirects = 1
net.ipv6.conf.all.accept_source_route = 0
net.ipv6.conf.all.autoconf = 1
net.ipv6.conf.all.dad_transmits = 1
net.ipv6.conf.all.disable_ipv6 = 0
net.ipv6.conf.all.drop_unicast_in_l2_multicast = 0
net.ipv6.conf.all.drop_unsolicited_na = 0
net.ipv6.conf.all.enhanced_dad = 1
net.ipv6.conf.all.force_mld_version = 0
net.ipv6.conf.all.force_tllao = 0
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.hop_limit = 64
net.ipv6.conf.all.ignore_routes_with_linkdown = 0
net.ipv6.conf.all.keep_addr_on_down = 0
net.ipv6.conf.all.max_addresses = 16
net.ipv6.conf.all.max_desync_factor = 600
net.ipv6.conf.all.mc_forwarding = 0
net.ipv6.conf.all.mldv1_unsolicited_report_interval = 10000
net.ipv6.conf.all.mldv2_unsolicited_report_interval = 1000
net.ipv6.conf.all.mtu = 1280
net.ipv6.conf.all.ndisc_notify = 0
net.ipv6.conf.all.proxy_ndp = 0
net.ipv6.conf.all.regen_max_retry = 3
net.ipv6.conf.all.router_probe_interval = 60
net.ipv6.conf.all.router_solicitation_delay = 1
net.ipv6.conf.all.router_solicitation_interval = 4
net.ipv6.conf.all.router_solicitation_max_interval = 3600
net.ipv6.conf.all.router_solicitations = -1
net.ipv6.conf.all.seg6_enabled = 0
net.ipv6.conf.all.seg6_require_hmac = 0
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.docker0.stable_secret"net.ipv6.conf.all.suppress_frag_ndisc = 1
net.ipv6.conf.all.temp_prefered_lft = 86400
sysctl: reading key "net.ipv6.conf.ens3.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
net.ipv6.conf.all.temp_valid_lft = 604800
net.ipv6.conf.all.use_oif_addrs_only = 0
net.ipv6.conf.all.use_tempaddr = 2
Then I find out that the cassandra pod ifconfig gives:
eth0 Link encap:Ethernet HWaddr 0a:58:c0:a8:03:1d
inet addr:192.168.3.29 Bcast:0.0.0.0 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:649 errors:0 dropped:0 overruns:0 frame:0
TX packets:872 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:73108 (73.1 KB) TX bytes:82697 (82.6 KB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:975 errors:0 dropped:0 overruns:0 frame:0
TX packets:975 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:91172 (91.1 KB) TX bytes:91172 (91.1 KB)
I can not find the key item: inet6 addr: ::1/128 Scope:Host
So what is the problem, my container network is CNI by flanneld. Thank you very much, I am forward to your reply.
Hi @fogongzi , sorry for the delay.
The problem is down to how we're currently setting up our listen addresses in the cassandra yaml, i believe. The line in question i think is https://github.com/Metaswitch/clearwater-cassandra/blob/master/clearwater-cassandra/usr/share/clearwater/cassandra/cassandra.yaml.template#L444 , where we're setting it to ::0
. In the past, we used to have this as just 0.0.0.0
but that caused a separate set of issues in different conditions, which is why we changed it.
As it stands now, i don't have time to dig in and come up with the right fix for this, where we probably need to do some more research into the expected container environment.
You should be able to work around this for now though, by building your cassandra container image to replace the rpc_listen_address line. You should be able to do this by modifying the dockerfile to edit the cassandra yaml template. I'd suggest doing some testing with RUN sed -i "s/rpc_address: ::0/rpc_address: 0.0.0.0/" /usr/share/clearwater/cassandra/cassandra.yaml.template
at the end of the dockerfile. Hopefully that should get you up and running.
Let me know how it goes
@johadalin Excuse me for jumping in.
You should be able to work around this for now though, by building your cassandra container image to replace the rpc_listen_address line. You should be able to do this by modifying the dockerfile to edit the cassandra yaml template. I'd suggest doing some testing with
RUN sed -i "s/rpc_address: ::0/rpc_address: 0.0.0.0/" /usr/share/clearwater/cassandra/cassandra.yaml.template
at the end of the dockerfile. Hopefully that should get you up and running.
I haven't tested enough yet though, this workaround works fine for me.
Do you have any plan to add this command to cassandra/Dockerfile
on this repository?