Volume mounting issues with Helm deployment
- After a successful deploy, another pair of `ckan` and `jobs` pods is created and hangs there forever. However, there are two working pods for both of them, and both look fine in the logs:
```
kubectl get pods -n aa
NAME                       READY   STATUS     RESTARTS   AGE
ckan-54854bd485-fx6j9      0/1     Init:0/2   0          18m
ckan-54f97b4cf4-fszlr      1/1     Running    0          5m38s
jobs-6d54db954d-tmgs8      0/1     Init:0/2   0          18m
jobs-db-7b7657ff88-gsj9n   1/1     Running    0          39m
nginx-6465d756b9-qsbdp     1/1     Running    0          39m
redis-c8c6ff95-zj68h       1/1     Running    0          39m
```
- Both stuck pods have nearly identical events in `kubectl describe`; the Multi-Attach error on the `ckan-data` volume stands out (a few checks on that volume follow after this list).

jobs
```
Events:
  Type     Reason              Age                From                     Message
  ----     ------              ----               ----                     -------
  Normal   Scheduled           18m                default-scheduler        Successfully assigned aa/jobs-6d54db954d-tmgs8 to aks-default-44369806-vmss000001
  Warning  FailedAttachVolume  18m                attachdetach-controller  Multi-Attach error for volume "pvc-a90fe26a-517c-11ea-9a7d-fe823327918f" Volume is already used by pod(s) ckan-54f97b4cf4-pbgrd, ckan-54854bd485-p98sl
  Warning  FailedMount         50s (x8 over 16m)  kubelet, aks-default-44369806-vmss000001  Unable to mount volumes for pod "jobs-6d54db954d-tmgs8_aa(9f4d85bf-517f-11ea-9a7d-fe823327918f)": timeout expired waiting for volumes to attach or mount for pod "aa"/"jobs-6d54db954d-tmgs8". list of unmounted volumes=[ckan-data]. list of unattached volumes=[ckan-conf-secrets ckan-conf-templates ckan-data ckan-aa-operator-token-kgz4x]
```
ckan
```
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    18m                default-scheduler  Successfully assigned aa/ckan-54854bd485-fx6j9 to aks-default-44369806-vmss000000
  Warning  FailedMount  48s (x8 over 16m)  kubelet, aks-default-44369806-vmss000000  Unable to mount volumes for pod "ckan-54854bd485-fx6j9_aa(9f343bf5-517f-11ea-9a7d-fe823327918f)": timeout expired waiting for volumes to attach or mount for pod "aa"/"ckan-54854bd485-fx6j9". list of unmounted volumes=[ckan-data]. list of unattached volumes=[ckan-conf-secrets ckan-conf-templates ckan-data ckan-aa-operator-token-kgz4x]
```
- Not sure whether the bad gateway on https://aa.viderum.xyz/ is related to the above, but I can see nothing suspicious in the logs of the working `ckan` pod. Part of the logs from there:
```
2020-02-17 12:34:17,459 INFO [pyutilib.component.core.pca] [MainThread] Removing service SchemingGroupsPlugin from environment pca
2020-02-17 12:34:17,459 INFO [pyutilib.component.core.pca] [MainThread] Removing service SchemingGroupsPlugin from environment pca
2020-02-17 12:34:17,459 INFO [pyutilib.component.core.pca] [MainThread] Removing service SchemingGroupsPlugin from environment pca
2020-02-17 12:34:17,459 INFO [pyutilib.component.core.pca] [MainThread] Removing service SchemingGroupsPlugin from environment pca
2020-02-17 12:34:17,459 INFO [pyutilib.component.core.pca] [MainThread] Removing service PagesPluginBase from environment pca
2020-02-17 12:34:17,459 INFO [pyutilib.component.core.pca] [MainThread] Removing service PagesPluginBase from environment pca
2020-02-17 12:34:18,681 INFO [ckan.config.environment] [MainThread] Loading templates from /usr/lib/ckan/src/ckan/ckan/templates
```
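The Multi-Attach error above is typical of a ReadWriteOnce volume (on AKS usually an Azure Disk) that is still attached to the node running the old pod, so a new pod scheduled onto a different node cannot mount it. A few read-only checks, assuming the `aa` namespace and using the volume name from the events, to confirm the access mode and where the pods landed:

```sh
# The ACCESS MODES column of the PVC list should show RWO, i.e. only one node can attach the disk.
kubectl -n aa get pvc

# -o wide shows which node each pod was scheduled onto; the stuck pods are likely on a
# different node than the running pods that still hold the volume.
kubectl -n aa get pods -o wide

# Full details of the contested volume, including its access modes and reclaim policy.
kubectl describe pv pvc-a90fe26a-517c-11ea-9a7d-fe823327918f
```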
From @akariv:

> It seems that there is a volume mounting problem, right?
> CCO starts CKAN in two phases: the first one just to initialize things, and then a full deployment with replication.
> It looks like the second deployment can't start because the volumes are already mounted by the first deployment,
> and the first deployment won't end for some reason (perhaps it's waiting for the second deployment to reach the ready state, causing a deadlock?).
> I'm guessing there's something you can change in the graceful termination period setting (`terminationGracePeriodSeconds`) of the deployment that might fix this behaviour.
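A sketch of the kind of change that could be tried, assuming the Deployment is named `ckan`, lives in namespace `aa`, and currently uses the default RollingUpdate strategy: switching to `Recreate` makes the old pod terminate (and detach its ReadWriteOnce volume) before the replacement starts, and a shorter termination grace period makes a stuck pod release the volume sooner. The names and values below are illustrative, not taken from the chart:

```sh
# Hypothetical patch: Recreate strategy plus a 30s termination grace period.
# rollingUpdate is nulled out because it may not be set alongside the Recreate type.
kubectl -n aa patch deployment ckan --type merge -p '{
  "spec": {
    "strategy": {"type": "Recreate", "rollingUpdate": null},
    "template": {"spec": {"terminationGracePeriodSeconds": 30}}
  }
}'
```

If CCO owns these Deployments, a manual patch may be overwritten on the next reconcile, so the change would ultimately belong in the operator's templates.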
This seems to be resolved by scaling the deployment down and back up. Tried it manually and it worked; we should probably make this part of CCO.
```sh
kubectl scale -n <namespace> deployment <deployment> --replicas=0
kubectl scale -n <namespace> deployment <deployment> --replicas=1
kubectl -n <namespace> get deployments
```
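The same sequence with this issue's names filled in, plus a wait so the old pod has actually released the volume before scaling back up. The `app=ckan` label selector is an assumption; use whatever labels the chart really sets:

```sh
kubectl -n aa scale deployment ckan --replicas=0
# Wait until the old pod is gone so its ReadWriteOnce volume gets detached.
kubectl -n aa wait --for=delete pod -l app=ckan --timeout=120s
kubectl -n aa scale deployment ckan --replicas=1
kubectl -n aa get deployments
```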
But unfortunately, this did not solve the bad gateway issue. The CKAN pod is stuck on startup and the app is not listening on port 5000.
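A few checks, again assuming the `aa` namespace and a Deployment named `ckan`, to narrow down why nginx returns a bad gateway: whether the CKAN process ever reports listening on port 5000, and whether the Service in front of it has any endpoints:

```sh
# Recent logs from the running ckan pod; look for the line announcing the server
# started on port 5000, or for a traceback during startup.
kubectl -n aa logs deployment/ckan --tail=100

# A ckan Service with an empty ENDPOINTS column would explain the 502 from nginx.
kubectl -n aa get endpoints

# Bypass nginx entirely: forward the port and probe the app directly.
kubectl -n aa port-forward deployment/ckan 5000:5000 &
curl -sI http://localhost:5000
```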