datopian/ckan-cloud-operator

Volume mounting issues with helm deployment

Opened this issue · 2 comments

  • After a successful deploy, another pair of ckan and jobs pods is created and hangs there forever.
  • However, there are also 2 working pods for both of them, and both look fine in the logs:
kubectl get pods -n aa
NAME                       READY   STATUS     RESTARTS   AGE
ckan-54854bd485-fx6j9      0/1     Init:0/2   0          18m
ckan-54f97b4cf4-fszlr      1/1     Running    0          5m38s
jobs-6d54db954d-tmgs8      0/1     Init:0/2   0          18m
jobs-db-7b7657ff88-gsj9n   1/1     Running    0          39m
nginx-6465d756b9-qsbdp     1/1     Running    0          39m
redis-c8c6ff95-zj68h       1/1     Running    0          39m
  • Both show nearly identical events from kubectl describe (a volume diagnostics sketch follows the logs below)
    jobs
Events:
  Type     Reason              Age                From                                      Message
  ----     ------              ----               ----                                      -------
  Normal   Scheduled           18m                default-scheduler                         Successfully assigned aa/jobs-6d54db954d-tmgs8 to aks-default-44369806-vmss000001
  Warning  FailedAttachVolume  18m                attachdetach-controller                   Multi-Attach error for volume "pvc-a90fe26a-517c-11ea-9a7d-fe823327918f" Volume is already used by pod(s) ckan-54f97b4cf4-pbgrd, ckan-54854bd485-p98sl
  Warning  FailedMount         50s (x8 over 16m)  kubelet, aks-default-44369806-vmss000001  Unable to mount volumes for pod "jobs-6d54db954d-tmgs8_aa(9f4d85bf-517f-11ea-9a7d-fe823327918f)": timeout expired waiting for volumes to attach or mount for pod "aa"/"jobs-6d54db954d-tmgs8". list of unmounted volumes=[ckan-data]. list of unattached volumes=[ckan-conf-secrets ckan-conf-templates ckan-data ckan-aa-operator-token-kgz4x]

ckan

Events:
  Type     Reason       Age                From                                      Message
  ----     ------       ----               ----                                      -------
  Normal   Scheduled    18m                default-scheduler                         Successfully assigned aa/ckan-54854bd485-fx6j9 to aks-default-44369806-vmss000000
  Warning  FailedMount  48s (x8 over 16m)  kubelet, aks-default-44369806-vmss000000  Unable to mount volumes for pod "ckan-54854bd485-fx6j9_aa(9f343bf5-517f-11ea-9a7d-fe823327918f)": timeout expired waiting for volumes to attach or mount for pod "aa"/"ckan-54854bd485-fx6j9". list of unmounted volumes=[ckan-data]. list of unattached volumes=[ckan-conf-secrets ckan-conf-templates ckan-data ckan-aa-operator-token-kgz4x]
  • Not sure whether the bad gateway on https://aa.viderum.xyz/ is related to the above, but I can see nothing wrong in the logs of the working ckan pod. Part of the logs from there:
2020-02-17 12:34:17,459 INFO  [pyutilib.component.core.pca] [MainThread] Removing service SchemingGroupsPlugin from environment pca
2020-02-17 12:34:17,459 INFO  [pyutilib.component.core.pca] [MainThread] Removing service SchemingGroupsPlugin from environment pca
2020-02-17 12:34:17,459 INFO  [pyutilib.component.core.pca] [MainThread] Removing service SchemingGroupsPlugin from environment pca
2020-02-17 12:34:17,459 INFO  [pyutilib.component.core.pca] [MainThread] Removing service SchemingGroupsPlugin from environment pca
2020-02-17 12:34:17,459 INFO  [pyutilib.component.core.pca] [MainThread] Removing service PagesPluginBase from environment pca
2020-02-17 12:34:17,459 INFO  [pyutilib.component.core.pca] [MainThread] Removing service PagesPluginBase from environment pca
2020-02-17 12:34:18,681 INFO  [ckan.config.environment] [MainThread] Loading templates from /usr/lib/ckan/src/ckan/ckan/templates
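
The Multi-Attach error above is what Kubernetes typically reports when a ReadWriteOnce disk is still attached to the node running the older pods while the new pod was scheduled onto a different node (note vmss000000 vs vmss000001 in the events). A hedged diagnostic sketch, reusing the volume name from the error and the aa namespace from the output above:

kubectl describe pv pvc-a90fe26a-517c-11ea-9a7d-fe823327918f   # which claim and node hold the disk
kubectl -n aa get pvc                                          # check the ACCESS MODES column (likely RWO)
kubectl -n aa get pods -o wide                                 # which nodes the stuck and running pods landed on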

From @akariv:
It seems that there is a volume mounting problem, right?
CCO starts CKAN in two phases: the first deployment just initializes things, and then a full deployment with replication follows.
It looks like the second deployment can't start because the volumes are still mounted by the first deployment,
and the first deployment won't end for some reason (perhaps it's waiting for the second deployment to reach the ready state, causing a deadlock?).
I'm guessing there's something you can change in the termination grace period setting of the deployment that might fix this behaviour.
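
As a hedged sketch only (not a confirmed CCO fix), the grace period can be adjusted on the generated deployments with kubectl patch; the namespace aa and the deployment names ckan and jobs come from the output above, and the 30-second value is purely illustrative:

kubectl -n aa patch deployment ckan --type merge -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":30}}}}'
kubectl -n aa patch deployment jobs --type merge -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":30}}}}'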

This seems to be resolved after scaling the deployment down and up again. Tried it manually and it worked; we should probably make this part of cco.

kubectl scale -n <namespace> deployment <deployment> --replicas=0
kubectl scale -n <namespace> deployment <deployment> --replicas=1
kubectl -n <namespace> get deployments
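
A minimal sketch of how that manual restart could look with the concrete names from this cluster (namespace aa, deployments ckan and jobs), waiting for each rollout before continuing:

kubectl -n aa scale deployment ckan --replicas=0
kubectl -n aa scale deployment ckan --replicas=1
kubectl -n aa rollout status deployment ckan
kubectl -n aa scale deployment jobs --replicas=0
kubectl -n aa scale deployment jobs --replicas=1
kubectl -n aa rollout status deployment jobs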

Unfortunately, this did not solve the bad gateway issue. The CKAN pod is stuck starting up and the app is not listening on port 5000.
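
A hedged sketch for confirming whether anything is listening on port 5000, assuming the ckan deployment and aa namespace from the output above:

kubectl -n aa get endpoints                         # an empty endpoints list would explain the bad gateway
kubectl -n aa port-forward deployment/ckan 5000:5000
# in another terminal:
curl -v http://localhost:5000/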