ceph/ceph-helm

Unable to mount volumes : timeout expired waiting for volumes to attach/mount

feresberbeche opened this issue · 2 comments

Is this a request for help?: Yes


Is this a BUG REPORT or FEATURE REQUEST? Bug report

Version of Helm and Kubernetes:

kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"} 
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-18T23:58:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"} 
helm version
Client: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}

Which chart: ceph-helm

What happened:

Unable to mount volumes for pod "mypod_default(e68c8e3e-6578-11e8-87c4-e83935e84dc8)": timeout expired waiting for volumes to attach/mount for pod "default"/"mypod". list of unattached/unmounted volumes=[vol1]

How to reproduce it (as minimally and precisely as possible):
http://docs.ceph.com/docs/master/start/kube-helm/

Anything else we need to know:

The ceph cluster is working fine

  ceph -s
  cluster:
    id:     88596d9e-b478-47a9-8208-3a6cea33d1d4
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum kubernetes
    mgr: kubernetes(active)
    mds: cephfs-1/1/1 up  {0=mds-ceph-mds-5696f9df5d-jbsgz=up:active}
    osd: 1 osds: 1 up, 1 in
    rgw: 1 daemon active
 
  data:
    pools:   7 pools, 176 pgs
    objects: 213 objects, 3391 bytes
    usage:   108 MB used, 27134 MB / 27243 MB avail
    pgs:     176 active+clean

Everything in the ceph namespace works fine.
In the mon pod I can see an image created for the PVC:

rbd ls
kubernetes-dynamic-pvc-0077fdf9-6578-11e8-b1f8-b63c3e9e1eaa
kubectl get pvc
NAME                  STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-pvc              Bound     pvc-c9d07cf9-6578-11e8-87c4-e83935e84dc8   1Gi        RWO            ceph-rbd       29m

I have changed resolv.conf to add kube-dns as a nameserver; I can resolve
ceph-mon.ceph and ceph-mon.ceph.svc.local from the host node.

Some related kubelet logs I found:
juin 01 11:24:19 kubernetes kubelet[32612]: E0601 11:24:19.587800 32612 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/rbd/[ceph-mon.ceph.svc.cluster.local:6789]:kubernetes-dynamic-pvc-0077fdf9-6578-11e8-b1f8-b63c3e9e1eaa\"" failed. No retries permitted until 2018-06-01 11:24:51.582365588 +0200 CEST m=+162261.330642194 (durationBeforeRetry 32s). Error: "MountVolume.WaitForAttach failed for volume \"pvc-004d66b7-6578-11e8-87c4-e83935e84dc8\" (UniqueName: \"kubernetes.io/rbd/[ceph-mon.ceph.svc.cluster.local:6789]:kubernetes-dynamic-pvc-0077fdf9-6578-11e8-b1f8-b63c3e9e1eaa\") pod \"ldap-ss-0\" (UID: \"f63432e0-6579-11e8-87c4-e83935e84dc8\") : error: exit status 1, rbd output: 2018-06-01 11:19:19.513914 7f1cf1f227c0 -1 did not load config file, using default settings.\n2018-06-01 11:19:19.579955 7f1cf1f20700 0 -- IP@:0/1002573 >> IP@:6789/0 pipe(0x3a2a3f0 sd=3 :53578 s=1 pgs=0 cs=0 l=1 c=0x3a2e6e0).connect protocol feature mismatch, my 83ffffffffffff < peer 481dff8eea4fffb missing 400000000000000\n2018-06-01 11:19:19.580065 7f1cf1f20700 0 -- IP@:0/1002573 >> IP@:6789/0 pipe(0x3a2a3f0 sd=3 :53578 s=1 pgs=0 cs=0 l=1 c=0x3a2e6e0).fault\n2018-06-01 11:19:19.580437 7f1cf1f20700 0 -- IP@:0/1002573 >> 10.1.0.146:6789/0 pipe(0x3a2a3f0 sd=3 :53580 s=1 pgs=0 cs=0 l=1 c=0x3a2e6e0).connect protocol feature mismatch, my 83ffffffffffff < peer 481dff8eea4fffb missing 400000000000000\n2018-06-01 11:19:19.781427 7f1cf1f20700 0 -- 10.1.0.146:0/1002573 >> 10.1.0.146:6789/0 pipe(0x3a2a3f0 sd=3 :53584 s=1 pgs=0 cs=0 l=1 c=0x3a2e6e0).**connect protocol feature mismatch**, my 83ffffffffffff < peer 481dff8eea4fffb missing 400000000000000\n2018-06-01 11:19:20.182401 7f1cf1f20700 0 -- 10.1.0.146:0/1002573 >> 10.1.0.146:6789/0 pipe(0x3a2a3f0 sd=3 :53588 s=1 pgs=0 cs=0 l=1 c=0x3a2e6e0).**connect protocol feature mismatch**, my 83ffffffffffff < peer 481dff8eea4fffb missing 400000000000000\n2018-06-01 11:19:20.983428 7f1cf1f20700 0 -- IP@:0/1002573 >> ip@:6789/0 pipe(0x3a2a3f0 sd=3 
:53610 s=1 pgs=0 cs=0 l=1 c=0x3a2e6e0).conne

I don't know why it tries to connect to my Kubernetes node's external IP on port 6789; that port is only exposed by the ceph-mon headless svc, which is:

kubectl get svc -n ceph
NAME       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
ceph-mon   ClusterIP   None            <none>        6789/TCP   1h

From the kubernetes node I can telnet to the port 6789

telnet ceph-mon.ceph 6789
Trying IP@ ... 
Connected to ceph-mon.ceph. 

The connect protocol feature mismatch in the kubelet logs could have something to do with this note:

Important: Kubernetes uses the RBD kernel module to map RBDs to hosts. Luminous requires CRUSH_TUNABLES 5 (Jewel). The minimal kernel version for these tunables is 4.5. If your kernel does not support these tunables, run ceph osd crush tunables hammer

in the ceph-helm doc
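For what it's worth, the "missing 400000000000000" value in the kubelet log can be decoded: it is a hex feature bitmask with a single bit set. A minimal sketch (assuming a bash-like shell; the interpretation of bit 58 as the shared CRUSH_TUNABLES5 feature bit comes from Ceph's feature table, not from this issue):

```shell
# Decode which single feature bit the kernel RBD client is missing.
# In Ceph's feature table, bit 58 is the shared bit covering
# CRUSH_TUNABLES5 -- the tunable the ceph-helm doc note warns about.
missing=$((0x400000000000000))
bit=0
while [ "$missing" -gt 1 ]; do
    missing=$((missing >> 1))
    bit=$((bit + 1))
done
echo "missing feature bit: $bit"   # prints "missing feature bit: 58"
```

This matches the doc note above: the kernel client does not support CRUSH_TUNABLES 5, so the mon rejects the connection with a protocol feature mismatch.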

And yes, that was it: you only need to run
ceph osd crush tunables hammer
in the ceph-mon pod.
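The fix can be sketched as a quick pre-check, assuming a bash-like shell (the helper name is hypothetical, not part of ceph-helm):

```shell
# Hypothetical helper: decide from a kernel release string whether the
# node's kernel is older than 4.5, i.e. lacks CRUSH_TUNABLES 5 support,
# in which case "ceph osd crush tunables hammer" is needed.
needs_hammer_tunables() {
    ver="$1"                     # e.g. "4.4.0-116-generic" from uname -r
    major=${ver%%.*}             # text before the first dot
    rest=${ver#*.}
    minor=${rest%%.*}            # text between first and second dot
    minor=${minor%%-*}           # strip any "-generic"-style suffix
    if [ "$major" -lt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -lt 5 ]; }; then
        echo yes
    else
        echo no
    fi
}

# Usage on a node:  needs_hammer_tunables "$(uname -r)"
# If it prints "yes", run inside the ceph-mon pod:
#   kubectl -n ceph exec <ceph-mon-pod> -- ceph osd crush tunables hammer
```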
I will leave this here in case anybody else has the same issue 😃
/close

Hello @feresberbeche, thank you for this, very helpful. I was stuck because of my kernel version 4.4.0...
Upgrading the kernel solved everything.

Detailed versions:
CEPH:
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)

Kubernetes:
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Linux Kernel:
4.15.0-30-generic