mattshma/bigdata

Unable to mount volumes for pod XX. list of unattached/unmounted volumes=YYYY

mattshma opened this issue · 4 comments

Looking into the cause, the documentation explains:

When the PVC protection alpha feature is enabled, if a user deletes a PVC in active use by a pod, the PVC is not removed immediately. PVC removal is postponed until the PVC is no longer actively used by any pods.

The k8s log shows the following:

Mar 22 14:34:52  kubelet[2715]: E0322 14:34:52.534778    2715 desired_state_of_world_populator.go:273] Error processing volume "jupyter" for pod "jupyter-2zvlc(a726c9c6-2d9a-11e8-b5f0-005056b76c14)": error processing PVC "k8s"/"jupyter-
 "k8s"/"jupyter": PVC k8s/jupyter has non-bound phase ("Pending") or empty pvc.Spec.VolumeName ("")
Mar 22 14:34:52  kubelet[2715]: E0322 14:34:52.734273    2715 desired_state_of_world_populator.go:273] Error processing volume "jupyter" for pod "jupyter_2zvlc(a726c9c6-2d9a-11e8-b5f0-005056b76c14)": error processing PVC "k8s"/"jupyter": PVC k8s/jupyter has non-bound phase ("Pending") or empty pvc.Spec.VolumeName ("")
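The kubelet error above names two conditions: the PVC phase must be "Bound" and pvc.Spec.VolumeName must be non-empty. A minimal sketch of the same check in shell follows. The helper name pvc_is_usable is hypothetical, and the kubectl calls in the comments assume the k8s namespace and jupyter PVC seen in the log.

```shell
# pvc_is_usable PHASE VOLUME_NAME
# Hypothetical helper: succeeds only when the phase is "Bound" and the
# volume name is non-empty, the two conditions named in the kubelet error.
pvc_is_usable() {
    [ "$1" = "Bound" ] && [ -n "$2" ]
}

# Example wiring (assumes kubectl access; not executed here):
#   phase=$(kubectl -n k8s get pvc jupyter -o jsonpath='{.status.phase}')
#   volume=$(kubectl -n k8s get pvc jupyter -o jsonpath='{.spec.volumeName}')
#   pvc_is_usable "$phase" "$volume" || echo "PVC k8s/jupyter is not bound yet"
```

A PVC stuck in Pending fails this check immediately, which rules the storage backend in or out before digging through node logs.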

On the K8s master, check the pod's events:

# kubectl -n NAMESPACE describe pod POD_NAME
Events:
  Type     Reason                 Age   From                     Message
  ----     ------                 ----  ----                     -------
  Normal   Scheduled              2m    default-scheduler        Successfully assigned jupyter to xxxx
  Warning  FailedAttachVolume     2m    attachdetach-controller  Multi-Attach error for volume "jupyter" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount            2s    kubelet,xxx   Unable to mount volumes for pod "jupyter(df1d9172-2e48-11e8-ba93-005056b75104)": timeout expired waiting for volumes to attach/mount for pod "k8s"/"jupyter".
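When the Events table is long, it can help to pull out only the Warning rows. The following is a small sketch; warning_events is a hypothetical helper name, and it assumes the column layout shown above (Type first, Reason second).

```shell
# Print the Reason column for every Warning row of an Events table on stdin.
warning_events() {
    awk '$1 == "Warning" { print $2 }'
}

# Example wiring (assumes kubectl access; not executed here):
#   kubectl -n NAMESPACE describe pod POD_NAME | warning_events
```

For the events above this would list FailedAttachVolume and FailedMount, the two reasons that point at the volume rather than at scheduling.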

Suspecting a Ceph problem, I first checked the Ceph locks: rbd lock list kube/jupyter printed nothing, so the image was not locked. With no other leads, I went back to the kubelet log with journalctl -xe -u kubelet:

Mar 23 11:19:48 xxx kubelet[18992]: I0323 11:19:48.892975   18992 rbd_util.go:273] rbd image kube/jupyter still being used
Mar 23 11:19:48 xxx kubelet[18992]: E0323 11:19:48.893128   18992 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/rbd/[xxxxx]:jupyter\"" failed. No retries permitted until 2018-03-23 11:20:52.893058902 +0800 CST m=+50438.570203365 (durationBeforeRetry 1m4s). Error: "MountVolume.WaitForAttach failed for volume \"jupyter\" (UniqueName: \"kubernetes.io/rbd/[xxx]:jupyter\") pod \"jupyter-7bd54668c7-5496r\" (UID: \"df1d9172-2e48-11e8-ba93-005056b75104\") : rbd image kube/jupyter is still being used. rbd output: Watchers:\n\twatcher=xxxxx:0/3785365512 client.20052960 cookie=18446462598732840961\n"
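The telling phrase in these lines is "rbd image ... still being used". A sketch of a filter that extracts the affected image names from the kubelet log follows. busy_rbd_images is a hypothetical helper, and the pattern assumes the two message wordings seen above; grep -o is assumed to be available (GNU and BSD grep both provide it).

```shell
# Print each distinct pool/image that the kubelet reports as still in use.
# Matches both "rbd image X still being used" and "rbd image X is still being used".
busy_rbd_images() {
    grep -oE 'rbd image [^ ]+ (is )?still being used' | awk '{ print $3 }' | sort -u
}

# Example wiring (assumes a systemd-managed kubelet; not executed here):
#   journalctl -u kubelet --no-pager | busy_rbd_images
```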

That is the key evidence! Check the rados watchers:

$ rbd info kube/jupyter
rbd image 'jupyter':
	size 30720 MB in 7680 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.1217303d4abff7
	format: 2
	features: layering
	flags:
// The rbd_header suffix is the ID that follows "rbd_data." in block_name_prefix
$ rados listwatchers -p kube rbd_header.1217303d4abff7
watcher=10.10.18.30:0/3785365512 client.20052960 cookie=18446462598732840961
$ ceph osd blacklist ls
listed 0 entries
$ ceph osd blacklist add 10.10.18.30:0/3785365512
blacklisting 10.10.18.30:0/3785365512 until 2018-03-23 14:02:15.706177 (3600 sec)
$ ceph osd blacklist ls
listed 1 entries
10.10.18.30:0/3785365512 2018-03-23 14:02:15.706177
$ rados listwatchers -p kube rbd_header.1217303d4abff7
$ ceph osd blacklist rm 10.10.18.30:0/3785365512
un-blacklisting 10.10.18.30:0/3785365512
$ ceph osd blacklist ls
listed 0 entries
$ rados listwatchers -p kube rbd_header.1217303d4abff7

After the steps above, I started the container again and it succeeded!
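The two lookups in the session above (deriving the rbd_header object from rbd info, then reading the watcher address from rados listwatchers) can be sketched as small text filters. The function names are hypothetical, and the parsing assumes the output formats shown in the session.

```shell
# Read `rbd info` output on stdin and print the matching rbd_header object
# name, i.e. "rbd_header." plus the ID that follows "rbd_data." in
# block_name_prefix.
rbd_header_object() {
    awk '/block_name_prefix:/ { sub(/^rbd_data\./, "", $2); print "rbd_header." $2 }'
}

# Read `rados listwatchers` output on stdin and print only the watcher
# addresses (the value after "watcher=").
watcher_addrs() {
    sed -n 's/^watcher=\([^ ]*\) .*/\1/p'
}

# Example wiring on a Ceph client (not executed here):
#   obj=$(rbd info kube/jupyter | rbd_header_object)
#   rados listwatchers -p kube "$obj" | watcher_addrs
```

Each printed address is what was passed to ceph osd blacklist add above. Note that recent Ceph releases spell the subcommand ceph osd blocklist.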

UPDATE: Possible cause 2

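None of the steps above helped. I then happened to notice that this rbd image had previously been mounted on another machine. If it can be unmounted there, unmount it; otherwise, try to unmap the rbd image. In my case the unmap failed, and it turned out that the earlier mount command had hung. In the end I had to reboot that machine, which resolved the problem.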

UPDATE: Possible cause 3

The image is not mapped on the corresponding host. After running sudo rbd map IMAGE -p POOLNAME, the volume can be used.
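To check whether an image is already mapped before calling rbd map, the output of rbd showmapped can be filtered. This is a sketch only: mapped_device is a hypothetical helper, and it assumes that showmapped prints a header row, lists the pool and image names as separate columns, and puts the device path in the last column (the exact columns vary between Ceph versions).

```shell
# mapped_device POOL IMAGE
# Reads `rbd showmapped` output on stdin, prints the device of the first row
# that mentions both POOL and IMAGE, and fails if no such row exists.
mapped_device() {
    awk -v pool="$1" -v image="$2" '
        NR > 1 {
            has_pool = 0; has_image = 0
            # Scan every column except the last one, which holds the device.
            for (k = 1; k < NF; k++) {
                if ($k == pool)  has_pool = 1
                if ($k == image) has_image = 1
            }
            if (has_pool && has_image) { print $NF; found = 1; exit }
        }
        END { exit !found }
    '
}

# Example wiring (not executed here):
#   dev=$(rbd showmapped | mapped_device POOLNAME IMAGE) || sudo rbd map IMAGE -p POOLNAME
```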

Error:

 timeout expired waiting for volumes to attach/mount for pod xxxx. list of unattached/unmounted volumes=[xxxxx]

Use rbd showmapped to find the corresponding device, then run fsck /dev/rbdN. In my case the disk reported errors, which I repaired with sudo e2fsck -y /dev/rbdN.

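If e2fsck takes too long because the files on the volume are very large, mount the device first, move the large files to a backup location, unmount it, and then run e2fsck: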

mount /dev/rbdN /mnt
mv /mnt/BIG_FILE /bak_dir
umount /mnt   # e2fsck must not run on a mounted filesystem
sudo e2fsck -y /dev/rbdN