mattshma/bigdata

Orphaned pod found, but volume paths are still present on disk

mattshma opened this issue · 10 comments

The machine restarted itself unexpectedly. Afterwards kubelet and the other components came up normally, but the GPU information on that node could no longer be retrieved. The kubelet log shows the following errors:

E0606 14:57:30.245413    3284 kubelet.go:1275] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
E0606 14:57:30.257693    3284 kubelet.go:1333] Failed to start gpuManager stat /dev/nvidiactl: no such file or directory
E0606 14:57:30.257973    3284 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
E0606 14:57:31.258134    3284 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
E0606 14:57:32.258269    3284 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
E0606 14:57:33.258393    3284 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
E0606 14:57:34.258637    3284 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
E0606 14:57:35.459204    3284 reconciler.go:376] Could not construct volume information: Volume: "kubernetes.io/rbd/[]:" is not mounted
E0606 14:57:35.459280    3284 reconciler.go:376] Could not construct volume information: Volume: "kubernetes.io/rbd/[]:" is not mounted
E0606 14:57:36.283582    3284 kubelet_volumes.go:128] Orphaned pod "2ec25470-6929-11e8-a2cf-005056b75104" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
E0606 14:57:38.262971    3284 kubelet_volumes.go:128] Orphaned pod "2ec25470-6929-11e8-a2cf-005056b75104" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them..
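
For anyone hitting the same messages, a quick way to confirm that a pod directory left under the kubelet data dir really is orphaned, i.e. its UID is no longer known to the API server, is something like the following (using the same kubelet root /var/lib/k8s/kubelet as in the commands below):

$ kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' > /tmp/known-uids
$ ls /var/lib/k8s/kubelet/pods | grep -vxF -f /tmp/known-uids   # UIDs on disk that the API server no longer knows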

Since none of the pods on this node were in use any more, the related volume information could be removed:

$ sudo systemctl stop kubelet kube-proxy
$ sudo rm -rf /var/lib/k8s/kubelet/pods/2ec25470-6929-11e8-a2cf-005056b75104
$ sudo systemctl start kubelet kube-proxy

After that everything returned to normal.
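
As a quick sanity check after the restart (assuming kubelet runs under systemd and logs to the journal), one can confirm that the orphaned-pod errors are gone and that the node advertises its GPU again; depending on the Kubernetes version the resource shows up as alpha.kubernetes.io/nvidia-gpu or nvidia.com/gpu, and <node-name> below is a placeholder for the affected node:

$ journalctl -u kubelet --since "10 min ago" | grep -i orphaned
$ kubectl describe node <node-name> | grep -i gpu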

Tracing back through the whole incident: pod 2ec25470-6929-11e8-a2cf-005056b75104 failed to start on this node for some reason, Kubernetes on the node then failed to reclaim the pod, which broke Kubernetes there and in turn affected the other instances running on the machine. The reason the pod failed to start was never found.

@mattshma You'd better not do it this way. If the pod has a PVC, rm -rf on the pod directory will also wipe the contents of the PVC. That is not how a pod should be reclaimed.

@NightmareZero Agreed, this is indeed not how a pod should normally be deleted. But what I removed here was an orphaned pod. If not this way, is there a better way to remove it? If so, please let me know, thanks!

Just a reminder: umount first. I got burned badly by this; all the data in my database was wiped.
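
In other words, a safer sequence than a bare rm -rf is to check what is still mounted under the orphaned pod's directory and unmount it first; a rough sketch using the same pod UID and kubelet root as above (<mount-point> stands for whatever the grep reports):

$ sudo systemctl stop kubelet kube-proxy
$ mount | grep 2ec25470-6929-11e8-a2cf-005056b75104
$ sudo umount <mount-point>     # repeat for every mount reported above
$ sudo rm -rf /var/lib/k8s/kubelet/pods/2ec25470-6929-11e8-a2cf-005056b75104
$ sudo systemctl start kubelet kube-proxy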

@NightmareZero Hmm, my storage is rbd. After deleting directly, the data was still there in rbd; I didn't run into the situation you describe.

I'm using rbd as well. With rm -rf, the device mounted from rbd cannot be removed, but the contents inside it do get wiped.
You can try it yourself: mount a device at /mnt/test/t1 and then run rm -rf /mnt/test; the data inside t1 will be cleared.
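
A harmless way to reproduce this without touching a real rbd device is a small loopback filesystem; a sketch (the /tmp/loop.img and /mnt/test paths are just for the experiment):

$ sudo mkdir -p /mnt/test/t1
$ dd if=/dev/zero of=/tmp/loop.img bs=1M count=64
$ mkfs.ext4 -F /tmp/loop.img
$ sudo mount -o loop /tmp/loop.img /mnt/test/t1
$ sudo touch /mnt/test/t1/data
$ sudo rm -rf /mnt/test        # wipes the contents of t1, then fails on the busy mount point
$ ls /mnt/test/t1              # empty: the data on the mounted filesystem is gone
$ sudo umount /mnt/test/t1     # clean up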

@NightmareZero Well, if it is still mounted then rm -rf will of course delete the data; that much I understand. I see what you mean now: my rbd had probably already been umounted at some earlier step. Thanks for the tip!
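
If it helps, a way to check after the fact whether an rbd image is still mapped and mounted on the node (assuming the ceph client tools are installed there):

$ rbd showmapped                # lists pool/image and the local /dev/rbdX device
$ mount | grep /dev/rbd         # shows whether any of those devices is currently mounted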

By the way, may I ask: have you done multi-node mounting of shared storage?

Yes, but not with rbd; I used cephfs.
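
For context, rbd volumes in Kubernetes are normally ReadWriteOnce (mounted read-write by a single node), while cephfs supports ReadWriteMany, which is what makes the multi-node case work. A minimal sketch of a cephfs-backed claim that pods on different nodes could share (the storage class name cephfs is an assumption, adjust to your cluster):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany            # cephfs allows mounting from several nodes at once
  storageClassName: cephfs     # assumed storage class name
  resources:
    requests:
      storage: 10Gi
EOF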

Oh, I see.

@mattshma You'd better not do it this way. If the pod has a PVC, rm -rf on the pod directory will also wipe the contents of the PVC. That is not how a pod should be reclaimed.

I agree with your point :)