mattshma/bigdata

Container fails to start because of a corrupted file system

mattshma opened this issue · 1 comments

Starting the container failed with the following errors:

Exec lifecycle hook ([/bin/sh -c /post_start.sh]) for Container "tensorflow-cpu" in Pod "tensorflow-cpu(042d3a04-c3bd-11e8-bc0a-005056b73f59)" failed - error: command '/bin/sh -c /post_start.sh' exited with 1: Could not open requirements file: [Errno 2] No such file or directory: '/home/jovyan/work/requirements.txt' rm: cannot remove '/home/jovyan/work/requirements.txt': No such file or directory, message: "Could not open requirements file: [Errno 2] No such file or directory: '/home/jovyan/work/requirements.txt'\nrm: cannot remove '/home/jovyan/work/requirements.txt': No such file or directory\n"
Exec lifecycle hook ([/bin/sh -c /pre_stop.sh]) for Container "tensorflow-cpu" in Pod "tensorflow-cpu(042d3a04-c3bd-11e8-bc0a-005056b73f59)" failed - error: command '/bin/sh -c /pre_stop.sh' exited with 1: /pre_stop.sh: line 5: /home/jovyan/work/requirements.txt: File exists, message: "/pre_stop.sh: line 5: /home/jovyan/work/requirements.txt: File exists\n"

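Going by the errors, the container failed to start because requirements.txt did not exist, yet on shutdown the hook complained that the same file already existed. That was odd, so I cd'ed into the target directory on the corresponding host and listed the files: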

[root@host1 tensorflow-cpu]# ll
ls: cannot access requirements.txt: No such file or directory
total 24
-????????? ? ?   ?         ?            ? requirements.txt

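Clearly, the file system can no longer read this file's metadata: the directory entry exists, but it cannot be stat'ed, which points to file system corruption.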
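
To spot such entries without reading the listing by eye, the `ls -l` output can be filtered for the `?????????` permission column. This is only a minimal sketch: `list_unreadable_entries` is a hypothetical helper name, and it assumes file names contain no whitespace.

```shell
# Hypothetical helper (not part of the original report):
# read `ls -l` output on stdin and print the names of entries whose
# metadata could not be read, i.e. whose mode column is "?????????".
# Assumes file names contain no whitespace (only the last field is printed).
list_unreadable_entries() {
    awk '$1 ~ /^.[?][?][?][?][?][?][?][?][?]$/ { print $NF }'
}
```

Run from the affected directory, it could be used as `ls -l 2>/dev/null | list_unreadable_entries`, which for the listing above would print `requirements.txt`.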

As a fix, I first tried to repair the file system, but that failed:

# e2fsck /dev/rbd18
e2fsck 1.42.9 (28-Dec-2013)
/dev/rbd18 is in use.
e2fsck: Cannot continue, aborting.
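
e2fsck refuses to run on a device it considers in use, which typically means the device is still mounted. The report does not say where /dev/rbd18 was mounted, so the following is only a sketch of how one might check; `mount_points_of` is a hypothetical helper, and reading `/proc/mounts` is an assumption about the host.

```shell
# Hypothetical helper (not part of the original report):
# print every mount point of device $1, read from the mounts table in $2
# (defaults to /proc/mounts). Prints nothing if the device is not mounted.
mount_points_of() {
    awk -v dev="$1" '$1 == dev { print $2 }' "${2:-/proc/mounts}"
}
```

For example, `mount_points_of /dev/rbd18` would list the paths that still hold the device, if any.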

Following #122, I unmapped the device and mapped it again. No file system repair is needed; after that, just start the container:

# rbd unmap -o force /dev/rbd18
2018-09-29 16:04:56.132548 7faf58fd5d80 -1 did not load config file, using default settings.
# rbd map k8s/tensorflow-cpu
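
The two commands above can be wrapped in a small function with a dry-run switch, so the sequence can be reviewed before it touches a device. This is a sketch only: `remap_rbd`, `run_or_print`, and the `DRY_RUN` variable are hypothetical names, and the `rbd` invocations are exactly the ones shown above. The device path after re-mapping may differ from the original one.

```shell
# Hypothetical wrapper (not part of the original report).
# run_or_print: execute the command, or only print it when DRY_RUN=1.
run_or_print() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# remap_rbd DEVICE IMAGE
#   DEVICE: mapped block device, e.g. /dev/rbd18
#   IMAGE:  pool/image spec, e.g. k8s/tensorflow-cpu
# Force-unmaps the device, then maps the image again; stops if unmap fails.
remap_rbd() {
    run_or_print rbd unmap -o force "$1" &&
        run_or_print rbd map "$2"
}
```

With `DRY_RUN=1` exported, `remap_rbd /dev/rbd18 k8s/tensorflow-cpu` only prints the two `rbd` commands; without it, they are executed in order.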