NVIDIA/enroot

not possible to run enroot start when operating system is running on rootfs (stateless server boot)


Consider a stateless cluster (such as one deployed by Warewulf) whose root filesystem lives in RAM, for example:

root@node1:/tmp# df -h /
Filesystem      Size  Used Avail Use% Mounted on
rootfs         1001G   16G  985G   2% /
root@node1:/tmp# mount | grep rootfs
rootfs on / type rootfs (rw,size=1048948424k,nr_inodes=262237106,inode64)
root@node1:/tmp# 

On such a system you cannot start enroot containers. This happens:

root@node1:/tmp# enroot start raf-ssd
enroot-switchroot: failed to switch root: /raid/enroot/raf-ssd: Invalid argument
root@node1:/tmp# 

strace snippet:

pivot_root(".", ".")                    = -1 EINVAL (Invalid argument)

There seems to be a hard requirement for enroot to do a pivot_root syscall:

if ((int)syscall(SYS_pivot_root, ".", ".") < 0)

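Unfortunately, pivot_root is not supported when the current root is the rootfs (initial ramfs) mount, which is the case for a stateless/memory-based root disk. The kernel returns EINVAL in that situation.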

The nvidia-container-cli binary provides a --no-pivot flag, which presumably works for Docker, but there is no equivalent for enroot.

root@node1:/tmp# nvidia-container-cli --help | grep pivot
  -n, --no-pivot             Do not use pivot_root
root@node1:/tmp#

3XX0 commented

Yeah we don't support doing this for now. It should be fairly straightforward to change though.
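
For illustration only, one shape such a change could take is to keep pivot_root as the default and fall back when it fails with EINVAL. This is an untested sketch, not enroot's implementation: the function names are hypothetical, and the fallback (moving the new root over "/" and then calling chroot) is modeled on what the switch_root(8) utility does. It assumes the current directory is already the new root and is a mount point. A chroot-based fallback does not give the same isolation as pivot_root, and any cleanup the existing code performs after pivot_root is not shown.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Decide whether a failed pivot_root should trigger the fallback.
 * Only EINVAL is treated as "root is on rootfs"; this is an assumption,
 * since EINVAL has other causes too. */
int should_fallback(int rc, int err)
{
    return rc < 0 && err == EINVAL;
}

/* The same syscall quoted above; the cwd must be the new root. */
int try_pivot_root(void)
{
    return (int)syscall(SYS_pivot_root, ".", ".");
}

/* Fallback in the style of switch_root(8): move the new root
 * over "/", chroot into it, then reset the working directory. */
int move_and_chroot(void)
{
    if (mount(".", "/", NULL, MS_MOVE, NULL) < 0)
        return -1;
    if (chroot(".") < 0)
        return -1;
    return chdir("/");
}

/* Hypothetical entry point: pivot_root first, fallback on EINVAL.
 * Returns 0 on success, -1 on failure with errno set. */
int switch_root_or_fallback(void)
{
    int rc = try_pivot_root();

    if (rc == 0)
        return 0;
    if (!should_fallback(rc, errno))
        return -1;
    return move_and_chroot();
}
```

Keeping the decision in should_fallback makes it easy to gate the fallback behind an explicit option instead, similar in spirit to the --no-pivot flag of nvidia-container-cli.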

Hi, same error here. In my case I modified enroot-switchroot.c, replacing the pivot_root call with a chroot. That makes enroot work, but it also makes pyxis fail with the following:

$ srun --container-image=ubuntu grep PRETTY /etc/os-release
srun: job 14791 queued and waiting for resources
srun: job 14791 has been allocated resources
pyxis: importing docker image: ubuntu
pyxis: imported docker image: ubuntu
PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"

It imports the container image but does not chroot into it: the command reads the host's /etc/os-release (Rocky Linux) instead of the one in the ubuntu image.