NVIDIA/enroot

Critical issue; invalid parameter potential leading to loss of files

Closed this issue · 5 comments

On our Slurm cluster we use pyxis and are very happy with it. We have it setup without automatic HOME mount (if this has any influence on the current issue at all).

Recently we had a user who initiated the command

srun --container-image=<image url> --container-save="./" bash -c '<command>'

from his home folder and managed to delete all of his existing files.

A quick review of pyxis code indicates that this may be because, based on the documentation, it is simply assumed that the user will supply a file path, as instructed, not a folder path. Could something in the likes of

	if (optarg[strlen(optarg)-1] == '/') {
		slurm_error("pyxis: --container-save: argument must be file, not path!")
		return (-1)
	}

be added to https://github.com/NVIDIA/pyxis/blob/424563865a491f2d8f3c396a095d18db81f1fb07/args.c#L385

@3XX0 for review, I think maybe we can fix that in enroot, by changing this line:

common::rmall "${filename}"

@itzsimpl if you manually apply the patch 2bd5143 then this issue should not be possible anymore.
Sorry about that!

@flx42 thank you for the patch, I can confirm that the loss of data does not occur anymore. The downside of the resolution being pushed to Enroot is that now all the "work" done inside the container is lost. In other words, if Pyxis handles the case, the user will get notified before the job even starts; if Enroot handles the case, the user will update the container, but on exit it will fail on save. From the UX perspective this may be worth to consider.

A simple test.

$ srun -p dev --container-image=ubuntu:20.04 --container-save=./ ls
srun: job 8203 queued and waiting for resources
srun: job 8203 has been allocated resources
pyxis: importing docker image: ubuntu:20.04
pyxis: imported docker image: ubuntu:20.04
bin
boot
dev
etc
home
lib
lib32
lib64
libx32
local
media
mnt
opt
proc
root
run
sbin
shared
srv
sys
tmp
usr
var
slurmstepd: error: pyxis: child 220192 failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     rm: cannot remove '/shared/home/ilb/test-remove': Is a directory
slurmstepd: error: pyxis: failed to export container pyxis_8203_8203.0 to ./
slurmstepd: error: spank: required plugin spank_pyxis.so: task_exit() failed with rc=-1
slurmstepd: error: Unable to spank task 0 at exit

Yes I can also add a check to pyxis, but it will not catch all cases, like this: --container-save=/home/flx42

You're right, trying to handle all possible cases might be an overkill. Thank you for the quick fix.