How to mount GPU devices correctly in nsjail?
radkris-git opened this issue · 0 comments
Hi, I'm trying to run a simple PyTorch tensor add on a GPU under nsjail on a GCP nvidia-tesla-t4
node, and I'm getting the error shown below.
nsjail_pytorch.cfg
mount {
src: "/home/current_user_ldap/pytorch_env"
dst: "/home/current_user_ldap/pytorch_env"
is_bind: true
}
mount {
src: "/dev/nvidia0"
dst: "/dev/nvidia0"
is_bind: true
rw: true
}
mount {
src: "/dev/nvidiactl"
dst: "/dev/nvidiactl"
is_bind: true
rw: true
}
mount {
src: "/dev/nvidia-uvm"
dst: "/dev/nvidia-uvm"
is_bind: true
rw: true
}
mount {
src: "/usr"
dst: "/usr"
is_bind: true
rw: true
}
# for libs
mount {
src: "/lib64"
dst: "/lib64"
is_bind: true
}
mount {
src: "/lib"
dst: "/lib"
is_bind: true
rw: true
}
cwd: "/home/current_user_ldap/pytorch_env/"
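The three device stanzas above differ only in the path, so they could be generated instead of written by hand. This is a minimal sketch, assuming only the fields already used in this config (src, dst, is_bind, rw). The names render_mount, render_device_mounts, and DEVICE_NODES are hypothetical, and the device list is just the three nodes mounted above, not a confirmed complete set of what CUDA needs.

```python
# Hypothetical helper: renders nsjail mount stanzas using only the fields
# that already appear in nsjail_pytorch.cfg (src, dst, is_bind, rw).

# Assumption: these are the three nodes mounted in the config above; CUDA may
# need other nodes that are not listed here.
DEVICE_NODES = ["/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"]


def render_mount(path: str, rw: bool = False) -> str:
    """Return one bind-mount stanza mapping `path` to the same path in the jail."""
    lines = [
        "mount {",
        f'src: "{path}"',
        f'dst: "{path}"',
        "is_bind: true",
    ]
    if rw:
        # Emit rw only when requested, matching the read-only stanzas above.
        lines.append("rw: true")
    lines.append("}")
    return "\n".join(lines)


def render_device_mounts(paths=DEVICE_NODES) -> str:
    """Render one read-write stanza per device node."""
    return "\n".join(render_mount(p, rw=True) for p in paths)


if __name__ == "__main__":
    print(render_device_mounts())
```

Running the script prints stanzas in the same shape as the ones in the config, which makes it easier to add or drop a device node while experimenting.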
Running a simple PyTorch tensor add on the CPU works:
nsjail -Mo --chroot / --rlimit_nproc 6553 --rlimit_fsize inf --rlimit_as inf -- /usr/bin/python3 -c "import torch; a = torch.tensor([1.0, 2.0], device='cpu') + torch.tensor([3.0, 4.0], device='cpu'); print(a)"
This prints the expected tensor output of [4, 6].
Running PyTorch on the GPU fails: torch.cuda.is_available() returns False.
nsjail -Mo --config nsjail_pytorch.cfg --chroot / --rlimit_nproc 6553 --rlimit_fsize inf --rlimit_as inf -- /usr/bin/python3 -c "import torch; print(torch.cuda.is_available());"
[I][2024-08-10T02:03:04+0000] Mode: STANDALONE_ONCE
[I][2024-08-10T02:03:04+0000] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/usr/bin/python3', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:600, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2024-08-10T02:03:04+0000] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/home/current_user_ldap/pytorch_env' -> '/home/current_user_ldap/pytorch_env' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia0' -> '/dev/nvidia0' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidiactl' -> '/dev/nvidiactl' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia-uvm' -> '/dev/nvidia-uvm' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/usr' -> '/usr' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib64' -> '/lib64' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib' -> '/lib' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Uid map: inside_uid:1002 outside_uid:1002 count:1 newuidmap:false
[I][2024-08-10T02:03:04+0000] Gid map: inside_gid:1003 outside_gid:1003 count:1 newgidmap:false
[I][2024-08-10T02:03:06+0000] Executing '/usr/bin/python3' for '[STANDALONE MODE]'
/home/current_user_ldap/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
[I][2024-08-10T02:03:08+0000] pid=28434 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
NVIDIA-SMI runs fine under nsjail
nsjail -Mo --config nsjail_pytorch.cfg --chroot / --rlimit_nproc 6553 --rlimit_as inf -- /bin/nvidia-smi
The above successfully prints the actual nvidia-smi output.
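Since nvidia-smi succeeds while cudaGetDeviceCount() fails, it may help to compare what the jailed process can see with what the host sees. This is a minimal sketch, assuming the three device nodes from the config are the relevant ones. It checks whether each node exists, is a character device, and can be opened read-write. The names probe and NODES are hypothetical, and the script is a diagnostic aid only, not a known fix.

```python
import os
import stat

# Assumption: the nodes mounted in nsjail_pytorch.cfg; extend the list if the
# host exposes additional NVIDIA device nodes.
NODES = ["/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"]


def probe(path: str) -> dict:
    """Report whether `path` exists, is a character device, and opens read-write."""
    info = {
        "path": path,
        "exists": False,
        "char_device": False,
        "open_rw": False,
        "error": None,
    }
    try:
        st = os.stat(path)
    except OSError as exc:
        # Missing node or no permission to stat it.
        info["error"] = exc.strerror
        return info
    info["exists"] = True
    info["char_device"] = stat.S_ISCHR(st.st_mode)
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # The node is visible but cannot be opened by the current uid/gid.
        info["error"] = exc.strerror
    else:
        os.close(fd)
        info["open_rw"] = True
    return info


if __name__ == "__main__":
    for node in NODES:
        print(probe(node))
```

Running the script once on the host and once through the same nsjail command line (with the script path in place of the -c argument) should show whether any node differs between the two environments.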
Notes
- PyTorch on the CPU works fine under nsjail (no issues).
- nvidia-smi works under nsjail
- Running PyTorch without nsjail on GPU succeeds.
This doesn't look like a PyTorch or host issue, given that PyTorch works on the GPU without nsjail. Any help is appreciated.