google/nsjail

How to mount GPU devices correctly in nsjail?

radkris-git opened this issue · 0 comments

Hi, I'm trying to run a simple PyTorch tensor add on the GPU under nsjail on a GCP nvidia-tesla-t4 node, and I'm getting the following error.

nsjail_pytorch.cfg

mount {
  src: "/home/current_user_ldap/pytorch_env"
  dst: "/home/current_user_ldap/pytorch_env"
  is_bind: true
}
mount {
  src: "/dev/nvidia0"
  dst: "/dev/nvidia0"
  is_bind: true
  rw: true
}
mount {
  src: "/dev/nvidiactl"
  dst: "/dev/nvidiactl"
  is_bind: true
  rw: true
}
mount {
  src: "/dev/nvidia-uvm"
  dst: "/dev/nvidia-uvm"
  is_bind: true
  rw: true
}
mount {
  src: "/usr"
  dst: "/usr"
  is_bind: true
  rw: true
}
# for libs
mount {
  src: "/lib64"
  dst: "/lib64"
  is_bind: true
}
mount {
  src: "/lib"
  dst: "/lib"
  is_bind: true
  rw: true
}
cwd: "/home/current_user_ldap/pytorch_env/"

Running simple PyTorch Tensor Add on CPU works.

nsjail -Mo --chroot /   --rlimit_nproc 6553   --rlimit_fsize inf --rlimit_as inf   -- /usr/bin/python3 -c "import torch; a = torch.tensor([1.0, 2.0], device='cpu') + torch.tensor([3.0, 4.0], device='cpu'); print(a)" 

This prints the expected tensor output of [4, 6].

Running simple PyTorch Tensor Add on GPU fails

nsjail -Mo --config nsjail_pytorch.cfg  --chroot /  --rlimit_nproc 6553   --rlimit_fsize inf --rlimit_as inf    -- /usr/bin/python3 -c "import torch; print(torch.cuda.is_available());"
[I][2024-08-10T02:03:04+0000] Mode: STANDALONE_ONCE
[I][2024-08-10T02:03:04+0000] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/usr/bin/python3', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:600, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2024-08-10T02:03:04+0000] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/home/current_user_ldap/pytorch_env' -> '/home/current_user_ldap/pytorch_env' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia0' -> '/dev/nvidia0' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidiactl' -> '/dev/nvidiactl' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia-uvm' -> '/dev/nvidia-uvm' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/usr' -> '/usr' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib64' -> '/lib64' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib' -> '/lib' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Uid map: inside_uid:1002 outside_uid:1002 count:1 newuidmap:false
[I][2024-08-10T02:03:04+0000] Gid map: inside_gid:1003 outside_gid:1003 count:1 newgidmap:false
[I][2024-08-10T02:03:06+0000] Executing '/usr/bin/python3' for '[STANDALONE MODE]'
/home/current_user_ldap/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
[I][2024-08-10T02:03:08+0000] pid=28434 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
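
Error 304 only says that an OS call failed, so one way to narrow it down is to check whether the bind-mounted device nodes can actually be opened from inside the jail. The following is a minimal diagnostic sketch, not something taken from the nsjail or CUDA documentation: the `probe_device` helper is a hypothetical name, and the node list is simply copied from the mount entries in nsjail_pytorch.cfg. It uses only the Python standard library.

```python
import errno
import os
import stat

# Device nodes copied from the mount entries in nsjail_pytorch.cfg.
DEVICE_NODES = ["/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"]


def probe_device(path):
    """Try to open `path` read-write and return (ok, detail)."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Report the symbolic errno (e.g. ENOENT, EACCES, EPERM) plus its message.
        name = errno.errorcode.get(exc.errno, "UNKNOWN")
        return False, f"{name}: {exc.strerror}"
    try:
        mode = os.fstat(fd).st_mode
        kind = "character device" if stat.S_ISCHR(mode) else "not a character device"
        return True, kind
    finally:
        os.close(fd)


if __name__ == "__main__":
    for node in DEVICE_NODES:
        ok, detail = probe_device(node)
        print(f"{'OK  ' if ok else 'FAIL'} {node}: {detail}")
```

Running this script in place of the `-c "import torch; ..."` argument, with the same nsjail flags and config, would show whether any node fails to open and with which errno. If all three open cleanly, the failure presumably lies somewhere other than basic access to these nodes.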

NVIDIA-SMI runs fine under nsjail

nsjail -Mo --config nsjail_pytorch.cfg  --chroot /  --rlimit_nproc 6553 --rlimit_as inf   -- /bin/nvidia-smi

The above prints the actual nvidia-smi output successfully.

Notes

  • PyTorch on CPU works fine under nsjail (no issues)
  • nvidia-smi works under nsjail
  • Running PyTorch without nsjail on GPU succeeds.

This doesn't look like a PyTorch or host issue, given that PyTorch works on the GPU without nsjail. Any help is appreciated.
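
For anyone comparing setups, here is a small sketch that lists the NVIDIA entries present under /dev on the host that the config above does not bind-mount. This is only a comparison aid: I am not claiming that any additional entry is required, the `/dev/nvidia*` glob pattern is an assumption about how the driver names its nodes, and `unmounted_nodes` is a hypothetical helper name.

```python
import glob

# Nodes that nsjail_pytorch.cfg bind-mounts into the jail.
MOUNTED = {"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}


def unmounted_nodes(present, mounted=MOUNTED):
    """Return the sorted paths in `present` that are not in `mounted`."""
    return sorted(set(present) - set(mounted))


if __name__ == "__main__":
    # Run on the host, outside nsjail. The glob pattern is an assumption.
    for path in unmounted_nodes(glob.glob("/dev/nvidia*")):
        print(path)
```

An empty result would mean the config already covers every matching entry on the host.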