google/nsjail

Disable creation of cgroups in jail

Dragoncraft89 opened this issue · 1 comment

If /sys/fs/cgroup is mounted inside the jail, the jailed application can create new cgroups. Even though it cannot enter them, creating many of them could exhaust the kernel's cgroup resources.

Example:

$ ./nsjail --chroot / -- /bin/bash
[I][2022-07-29T14:15:12+0200] Mode: STANDALONE_ONCE
[I][2022-07-29T14:15:12+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-07-29T14:15:12+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-07-29T14:15:12+0200] Mount: '/proc' flags:MS_RDONLY type:'proc' options:'' dir:true
[I][2022-07-29T14:15:12+0200] Uid map: inside_uid:1000 outside_uid:1000 count:1 newuidmap:false
[I][2022-07-29T14:15:12+0200] Gid map: inside_gid:1000 outside_gid:1000 count:1 newgidmap:false
[I][2022-07-29T14:15:12+0200] Executing '/bin/bash' for '[STANDALONE MODE]'
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

$ mkdir /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/mycgroup

$ ls /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/mycgroup   
cgroup.controllers  cgroup.max.descendants  cgroup.type     memory.events        memory.min        memory.swap.current  pids.events
cgroup.events       cgroup.procs            cpu.pressure    memory.events.local  memory.numa_stat  memory.swap.events   pids.max
cgroup.freeze       cgroup.stat             cpu.stat        memory.high          memory.oom.group  memory.swap.high
cgroup.kill         cgroup.subtree_control  io.pressure     memory.low           memory.pressure   memory.swap.max
cgroup.max.depth    cgroup.threads          memory.current  memory.max           memory.stat       pids.current

$ # However, the new cgroup cannot be entered
$ echo $$ > /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/mycgroup/cgroup.procs 
bash: echo: write error: No such file or directory

My current workaround is to not mount /sys/fs/cgroup into the jail, but could cgroup creation be prevented by default?

If you drop --chroot /, this should no longer be possible. The problem is that your nsjail configuration exposes the entire root directory inside the sandbox with the same UID/GID configuration.
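
Whichever approach is used (not mounting /sys/fs/cgroup, or dropping --chroot /), it may help to confirm from inside the jail that no cgroup filesystem is visible. The following is a minimal sketch, not part of nsjail: the helper name has_cgroup_mount is hypothetical, and it assumes a mount table in the usual /proc/self/mounts format, where the third field is the filesystem type.

```shell
#!/bin/sh
# Hypothetical helper (not an nsjail feature): succeeds if the given mount
# table lists a cgroup (v1) or cgroup2 (v2) filesystem.
# Assumption: the table uses the /proc/self/mounts layout, i.e.
#   <source> <mountpoint> <fstype> <options> <dump> <pass>
has_cgroup_mount() {
    awk '$3 == "cgroup" || $3 == "cgroup2" { found = 1 }
         END { exit !found }' "${1:-/proc/self/mounts}"
}

# Example use inside the jailed shell:
if has_cgroup_mount; then
    echo "a cgroup filesystem is mounted in this namespace"
else
    echo "no cgroup filesystem is mounted in this namespace"
fi
```

This only inspects the mount table. It does not show whether the jailed user could actually create directories under a mounted cgroup tree, which also depends on ownership and delegation of that subtree.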