containers/podman

Rootless podman using --device and --group-add keep-groups not working as expected

KCSesh opened this issue · 15 comments

Description

I am trying to understand how to properly use --device in a rootless podman container.
Currently, when I add a device to the rootless container, I see that the device is owned by: nobody nogroup

$ ls -la /dev/
...
crw-rw----  1 nobody nogroup 505,  1 Apr 26 18:32 nvhost-as-gpu
...

I have seen this on the troubleshooting: https://github.com/containers/podman/blob/master/troubleshooting.md#20-passed-in-device-cant-be-accessed-in-rootless-container

But this is only a solution for crun. Is there one for runc?
I have pulled the latest podman and have attempted to use:
http://docs.podman.io/en/latest/markdown/podman-run.1.html#device-host-device-container-device-permissions

--group-add keep-groups

But this does not seem to change behavior, I still see the device is owned by: nobody nogroup

I believe this issue is preventing me from accessing my GPU in a rootless container.
See here if you want specific details: NVIDIA/nvidia-container-runtime#85 (comment)

What are my options? Do I need to migrate to crun? Will that work? Should this be working with runc and --group-add?

Steps to reproduce the issue:

  1. podman run -it --device </dev/some-mnt>:</dev/some-mnt> --group-add keep-groups

  2. $ ls -la /dev

  3. Output will show device is owned by nobody nogroup

  4. I have also tried with --group-add video with no luck either.

Describe the results you received:

$ ls -la /dev/
...
crw-rw----  1 nobody nogroup 505,  1 Apr 26 18:32 nvhost-as-gpu
...

Describe the results you expected:
I would expect to be able to see the video group.

$ ls -la /dev/
...
crw-rw----  1 nobody video 505,  1 Apr 26 18:32 nvhost-as-gpu
...

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman --version
podman version 3.2.0-dev

Output of podman info --debug:

podman --storage-driver=vfs --root /data/podman-root/ --runroot /data/podman-run-root/ info --debug
host:
  arch: arm64
  buildahVersion: 1.20.1-dev
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: Unknown
    path: /usr/local/libexec/podman/conmon
    version: 'conmon version 2.0.28-dev, commit: 3770524c7d9c95fe703460a9168350ee5db7be03'
  cpus: 8
  distribution:
    distribution: tegra-ubuntu
    version: "18.04"
  eventLogger: file
  hostname: ubuntu
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 4.9.140
  linkmode: dynamic
  memFree: 27120275456
  memTotal: 33338081280
  ociRuntime:
    name: runc
    package: 'runc: /usr/sbin/runc'
    path: /usr/sbin/runc
    version: 'runc version spec: 1.0.1-dev'
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_AUDIT_WRITE,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_MKNOD,CAP_NET_BIND_SERVICE,CAP_NET_RAW,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    selinuxEnabled: false
  slirp4netns:
    executable: /data/downloads/slirp4netns/slirp4netns
    package: Unknown
    version: |-
      slirp4netns version 1.1.9
      commit: 4e37ea557562e0d7a64dc636eff156f64927335e
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.3.3
  swapFree: 16669016064
  swapTotal: 16669016064
  uptime: 45h 22m 54.28s (Approximately 1.88 days)
registries:
  search:
  - docker.io
  - registry.fedoraproject.org
  - registry.access.redhat.com
store:
  configFile: /home/<username>/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: vfs
  graphOptions: {}
  graphRoot: /data/podman-root
  graphStatus: {}
  imageStore:
    number: 0
  runRoot: /data/podman-run-root
  volumePath: /data/podman-root/volumes
version:
  APIVersion: 3.2.0-dev
  Built: 1619474073
  BuiltTime: Mon Apr 26 21:54:33 2021
  GitCommit: 2039be00d12afaab84659619c47a463cacb039f5
  GoVersion: go1.16
  OsArch: linux/arm64
  Version: 3.2.0-dev

Package info (e.g. output of rpm -q podman or apt list podman):

I built podman from source for ubuntu 18.04 on ARM

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):
Physical

mheon commented

The runc OCI runtime does not support the annotation required for retaining groups, and I see no indication that this will change in the near future. I suggest you switch to crun if you require it.

There has been an open PR for many months on this, but no movement.

So is --group-add keep-groups specific to crun? @mheon @rhatdan

mheon commented

Yes.

I tested swapping out to "crun" and this actually worked!
Which allows me to have GPU support in rootless! This is very exciting!

One small note: the troubleshooting page still says:
--annotation io.crun.keep_original_groups=1

But it should be:
--annotation run.oci.keep_original_groups=1

See here for details: #4477

--group-add keep-groups also worked, which is nice. I just had to pull mainline for it.

I do have 2 followup questions as well.

  1. I tried adding --group-add video myself, but this was not enough. It does not detect the GPU. Is there somewhere in my container I can see the groups that were kept/mapped from the host when I add --group-add keep-groups?

  2. What does keeping the original groups mean from a security perspective?
    Is it giving the container more privilege somehow? I mean it must, because I can now access my GPU.
    I have read this: https://www.redhat.com/sysadmin/supplemental-groups-podman-containers
    But that doesn't really answer the question I am asking.
    Essentially, what is the difference between running a rootless container without keeping the groups and running a rootless container keeping the groups?

And thanks again @mheon and @rhatdan for your help \o/.

mheon commented

Basically, the annotation is causing the OCI runtime to skip one of the normal steps of setting up a container, which involves dropping additional group memberships. I'm actually writing a blog that includes many details on this at the moment.

mheon commented

It does definitely increase the privileges allowed to the container. The container process, if it breaks out of the container, now has access to the groups of the user that launched Podman, which could potentially include important ones (wheel, for example).

But note, this is only for group access via GID. SELinux, dropped capabilities, the user namespace, and seccomp are still in effect, so taking advantage of wheel for sudo access is still going to be blocked.
The bottom line: if SELinux does not block access to a file that is only readable/writable via supplemental group access, and the container process breaks out, then that process would be able to read/write this file. But if a containerized process breaks out to your homedir, it most likely already has the ability to read/write everything in $HOME (luckily, SELinux blocks almost all of this access).

Not that this needs to remain open, but is there a way to see how the groups are 'kept' and where they are mapped?
So if I wanted to do this myself I could?

mheon commented

@vrothberg Is this something that psgo does (or could do)?

This would seem like a good job for psgo.

--hgroups

To rephrase my question, because I don't need to view the mappings per se. (Though it would be nice)

Essentially, is there a way I can map the groups myself with podman?
Meaning my understanding is that I needed the video group to get access to my GPU.
When I add --group-add keep-groups it works because per my understanding it is correctly mapping the video group.

However, when I tried to do --group-add video the container starts but I do not have access to my GPU, with my best guess being that I am missing an important mapping step?
So I am wondering how I can do this without using --group-add keep-groups and control the mapping myself?
OR is this the only way it will work using: --group-add keep-groups ?

When you do --group-add video, it adds the video group defined inside of the container image to the primary process of the container.

grep video /etc/group
video:x:39:

So now inside of the container the process will have group 39, BUT this is not the same as group 39 on the host. When running rootless containers you are using a user namespace, so the group is offset by the user namespace you have joined.

$ podman unshare cat /proc/self/gid_map
0 3267 1
1 100000 65536

Which means that the video group inside of the container is going to be GID 100038 on the host.

ctr=$(podman run -d --group-add video fedora sleep 100)
pid=$(podman top -l hpid | tail -1)
grep Groups /proc/$pid/status
Groups:	100038 
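
The offset above can be worked out by hand from the gid_map. Each line is "first container ID, first host ID, range length", so container GID 39 falls in the second range and becomes 100000 + (39 - 1) = 100038. A minimal sketch of that lookup follows; container_to_host_gid is a hypothetical helper written for illustration, not a podman command, and it reads the map from stdin so it can be fed either sample text or the output of podman unshare cat /proc/self/gid_map.

```shell
# Hypothetical helper: translate a container GID to the host GID,
# using a gid_map read from stdin.
# Each gid_map line is: <first container id> <first host id> <range length>
container_to_host_gid() {
    awk -v gid="$1" '
        BEGIN { g = gid + 0 }                      # force numeric comparison
        g >= $1 && g < $1 + $3 {                   # GID falls inside this range
            print $2 + (g - $1); found = 1; exit
        }
        END { if (!found) print "unmapped" }
    '
}

# The map shown in this thread; prints 100038
printf '0 3267 1\n1 100000 65536\n' | container_to_host_gid 39
```

On a real system the same function could be fed the live map, for example: podman unshare cat /proc/self/gid_map | container_to_host_gid 39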

In order to access the video device on the host, the process needs GID=39, so it fails. When you run with --group-add keep-groups, the OCI container runtime (crun) does not make the setgroups call, so the new container process keeps the groups of its parent process. If the parent process had access to GID=39, the processes inside of the container will still have that GID. Note that inside of the container GID 39 is not mapped, so the processes within the container will see this as the nobody group.

./bin/podman run --group-add keep-groups fedora groups
root nobody
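
The "nobody" entry can be explained by running the lookup in the other direction: a host GID that no gid_map range covers has no name inside the user namespace, and the kernel reports it as the overflow GID (65534 by default), which images usually name nobody or nogroup. A rough sketch, assuming the same sample map as above; host_to_container_gid is a hypothetical helper, not part of podman.

```shell
# Hypothetical helper: translate a host GID to the GID seen inside the
# user namespace, using a gid_map read from stdin.
# Each gid_map line is: <first container id> <first host id> <range length>
host_to_container_gid() {
    awk -v gid="$1" '
        BEGIN { g = gid + 0 }                      # force numeric comparison
        g >= $2 && g < $2 + $3 {                   # host GID falls inside this range
            print $1 + (g - $2); found = 1; exit
        }
        END { if (!found) print "unmapped" }       # shown as nobody in the container
    '
}

# Host GIDs from this thread: the user's own group, video, and the offset video GID
for gid in 3267 39 100038; do
    printf '%s -> ' "$gid"
    printf '0 3267 1\n1 100000 65536\n' | host_to_container_gid "$gid"
done
```

With this map, host GID 39 comes back as unmapped. That matches the thread: keep-groups lets the process retain the host video group, but the container can only display it as nobody.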

Sorry for asking in an already closed issue, but I cannot find more information about this.

Does keep-groups keep all extra groups? Or is there a limit?