containers/podman

Failing to run MPI Podman example on HPC cluster with subuid/subgid mapping restrictions

secondspass opened this issue · 19 comments

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

We're testing out Podman on one of our HPC clusters at ORNL, and we are setting up rootless Podman to do it. For security and administrative reasons, we can't set up and maintain the user ID mappings in /etc/subuid and /etc/subgid. We've gotten to the point where we can run a single container, or multiple containers as a Slurm job started with mpirun, but those containers are separate from each other, i.e. not talking to each other. Now we're trying to figure out how to do MPI where the containers talk to each other. I'm following along with this tutorial, but I am getting the error shown in the 'Describe the results you received' section. This is similar to this other GitHub issue, but there the problem was resolved by setting up the mappings in the /etc/subuid and /etc/subgid files. That is not a mapping we can maintain, because of the administrative overhead of maintaining entries for hundreds of users on each node in the cluster (plus adding new ones all the time). And giving setuid capabilities to newuidmap and newgidmap is something our security folks have pushed back against.

Is there a way to provide the MPI functionality with rootless Podman with these restrictions in mind? And if you can point me to other centers that have deployed Podman successfully with these restrictions, that would also be helpful.

For reference, the program I am testing with is an MPI ring program, and I use the following command in the job script to run it, similar to what is described in the tutorial blog post:

mpirun -np 4 podman -v --cgroup-manager=cgroupfs run --userns=keep-id --env-host -v /tmp/podman_mpi_tmp:/tmp/podman_mpi_tmp --net=host --pid=host --ipc=host localhost/centosmpi /home/mpi_ring

Steps to reproduce the issue:

Describe the results you received:

time="2020-12-02T17:27:19-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
time="2020-12-02T17:27:19-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
time="2020-12-02T17:27:19-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
time="2020-12-02T17:27:19-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
time="2020-12-02T17:27:19-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
time="2020-12-02T17:27:19-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
Error: chown /run/user/15377/containers/overlay-containers/6cd5784303d85eb01cfc931102243fb640fc2cb139e349cb395d2040641d3ef9/userdata: invalid argument
Error: chown /run/user/15377/containers/overlay-containers/ec5cb0449b5bb835803a18b93c9205767bc96e153c07e9749cdb277d4c109fe2/userdata: invalid argument
Error: chown /run/user/15377/containers/overlay-containers/c690c88eda9a75076e4371a9d46b0e181ec4ad0775c46e4fe11e8d44f1d7666d/userdata: invalid argument
time="2020-12-02T17:27:19-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[28556,1],0]
Exit code: 125
--------------------------------------------------------------------------

Describe the results you expected:
Proper output of the mpi ring program with four processes

Process 1 received token -1 from process 0
Process 2 received token -1 from process 1
Process 3 received token -1 from process 2
Process 0 received token -1 from process 3

Additional information you deem important (e.g. issue happens only occasionally):

We are setting ignore_chown_errors to true in the storage.conf

# storage.conf
[storage]
driver = "overlay"
graphroot = "/tmp/subil-containers"
#rootless_storage_path = "$HOME/.local/share/containers/storage"
rootless_storage_path = "/tmp/subil-containers-storage"

[storage.options]
additionalimagestores = [
]

[storage.options.overlay]
ignore_chown_errors = "true"
mount_program = "/usr/bin/fuse-overlayfs"
mountopt = "nodev,metacopy=on"

[storage.options.thinpool]

The Dockerfile for the localhost/centosmpi image:

FROM centos:8

RUN yum -y install openmpi-devel
ENV PATH="/usr/lib64/openmpi/bin:$PATH"
ENV LD_LIBRARY_PATH="/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH"


COPY mpi_ring /home/mpi_ring

Output of podman version:

Version: 2.0.2
API Version: 1
Go Version: go1.13.4
Built: Wed Dec 31 19:00:00 1969
OS/Arch: linux/amd64

Output of podman info --debug:


host:
  arch: amd64
  buildahVersion: 1.15.0
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.19-1.el8.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.19, commit: 4726cba6f219c7479c28f0687868bd2ffe894869'
  cpus: 32
  distribution:
    distribution: '"rhel"'
    version: "8.1"
  eventLogger: file
  hostname: andes-login1
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 27008
      size: 1
    uidmap:
    - container_id: 0
      host_id: 15377
      size: 1
  kernel: 4.18.0-147.8.1.el8_1.x86_64
  linkmode: dynamic
  memFree: 235466366976
  memTotal: 270055858176
  ociRuntime:
    name: crun
    package: crun-0.14.1-1.el8.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.14.1
      commit: 88886aef25302adfd40a9335372bbc2b970c8ae5
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/15377/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.4-1.el8.x86_64
    version: |-
      slirp4netns version 1.1.4
      commit: b66ffa8e262507e37fca689822d23430f3357fe8
      libslirp: 4.2.0
      SLIRP_CONFIG_VERSION_MAX: 2
  swapFree: 0
  swapTotal: 0
  uptime: 800h 24m 25.67s (Approximately 33.33 days)
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /autofs/nccs-svm1_home1/subil/.config/containers/storage.conf
  containerStore:
    number: 14
    paused: 0
    running: 0
    stopped: 14
  graphDriverName: overlay
  graphOptions:
    overlay.ignore_chown_errors: "true"
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.1.2-1.el8.x86_64
      Version: |-
        fusermount3 version: 3.2.1
        fuse-overlayfs: version 1.1.0
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /tmp/subil-containers-storage
  graphStatus:
    Backing Filesystem: tmpfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 5
  runRoot: /run/user/15377/containers
  volumePath: /tmp/subil-containers-storage/volumes
version:
  APIVersion: 1
  Built: 0
  BuiltTime: Wed Dec 31 19:00:00 1969
  GitCommit: ""
  GoVersion: go1.13.4
  OsArch: linux/amd64
  Version: 2.0.2

Package info (e.g. output of rpm -q podman or apt list podman):

podman-2.0.2-2.el8.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

No

Additional environment details (AWS, VirtualBox, physical, etc.):

Using the Andes HPC cluster at ORNL.

% cat /etc/*release
NAME="Red Hat Enterprise Linux"
VERSION="8.1 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.1"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.1 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.1:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.1
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.1"
Red Hat Enterprise Linux release 8.1 (Ootpa)
Red Hat Enterprise Linux release 8.1 (Ootpa)

Let me know if there is any additional information you need me to provide. I wrote what I thought was the most relevant information.

can you try without the --userns=keep-id option?

If you are running with only a single user ID available, you are kind of forced to run as root in the container (in theory the OCI runtime could support this case, but it does not yet).
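(For context on "a single user ID available": with no entries in /etc/subuid, the rootless user namespace contains only your own UID mapped to root. A quick way to see this is podman unshare; the host_id below is taken from the podman info output above:)

$ podman unshare cat /proc/self/uid_map
         0      15377          1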

So, without the --userns=keep-id option, I get this:

time="2020-12-08T10:38:00-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
time="2020-12-08T10:38:00-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
[andes3.olcf.ornl.gov:00687] PMIX ERROR: ERROR in file gds_dstore.c at line 1244
[andes3.olcf.ornl.gov:00687] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 1017
[andes3.olcf.ornl.gov:00687] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 2234
[andes3.olcf.ornl.gov:00604] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1758
[andes3.olcf.ornl.gov:00687] PMIX ERROR: INVALID-CREDENTIAL in file ptl_tcp.c at line 685
[andes3.olcf.ornl.gov:00687] OPAL ERROR: Unreachable in file ext2x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[andes3.olcf.ornl.gov:00687] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[andes3.olcf.ornl.gov:00697] PMIX ERROR: ERROR in file gds_dstore.c at line 1244
[andes3.olcf.ornl.gov:00697] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 1017
[andes3.olcf.ornl.gov:00697] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 2234
[andes3.olcf.ornl.gov:00604] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1758
[andes3.olcf.ornl.gov:00697] PMIX ERROR: INVALID-CREDENTIAL in file ptl_tcp.c at line 685
[andes3.olcf.ornl.gov:00697] OPAL ERROR: Unreachable in file ext2x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[andes3.olcf.ornl.gov:00697] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
time="2020-12-08T10:38:01-05:00" level=error msg="Error forwarding signal 18 to container 55ff6624300bb89bd974d3ce25c3db408b445ba618854e5f8b086b2c0d4593c9: can only kill running containers. 55ff6624300bb89bd974d3ce25c3db40
8b445ba618854e5f8b086b2c0d4593c9 is in state stopped: container state improper"
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31459,1],0]
  Exit code:    1
--------------------------------------------------------------------------

I am trying to see if there is a workaround for that. I've tried directly mounting the cluster's MPI libraries (along with other locations that the cluster's libraries depend on) and adding them to the path inside the container, to see if that would work. So something like:

 mpirun --mca pml ob1 -np 2 podman -v --cgroup-manager=cgroupfs run --env-host \
-v /sw:/sw \
-v /opt/mellanox:/opt/mellanox \
-v /usr/lib64:/usr/lib64 \
-v /tmp/podman_mpi_tmp:/tmp/podman_mpi_tmp --net=host --pid=host --ipc=host localhost/centosmpi /home/mpi_ring

But that just gave similar-looking but different errors:

time="2020-12-08T15:51:17-05:00" level=error msg="cannot find UID/GID for user subil: No subuid ranges found for user \"subil\" in /etc/subuid - check rootless mode in man pages."
[andes71.olcf.ornl.gov:03442] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[andes71.olcf.ornl.gov:03450] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[andes71.olcf.ornl.gov:03364] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1758
[andes71.olcf.ornl.gov:03442] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[andes71.olcf.ornl.gov:03364] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1758
[andes71.olcf.ornl.gov:03450] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[andes71.olcf.ornl.gov:03450] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[andes71.olcf.ornl.gov:03442] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
time="2020-12-08T15:51:17-05:00" level=error msg="Error forwarding signal 18 to container ab6f6b44029cceb9c302817aa9fd804c2210f88f26b8485cc897f1179a04b89f: can only kill running containers. ab6f6b44029cceb9c302817aa9fd804c2210f88f26b8485cc897f1179a04b89f is in state stopped: container state improper"
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[18830,1],0]

A colleague suggested compiling OpenMPI from source inside the container with the appropriate flags, because the problem could be an incompatibility with our Slurm scheduler. That is something I might try, to see if it makes a difference. Do you have any insight here?
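For the record, the kind of Dockerfile I have in mind for that experiment is roughly the following. This is only a sketch: the OpenMPI version and download URL are guesses, and the configure flags (e.g. --with-slurm, or pointing --with-pmix at whatever the cluster provides) would need to match our actual Slurm/PMIx setup.

FROM centos:8

# Build OpenMPI from source so it can be configured against the cluster's scheduler
RUN yum -y install gcc gcc-c++ make wget && \
    wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.5.tar.gz && \
    tar xzf openmpi-4.0.5.tar.gz && \
    cd openmpi-4.0.5 && \
    ./configure --prefix=/usr/local --with-slurm && \
    make -j"$(nproc)" install

ENV PATH="/usr/local/bin:$PATH"
ENV LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"

COPY mpi_ring /home/mpi_ring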

I don't know mpirun well enough to understand what is going on here.

Is there any way to enable more logging? What is /home/mpi_ring? Should it be changed since now it runs as root (inside the user namespace)?

I'm just guessing, maybe --allow-run-as-root is needed?

$ podman run --rm -ti localhost/openmpi mpirun --help | grep -B2 -A4 allow
We strongly suggest that you run mpirun as a non-root user.

You can override this protection by adding the --allow-run-as-root option
to the cmd line or by setting two environment variables in the following way:
the variable OMPI_ALLOW_RUN_AS_ROOT=1 to indicate the desire to override this
protection, and OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 to confirm the choice and
add one more layer of certainty that you want to do so.
$ 

I built localhost/openmpi from almost the same Dockerfile as in #8580 (comment)

FROM centos:8
RUN yum -y install openmpi-devel
ENV PATH="/usr/lib64/openmpi/bin:$PATH"
ENV LD_LIBRARY_PATH="/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH"

In the Singularity repo there is an example where --allow-run-as-root is used

$ cd ~/git-repos/singularity
$ git remote -v
origin	git@github.com:hpcng/singularity.git (fetch)
origin	git@github.com:hpcng/singularity.git (push)
$ grep -r mpirun .
./examples/legacy/2.2/contrib/centos7-ompi_master.def:    /usr/local/bin/mpirun --allow-run-as-root /usr/bin/mpi_ring
$

So if I understand it correctly, you'd like to run in an environment where only a single ID is available (i.e. no subuid/subgid) but would like to run the container as a non-root user.

Could you give #8693 and containers/crun#556 a try?

@eriksjolund I can verify that mpirun works fine inside the container with --allow-run-as-root, i.e. podman run localhost/centosmpi /bin/sh -c "mpirun -np 4 --allow-run-as-root /home/mpi_ring". But in that case, we are starting one container which runs four MPI processes inside itself, so the work stays isolated within that container. If I start multiple containers this way, each container is isolated from the others and they don't talk to each other. What I am trying to do is start multiple containers and have them all talk to each other with MPI, each container running a single MPI process, i.e. mpirun -np 4 podman --cgroup-manager=cgroupfs run --env-host -v /tmp/podman_mpi_tmp:/tmp/podman_mpi_tmp --userns=keep-id --net=host --pid=host --ipc=host localhost/centosmpi /home/mpi_ring. Here, mpirun starts 4 Podman containers (as opposed to the container starting mpirun inside itself). Each container runs 1 MPI process, and the processes in the different containers need to talk to each other. This is what the aforementioned tutorial achieves: multiple containers communicating with each other over MPI.

@giuseppe Your understanding is correct. I want to run containers as non root where I don't have subuid/subgid mappings, and be able to start them with mpirun and have them talk to each other over MPI. I can look at getting more logs, see if that provides me a clue. And thank you for the suggested links, I will look into those and let you know if I make progress.

I can tell you that I did test with subuid/subgid mappings present in the cluster, just to see if it would work, and it did, just like in the tutorial. The containers are able to talk to each other and the mpi_ring program runs (see below for mpi_ring explanation). So it looks like MPI needs the subuid/subgid mappings to allow MPI communications between the containers. I'm trying to find a way to achieve that same result without the subuid/subgid mappings.

To explain mpi_ring, that is just a sample MPI program (source code). In it, if you start it with N MPI processes, process 0 sends a message to process 1, which sends that message to process 2, which sends it to 3, and so on until process N-1 sends that message to 0. My goal is to be able to start N containers, one MPI process per container, and get that output where the message is being sent from one container to another (with the cluster restrictions that I have already talked about). There is nothing user or root specific in the program itself, so there is nothing to change there.
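For anyone following along, here is a minimal sketch of that kind of ring program (not necessarily the exact source linked above, just the same idea; it is compiled with mpicc, and the resulting mpi_ring binary is what gets COPYed into the image in the Dockerfile earlier):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, token;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        /* every rank except 0 waits for the token from its left neighbor */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", rank, token, rank - 1);
    } else {
        /* rank 0 starts the ring with token -1 */
        token = -1;
    }

    /* pass the token to the right neighbor; the last rank wraps around to 0 */
    MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* rank 0 finally receives the token back from the last rank */
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", rank, token, size - 1);
    }

    MPI_Finalize();
    return 0;
}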

I apologise for the somewhat erratic response time. I'm trying to do a lot of things at once 😅

@giuseppe will #8693 and containers/crun#556 be part of the next releases of Podman and crun? I don't know what your release cadence is so I'll avoid trying out building and installing from source if you were going to package it up soon anyway.

yes they will be part of the next release. It would be useful to get some early feedback though. If there is still something not working we can address it faster

The next release of Podman will not be until mid-January, at least. We plan on releasing Podman 3.0 at that time. crun could be released earlier, though.

Alright cool! I'll keep you updated. We've built it from source just fine and I'll be testing it soon. Thanks for the help thus far!

@giuseppe After building and installing Podman and crun with those new changes, my MPI example works! mpirun is able to start multiple containers and the MPI program is actually able to communicate across containers, even if the containers are on different nodes. I modified the call to include the --user and --uidmap flags like so

mpirun --mca pml ob1 --mca btl "tcp,self" -np 4 podman --cgroup-manager=cgroupfs run --env-host -v /tmp/podman_mpi_tmp:/tmp/podman_mpi_tmp --user=<my uid>:<my gid> --uidmap 0:0:1 --net=host --pid=host --ipc=host localhost/centosmpi /home/mpi_ring

And it produced the correct output without needing the subuid/subgid mappings. I don't know what kind of namespace wizardry you did, but it worked. Thank you so much! The mpi_ring program is a fairly simple example, so we will be testing with more complex workloads as well as trying to get proper scheduler integration. But this was a huge step forward!

Can you explain what you did in the changes you made? What are the changes doing, and what do the --user and --uidmap flags actually do? The help page only has very brief explanations. It would help me to understand what you did to make Podman and crun behave differently.

Now crun creates an additional inner user namespace when you provide a configuration like --uidmap 0:0:1 --user 1000:1000. After creating the user namespace you've asked for with --uidmap 0:0:1 and performing all the container configuration there, just before launching the container process, crun creates another user namespace where root is mapped to the user you've specified.

So what really happens is: UID on the host -> mapped to root in the first user namespace -> root mapped to the UID you've specified.
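To check the effect on your side (a rough sketch; any image with coreutils works, using localhost/centosmpi here): the UID inside the container should be the one given to --user, and a file written through a bind mount should show up on the host owned by your own user, because that UID maps to root in the outer namespace, which in turn maps to your host UID.

$ podman run --rm --uidmap 0:0:1 --user $(id -u):$(id -g) localhost/centosmpi id -u
# prints the UID you passed to --user, i.e. your host UID (15377 in the podman info above)
$ podman run --rm --uidmap 0:0:1 --user $(id -u):$(id -g) \
    -v /tmp/podman_mpi_tmp:/tmp/podman_mpi_tmp \
    localhost/centosmpi touch /tmp/podman_mpi_tmp/owned-by-me
$ ls -l /tmp/podman_mpi_tmp/owned-by-me
# the file should be owned by your own user on the host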

A friendly reminder that this issue had no activity for 30 days.

since the issue is solved, I am going to close it.

Please feel free to reopen if it still doesn't work

Sorry for the lack of response! I will update as we do more testing.

qhaas commented

I decided to revisit rootless podman without subuid/subgid.

Does podman 4.4 require subuid/subgid to be set, as implied by the documentation?
Rootless Podman requires the user running it to have a range of UIDs listed in the files /etc/subuid and /etc/subgid.
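(For reference, the kind of entry the documentation is talking about is one line per user in each file, in the form username:start:count; the range values below are just the common defaults:)

# /etc/subuid (and similarly /etc/subgid)
nqh:100000:65536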

Below is the default in a fresh (but updated) Rocky Linux 9.2 deployment, with the podman stack installed via dnf install container-tools and no subuid/subgid set for the user:

$ podman info
host:
  arch: amd64
  buildahVersion: 1.29.0
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-1.el9_2.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: e6cdc9a4d6319e039efa13e532c1e58b713c904d'
  cpuUtilization:
    idlePercent: 95.79
    systemPercent: 0.66
    userPercent: 3.54
  cpus: 2
  distribution:
    distribution: '"rocky"'
    version: "9.2"
  eventLogger: journald
  hostname: podman
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 100
      size: 1
    uidmap:
    - container_id: 0
      host_id: 16642
      size: 1
  kernel: 5.14.0-284.18.1.el9_2.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 3205722112
  memTotal: 4108505088
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.4-1.el9_2.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.4
      commit: 5a8fa99a5e41facba2eda4af12fa26313918805b
      rundir: /run/user/16642/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/16642/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-3.el9.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 0
  swapTotal: 0
  uptime: 0h 8m 53.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /home/nqh/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/nqh/.local/share/containers/storage
  graphRootAllocated: 30099963904
  graphRootUsed: 2264072192
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/user/16642/containers
  transientStore: false
  volumePath: /home/nqh/.local/share/containers/storage/volumes
version:
  APIVersion: 4.4.1
  Built: 1683632637
  BuiltTime: Tue May  9 11:43:57 2023
  GitCommit: ""
  GoVersion: go1.19.6
  Os: linux
  OsArch: linux/amd64
  Version: 4.4.1

A simple podman pull results in the following error, and no image is pulled:

$ podman pull docker.io/rockylinux:9
Trying to pull docker.io/library/rockylinux:9...
Getting image source signatures
Copying blob 1a5eb4db1701 done  
Error: copying system image from manifest list: writing blob: adding layer with blob "sha256:1a5eb4db170197a9e7a55f381fc96161adb457b9147b368ea5e8064b786165b8": processing tar file(potentially insufficient UIDs or GIDs available in user namespace (requested 0:5 for /usr/bin/write): Check /etc/subuid and /etc/subgid if configured locally and run podman-system-migrate: lchown /usr/bin/write: invalid argument): exit status 1

$ podman images
REPOSITORY TAG...

did you set ignore_chown_errors?

qhaas commented

Good call. Setting ignore_chown_errors = "true" in storage.conf did the trick! Looks like podman 4.4 now defaults to crun and to the other configuration settings needed to run a container rootless without subuid/subgid.
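In case it helps others, a minimal version of that in ~/.config/containers/storage.conf looks something like the following (roughly what I used; depending on your system defaults you may be able to omit the [storage] section):

# ~/.config/containers/storage.conf
[storage]
driver = "overlay"

[storage.options.overlay]
ignore_chown_errors = "true"

With that in place, the pull and run work: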

$ podman pull docker.io/rockylinux:9
Trying to pull docker.io/library/rockylinux:9...
Getting image source signatures
Copying blob 1a5eb4db1701 done  
Copying config eeea865f41 done  
Writing manifest to image destination
Storing signatures
eeea865f4111bd48e16801554f44adf2db2fa4cb87a98ff7470d6de6be49fc15
$ podman run --rm docker.io/rockylinux:9 cat /etc/redhat-release
Rocky Linux release 9.2 (Blue Onyx)

In my opinion, ignore_chown_errors needs to be mentioned in the rootless tutorial; it looks like it is already mentioned elsewhere, and the description embedded in the config file is excellent:

$ cat /etc/containers/storage.conf
...
[storage.options.overlay]
# ignore_chown_errors can be set to allow a non privileged user running with
# a single UID within a user namespace to run containers. The user can pull
# and use any image even those with multiple uids.  Note multiple UIDs will be
# squashed down to the default uid in the container.  These images will have no
# separation between the users in the container. Only supported for the overlay
# and vfs drivers.
...