replicatedhq/kots

Postgres fails to start when Kubernetes doesn't chown volume upon mount

gabegorelick opened this issue · 3 comments

If the Postgres data directory is not owned by the same user ID as the postgres process (currently 999 [1]), kotsadm-postgres-0 will crash with the following error:

FATAL: data directory "/var/lib/postgresql/data/pgdata" has wrong ownership
HINT: The server must be started by the user that owns the data directory.
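
You can confirm this is the failure by checking the pod's logs (add -n <namespace> if kotsadm isn't installed in your current namespace):

kubectl logs kotsadm-postgres-0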

With persistent volumes backed by most block storage devices, Kubernetes will recursively call chown and chmod on the mounted files and directories inside the volume, so the Postgres data directory will have the correct owner. But Kubernetes will not call chown/chmod for all volume types. The details are best summarized in this blog post:

Traditionally if your pod is running as a non-root user (which you should), you must specify a fsGroup inside the pod’s security context so that the volume can be readable and writable by the Pod.
...
But one side-effect of setting fsGroup is that, each time a volume is mounted, Kubernetes must recursively chown() and chmod() all the files and directories inside the volume - with a few exceptions noted below.
...
For certain multi-writer volume types, such as NFS or Gluster, the cluster doesn’t perform recursive permission changes even if the pod has a fsGroup. Other volume types may not even support chown()/chmod(), which rely on Unix-style permission control primitives.

All this means that if you use something like NFS for your persistent volumes, Kubernetes won't [2] call chown when the volume is mounted. Thus, Postgres will fail to start.
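
For reference, the fsGroup the post is talking about lives in the pod's securityContext. A minimal sketch (the pod and claim names here are just illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  securityContext:
    runAsUser: 999
    # On supported volume types, Kubernetes recursively chowns/chmods the
    # mounted volume to this GID so a non-root container can write to it.
    fsGroup: 999
  containers:
    - name: postgres
      image: postgres
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-pvc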

The traditional solution to this is to chown the relevant directories before Postgres starts, e.g. via a custom entrypoint or some other process that mounts the volume and runs chown ahead of time. But to do this, you'd have to be able to edit kotsadm's Postgres StatefulSet, and kots doesn't really make it possible to edit its resources before installing. Instead, all you can do is patch the StatefulSet manually after kotsadm is installed.
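
For illustration, the traditional fix usually looks something like an init container that runs as root and chowns the data directory before Postgres starts (the volume name below is a placeholder and has to match the StatefulSet's volumeClaimTemplate):

spec:
  template:
    spec:
      initContainers:
        - name: fix-data-ownership
          image: busybox
          # chown to the postgres UID/GID (999) before the main container starts
          command: ['sh', '-c', 'chown -R 999:999 /var/lib/postgresql/data']
          volumeMounts:
            - name: kotsadm-postgres # placeholder: must match the volumeClaimTemplate name
              mountPath: /var/lib/postgresql/data

Note that this doesn't help on NFS shares with root squashing enabled, since even root can't chown there; see below.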

A better solution may be for kots to chown the Postgres data directory itself on startup.

[1]

RunAsUser: util.IntPointer(999),
FSGroup: util.IntPointer(999),

[2] At some point, Kubernetes will stabilize the fsGroupPolicy interface and CSI drivers can opt in to this behavior explicitly. But even then, enabling that on the default storage class just so kotsadm's Postgres works probably isn't the best solution.
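
For reference, once that's stable, a CSI driver would declare the behavior on its CSIDriver object, roughly like this (depending on your Kubernetes version the field may still be behind a feature gate and an older API version):

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.vendor.io
spec:
  # "File" tells Kubernetes to always recursively change ownership/permissions
  # to match the pod's fsGroup when the volume is mounted.
  fsGroupPolicy: File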

Here's what seems to be working for me. After installing kotsadm, wait for kotsadm-postgres to fail. Then, patch it (kubectl patch statefulset kotsadm-postgres) with the following patch. EDIT: updated to work across pod restarts.

spec:
  template:
    spec:
      securityContext:
        # Run as root so we can create user to match NFS UID. We will drop permissions later on.
        fsGroup: 0
        runAsUser: 0
      containers:
        - name: kotsadm-postgres
          command: ['/bin/bash']
          args:
            - '-x'
            - '-c'
            - |
              # This should match the container's mountPath
              mountpath=/var/lib/postgresql/data

              # Path to file that tells us whether DB has already been initialized
              kotsadm_initialized="$mountpath/kotsadm_initialized"

              # Grab group ID and user ID of the mounted volume.
              # Many storage implementations will generate these every time the volume is mounted.
              gid="$(stat -c '%g' "$mountpath")"
              uid="$(stat -c '%u' "$mountpath")"

              # User to run postgres as. When restarting this pod, a pguser account may already exist.
              pguser="pgdataowner$uid"

              if ! id "$pguser" &> /dev/null; then
                echo "Adding user $pguser as $uid:$gid"
                groupadd --system --gid "$gid" "$pguser"
                useradd --system --uid "$uid" -g "$pguser" --shell /usr/sbin/nologin "$pguser"
              fi

              if [ ! -e "$kotsadm_initialized" ]; then
                # Delete half-initialized data from the last time this container ran initdb.
                # We want docker-entrypoint.sh to rerun initdb with the correct owner.
                rm -rf "$mountpath"/*

                # Don't delete data next time this pod restarts.
                touch "$kotsadm_initialized"
              fi

              # Run regular entrypoint as our custom user,
              # with extra logging so that we can confirm it's initialized everything correctly
              gosu "$pguser:$pguser" bash -x docker-entrypoint.sh postgres
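
In case it's useful, applying the patch looks something like this, assuming it's saved as kotsadm-postgres-patch.yaml (the filename is arbitrary) and kotsadm is in your current namespace:

kubectl patch statefulset kotsadm-postgres --patch "$(cat kotsadm-postgres-patch.yaml)"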

This effectively forces postgres to run as the UID that owns the mount directory. While chowning the mount directory to the postgres UID (999) would be simpler, that doesn't work on NFS shares that have root squashing enabled.

Once the StatefulSet is patched, you'll then have to delete the existing kotsadm-postgres-0 pod due to how forced rollback works.
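
For example, assuming the default pod name:

kubectl delete pod kotsadm-postgres-0

The StatefulSet controller will recreate the pod with the patched spec.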

After kotsadm-postgres is online, the kotsadm-migrations pod that was failing during all of this should finally finish successfully, although I'm not sure if there are timeouts that would prevent that.

One last note: it looks like there may be plans to change the postgres container to alpine, which would probably require tweaking this patch. That said, since we're already patching the StatefulSet, we can always substitute in whatever Docker image we want.

Here's the relevant PG code that enforces the ownership check. It seems like there's no way to relax this in PG.
https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/backend/utils/init/miscinit.c#L331-L347

FYI, #1695, which is included in v1.37.0+, breaks the above workaround since the alpine container doesn't include things like useradd.

EDIT: More importantly, it's now using a read-only volume for /etc/passwd, so you can't add users.