cloudfoundry/guardian

gdn fail with runc error in ubuntu 2204 lts

Closed this issue · 10 comments

Description

When running the Concourse binary (using gdn for containerization) on a Google Cloud VM with the ubuntu-2204-lts image family as the OS image, we see errors like the following:

Aug 25 21:56:12 smoke-splendid-earwig concourse[4460]: {"timestamp":"2022-08-25T21:56:12.809930620Z","level":"error","source":"guardian","message":"guardian.create.containerizer-create.runtime-create-failed","data":{"error":"runc run: exit status 1: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting \"cgroup\" to rootfs at \"/sys/fs/cgroup\" caused: invalid argument","handle":"a17876d5-647e-492d-6ae2-311b1a56d718","session":"40.3"}}

For comparison, when running Concourse locally with Docker Compose, we don't see the error. The OS image is the same as on the GCP VM:

root@c29ddbf435bd:/src# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

but its kernel is 5.10.47-linuxkit.

Also, when running Concourse with the containerd runtime, which uses runc v1.1.4 directly, we don't see the error on either local Docker or the GCP VM.

Maybe this is related to the older runc currently used in guardian, which might not work well with the newer kernel in Ubuntu Jammy Jellyfish?

  • Guardian release version: 1.22
  • Linux kernel version: 5.15.0-1016-gcp
  • Concourse version: latest dev
  • Go version: 1.19

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

This issue is being worked on under the Garden-runc-release/#233 issue

dtimm commented

It looks like this is the same issue that other container runtimes have had with Jammy: containers/podman#12559.

Jammy uses cgroupv2 in the kernel, and it delegates cgroup authority to sub-processes (like the container runtime) as cgroupv2. runc supports cgroupv2 as of the v1.0.0 release, but gdn also alters cgroups directly using the old v1 schema:

return filepath.Join(cgroupsMountpoint, "cpu", cpuCgroupSubPath["cpu"], gardenCgroup), nil

This will require some substantial changes in how cgroups are managed in guardian in order to support new distributions that have switched to cgroupv2.

Some updates:

Concourse with the latest gdn runs successfully on an image with cgroups v1 enabled, based on the gcloud image family ubuntu-2204-lts.

Hi @xtremerui ,
Is this issue still outstanding for you or did the newer image resolve it for you?

@MarcPaquette the image with cgroups v1 enabled works for us. We are still hoping gdn will work on an image where only cgroups v2 is available.

@xtremerui Our team is starting to scope the work to use cgroups v2 only. We'll keep you updated as that work starts to get done.

@dsabeti this is great news! Thank you and the team.

Looking into this, we'd need to get a new stemcell built to allow the usage of cgroup v2. Currently the bosh stemcell builder is forcing us to use v1: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/blob/57cd1eb14ddebd9666f15e83ecfa18f31350d45f/stemcell_builder/stages/image_install_grub/apply.sh#L89

I'm working on discussing this with Product Management.
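For anyone wanting to check which hierarchy a given image was booted with, a small Go sketch follows. It assumes the image selects the hierarchy through the systemd kernel parameter `systemd.unified_cgroup_hierarchy`; whether the stemcell builder line linked above uses exactly this parameter is an assumption, since the thread does not quote it. `unifiedHierarchySetting` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// unifiedHierarchySetting scans a kernel command line for the systemd
// parameter systemd.unified_cgroup_hierarchy and returns its value.
// found is false when the parameter is absent. A bare key with no
// "=value" yields an empty value. If the key appears more than once,
// the last occurrence wins.
func unifiedHierarchySetting(cmdline string) (value string, found bool) {
	const key = "systemd.unified_cgroup_hierarchy"
	for _, field := range strings.Fields(cmdline) {
		name, val, _ := strings.Cut(field, "=")
		if name == key {
			value, found = val, true
		}
	}
	return value, found
}

func main() {
	data, err := os.ReadFile("/proc/cmdline")
	if err != nil {
		fmt.Println("could not read kernel command line:", err)
		return
	}
	if v, ok := unifiedHierarchySetting(string(data)); ok {
		fmt.Printf("systemd.unified_cgroup_hierarchy=%s\n", v)
	} else {
		fmt.Println("systemd.unified_cgroup_hierarchy not set; systemd uses its built-in default")
	}
}
```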

I'm going to close out this issue, as it's a known issue and we have future plans to resolve it. We're waiting on the Stemcell builds that enable this feature by default.