kubernetes-sigs/image-builder

Race between kubeadm and containerd service in Flatcar

kopiczko opened this issue · 18 comments

What steps did you take and what happened:
When booting a cluster with Flatcar Linux (observed in CAPO), there can be a race between kubeadm.service and containerd.service:

ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

What did you expect to happen:
containerd running before kubeadm starts.

Anything else you would like to add:
Adding this to KubeadmConfigTemplate fixes it:

      format: ignition
      ignition:
        containerLinuxConfig:
          additionalConfig: |
            systemd:
              units:
              - name: kubeadm.service
                enabled: true
                dropins:
                - name: 10-flatcar.conf
                  contents: |
                    [Unit]
                    Requires=containerd.service
                    After=containerd.service
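
For context on why this works: Requires= makes kubeadm.service pull in containerd.service, and After= orders it to start only once containerd has started, which removes the race. On a booted node the drop-in should render to roughly the following file; the path is an assumption based on the usual Ignition convention of writing unit drop-ins under /etc/systemd/system/<unit>.d/:

# /etc/systemd/system/kubeadm.service.d/10-flatcar.conf (assumed path)
[Unit]
Requires=containerd.service
After=containerd.service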

Environment:

Project (Image Builder for Cluster API, kube-deploy/imagebuilder, konfigadm):

Additional info for Image Builder for Cluster API related issues:

  • OS (e.g. from /etc/os-release, or cmd /c ver):
  • Packer Version:
  • Packer Provider:
  • Ansible Version:
  • Cluster-api version (if using):
  • Kubernetes version (use kubectl version):

/kind bug

/cc @invidian @pothos @dongsupark @johananl

This is fixed in the latest Flatcar release where containerd is enabled by default: https://www.flatcar.org/releases#release-3227.2.0

Edit: I thought it was only about enabling the unit, but it seems it's also about the After= ordering.

I recall we've been discussing this with @jepio somewhere. Let me dig that up.

Hmm, I can't find it, but the point was that kubeadm.service is provided by CABPK, while the image-builder CAPI images default to containerd. So I believe image-builder should make no assumptions about which bootstrap provider will be used, and CABPK should make no assumptions that containerd will be used. The binding mentioned in the issue as a workaround should instead be defined in the cluster templates (see the sketch below), where the image reference is bound to the selected bootstrap provider. That is already done in the platform templates, e.g. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/templates/cluster-template-flatcar.yaml and in kubernetes-sigs/cluster-api-provider-azure#1729.
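
To make that concrete, here is a minimal sketch of where such a binding could live in a platform cluster template. The resource name is hypothetical, the apiVersion depends on the CAPI release, and the actual templates may structure this differently:

apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: example-flatcar-md-0   # hypothetical name
spec:
  template:
    spec:
      format: ignition
      ignition:
        containerLinuxConfig:
          additionalConfig: |
            systemd:
              units:
              - name: kubeadm.service
                enabled: true
                dropins:
                - name: 10-flatcar.conf
                  contents: |
                    [Unit]
                    Requires=containerd.service
                    After=containerd.service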

This fix no longer works.

The error is:
unable to translate from 2.x to 3.x config has duplicate unit name kubeadm.service

ignition:
  containerLinuxConfig:
    additionalConfig: |
      storage:
        links:
        - path: /etc/systemd/system/kubeadm.service.wants/containerd.service
          target: /usr/lib/systemd/system/containerd.service

This worked.
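
Presumably this avoids the translation error because it does not redefine kubeadm.service; the symlink under the .wants directory effectively gives kubeadm.service a Wants= dependency on containerd. A sketch of how it could sit in the same KubeadmConfigTemplate fields as the earlier snippet (placement assumed, not taken from the comment above):

      format: ignition
      ignition:
        containerLinuxConfig:
          additionalConfig: |
            storage:
              links:
              - path: /etc/systemd/system/kubeadm.service.wants/containerd.service
                target: /usr/lib/systemd/system/containerd.service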

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

/remove-lifecycle rotten

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

I'll take this. Will try to reproduce using current Flatcar and CAPI versions and check if a fix is needed.
/assign