k3s-io/helm-controller

IPv6 cluster bootstrap failure

abasitt opened this issue · 7 comments

Can someone please explain how the KUBERNETES_SERVICE_HOST value works? When I try to bootstrap an IPv6 cluster, I get the error below.

kubectl logs helm-install-cilium-jd8pw -n kube-system
if [[ ${KUBERNETES_SERVICE_HOST} =~ .:. ]]; then
echo "KUBERNETES_SERVICE_HOST is using IPv6"
CHART="${CHART//%{KUBERNETES_API}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
CHART="${CHART//%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x

+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/cilium.tgz.base64
+ CHART_PATH=/tmp/cilium.tgz
+ [[ ! -f /chart/cilium.tgz.base64 ]]
+ return
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https?://'
+ [[ helm_v3 == \h\e\l\m_\v\3 ]]
+ [[ cilium/cilium == stable/* ]]
+ [[ -n https://helm.cilium.io/ ]]
+ [[ -f /auth/username ]]
+ helm_v3 repo add cilium https://helm.cilium.io/
"cilium" has been added to your repositories
+ helm_v3 repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "cilium" chart repository
Update Complete. ⎈Happy Helming!⎈
+ helm_update install --namespace kube-system --version 1.14.5
+ [[ helm_v3 == \h\e\l\m_\v\3 ]]
++ helm_v3 ls --all -f '^cilium$' --namespace kube-system --output json
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
Error: Kubernetes cluster unreachable: Get "https://127.0.0.1:6444/version": dial tcp 127.0.0.1:6444: connect: connection refused
+ LINE=
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-01_HelmChart.yaml'
+ [[ install = \d\e\l\e\t\e ]]
+ [[ '' =~ ^(|null)$ ]]
+ [[ '' =~ ^(|null)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --namespace kube-system --version 1.14.5 cilium cilium/cilium --values /config/values-01_HelmChart.yaml
Error: INSTALLATION FAILED: Kubernetes cluster unreachable: Get "https://127.0.0.1:6444/version": dial tcp 127.0.0.1:6444: connect: connection refused

I believe it should be trying ::1 instead, since that is where k3s is actually listening:

ss -nlpt | grep 6444
LISTEN 0 4096 [::1]:6444 [::]:* users:(("k3s-server",pid=4874,fd=16))
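For illustration, hitting both loopbacks from the node should show the mismatch (expected behaviour, not actual captured output):

curl -k https://127.0.0.1:6444/version   # connection refused, same as the helm job sees
curl -k https://[::1]:6444/version       # connects, since that is where k3s is bound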

It seems like KUBERNETES_SERVICE_HOST is hard-coded here? Is there a way I can tell the helm controller to use ::1 instead of 127.0.0.1?

BTW, it works when I try a dual-stack cluster with a combination of {ipv4},{ipv6}, but not the other way around.
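For what it's worth, the IPv6 handling in the entrypoint snippet above looks correct on its own: if KUBERNETES_SERVICE_HOST were an IPv6 address, the substitution would produce a properly bracketed URL. A quick local sketch (the chart URL is just a made-up placeholder):

CHART='https://%{KUBERNETES_API}%/example/chart.tgz'
KUBERNETES_SERVICE_HOST='::1'
KUBERNETES_SERVICE_PORT='6444'
if [[ ${KUBERNETES_SERVICE_HOST} =~ .:. ]]; then
  CHART="${CHART//%{KUBERNETES_API}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
fi
echo "${CHART}"   # https://[::1]:6444/example/chart.tgz

So the problem seems to be purely the 127.0.0.1 value being injected, not the script.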

KUBERNETES_SERVICE_HOST is set by Kubernetes itself, to the address of the kubernetes service: kubectl get service -n default kubernetes -o wide

See the upstream docs for more information on the service discovery environment variables:
https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables
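For a regular (non-bootstrap) pod, those variables end up looking something like this (illustrative values; on your cluster the host would be the IPv6 ClusterIP reported by the command above):

kubectl exec -n kube-system <some-pod> -- env | grep KUBERNETES_SERVICE
KUBERNETES_SERVICE_HOST=10.43.0.1
KUBERNETES_SERVICE_PORT=443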

Are you trying to use the helm controller to deploy cilium on k3s? Can you provide the actual content of the HelmChart resource that you are deploying?

Thank you @brandond.

kubectl get svc -owide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kubernetes ClusterIP 2001:470:ee86:4:10:43:0:1 443/TCP 10h

I don't see the helm controller picking up the correct address then. KUBERNETES_SERVICE_HOST is somehow auto-filled and it's always 127.0.0.1, which is why I was wondering if this value is hard-coded.

Yes, I am trying to use the helm controller to deploy cilium. The chart is here.

The oyaml of the pod that is stuck in the CrashLoopBackOff.

kubectl get pod helm-install-cilium-jd8pw -n kube-system -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    helmcharts.helm.cattle.io/configHash: SHA256=58B430BF3EAC3F550E3E698901F5FE90E5813E65687253155DC57F03E6909F55
  creationTimestamp: "2024-01-19T13:38:57Z"
  finalizers:
  - batch.kubernetes.io/job-tracking
  generateName: helm-install-cilium-
  labels:
    batch.kubernetes.io/controller-uid: 9bb57e45-b3c6-4c21-b5dd-29b61b3f0118
    batch.kubernetes.io/job-name: helm-install-cilium
    controller-uid: 9bb57e45-b3c6-4c21-b5dd-29b61b3f0118
    helmcharts.helm.cattle.io/chart: cilium
    job-name: helm-install-cilium
  name: helm-install-cilium-jd8pw
  namespace: kube-system
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: helm-install-cilium
    uid: 9bb57e45-b3c6-4c21-b5dd-29b61b3f0118
  resourceVersion: "17775"
  uid: cd31640c-7bca-48b6-8e76-cac1e4a2a2f2
spec:
  containers:
  - args:
    - install
    - --namespace
    - kube-system
    - --version
    - 1.14.5
    env:
    - name: NAME
      value: cilium
    - name: VERSION
      value: 1.14.5
    - name: REPO
      value: https://helm.cilium.io/
    - name: HELM_DRIVER
      value: secret
    - name: CHART_NAMESPACE
      value: kube-system
    - name: CHART
      value: cilium/cilium
    - name: HELM_VERSION
    - name: TARGET_NAMESPACE
      value: kube-system
    - name: AUTH_PASS_CREDENTIALS
      value: "false"
    - name: KUBERNETES_SERVICE_HOST
      value: 127.0.0.1
    - name: KUBERNETES_SERVICE_PORT
      value: "6444"
    - name: BOOTSTRAP
      value: "true"
    - name: NO_PROXY
      value: .svc,.cluster.local,2001:470:ee86:1000::/56,10.42.0.0/16,2001:470:ee86:4:10:43::/112,10.43.0.0/16
    - name: FAILURE_POLICY
      value: reinstall
    image: rancher/klipper-helm:v0.8.2-build20230815
    imagePullPolicy: IfNotPresent
    name: helm
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /config
      name: values
    - mountPath: /chart
      name: content
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-6mhth
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeName: k3s-m1
  nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: "true"
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: helm-cilium
  serviceAccountName: helm-cilium
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    operator: Equal
    value: "true"
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 20
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 20
  volumes:
  - name: values
    secret:
      defaultMode: 420
      secretName: chart-values-cilium
  - configMap:
      defaultMode: 420
      name: chart-content-cilium
    name: content
  - name: kube-api-access-6mhth
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-01-19T13:38:57Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:16:58Z"
    message: 'containers with unready status: [helm]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-01-20T00:16:58Z"
    message: 'containers with unready status: [helm]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-01-19T13:38:57Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://73c16c515f12a8ac89af117e4766478576bca73e4540216b0956820273c6b6e6
    image: docker.io/rancher/klipper-helm:v0.8.2-build20230815
    imageID: docker.io/rancher/klipper-helm@sha256:b0b0c4f73f2391697edb52adffe4fc490de1c8590606024515bb906b2813554a
    lastState:
      terminated:
        containerID: containerd://73c16c515f12a8ac89af117e4766478576bca73e4540216b0956820273c6b6e6
        exitCode: 1
        finishedAt: "2024-01-20T00:16:58Z"
        message: |
          Installing helm_v3 chart
        reason: Error
        startedAt: "2024-01-20T00:16:56Z"
    name: helm
    ready: false
    restartCount: 129
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=helm pod=helm-install-cilium-jd8pw_kube-system(cd31640c-7bca-48b6-8e76-cac1e4a2a2f2)
        reason: CrashLoopBackOff
  hostIP: 2001:470:ee86:30:192:168:30:21
  phase: Running
  podIP: 2001:470:ee86:30:192:168:30:21
  podIPs:
  - ip: 2001:470:ee86:30:192:168:30:21
  qosClass: BestEffort
  startTime: "2024-01-19T13:38:57Z"

For bootstrap charts, KUBERNETES_SERVICE_HOST is filled in by the helm controller, and always points at the IPv4 loopback address:

Name: "KUBERNETES_SERVICE_HOST",
Value: "127.0.0.1"},

I know that this works on ipv6-only nodes, since the apiserver binds to a dual-stack wildcard address that is accessible via the IPv4 loopback even when the node does not have a valid IPv4 interface address.

Does your node for some reason not have an IPv4 loopback configured, or have you done something else to modify the apiserver configuration? Are you using the helm controller as part of k3s, rke2, or on some other cluster type? The helm controller is almost exclusively used with k3s and rke2, if you are running it standalone on some other cluster, the assumptions that it makes about apiserver availability when installing bootstrap charts may not be valid.

The oyaml of the pod that is stuck in the CrashLoopBackOff.

No, I was asking for the yaml of the HelmChart resource that you are using to deploy cilium.
I do see this though, which indicates that you are trying to install this as a bootstrap chart, which requires execution on a control-plane node with the apiserver available on port 6443:

    - name: BOOTSTRAP
      value: "true"

I do have both IPv4 and IPv6 addresses on the node. As mentioned earlier, dual-stack works with a combination of {ipv4},{ipv6}.
I didn't try IPv6 single-stack, but it's interesting to hear that it works.
I tried a few things, and adding bind-address and advertise-address solved half the problem. The difference I see is that the listener is now on both IPv4 and IPv6:
ss -nltp | grep 6444
LISTEN 0 4096 *:6444 *:* users:(("k3s-server",pid=6615,fd=19))
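For anyone else hitting this, the relevant k3s server options are bind-address and advertise-address, e.g. in /etc/rancher/k3s/config.yaml (example values only, adjust for your node):

bind-address: '::'
advertise-address: 2001:470:ee86:30:192:168:30:21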

And my master node status:

kubectl get nodes -o wide
NAME     STATUS   ROLES                  AGE   VERSION        INTERNAL-IP                      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k3s-m1   Ready    control-plane,master   24m   v1.28.5+k3s1   2001:470:ee86:30:192:168:30:21   <none>        Ubuntu 22.04.3 LTS   5.15.0-91-generic   containerd://1.7.11-k3s2

The other half of the problem I am facing is with the agent. I don't think it is related to this issue, but I would love to hear some suggestions.

systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2024-01-20 19:09:34 +08; 4s ago
       Docs: https://k3s.io
    Process: 3947 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, sta>
    Process: 3949 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 3950 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 3951 (k3s-agent)
      Tasks: 9
     Memory: 21.5M
        CPU: 113ms
     CGroup: /system.slice/k3s.service
             └─3951 "/usr/local/bin/k3s agent" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">

Jan 20 19:09:34 k3s-w1 systemd[1]: Starting Lightweight Kubernetes...
Jan 20 19:09:34 k3s-w1 sh[3947]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Jan 20 19:09:34 k3s-w1 sh[3948]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jan 20 19:09:34 k3s-w1 systemd[1]: Started Lightweight Kubernetes.
Jan 20 19:09:34 k3s-w1 k3s[3951]: time="2024-01-20T19:09:34+08:00" level=info msg="Starting k3s agent v1.28.5+k3s1 (5b2d127>
Jan 20 19:09:34 k3s-w1 k3s[3951]: time="2024-01-20T19:09:34+08:00" level=info msg="Adding server to load balancer k3s-agent>
Jan 20 19:09:34 k3s-w1 k3s[3951]: time="2024-01-20T19:09:34+08:00" level=info msg="Running load balancer k3s-agent-load-bal>
Jan 20 19:09:34 k3s-w1 k3s[3951]: time="2024-01-20T19:09:34+08:00" level=error msg="failed to get CA certs: Get \"https://1>
Jan 20 19:09:36 k3s-w1 k3s[3951]: time="2024-01-20T19:09:36+08:00" level=error msg="failed to get CA certs: Get \"https://1>
Jan 20 19:09:38 k3s-w1 k3s[3951]: time="2024-01-20T19:09:38+08:00" level=error msg="failed to get CA certs: Get \"https://1>

:00" level=info msg="Starting k3s agent v1.28.5+k3s1 (5b2d1271)" :00" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 2001:470:ee86:30:192:168:30:21:6443" :00" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [2001:470:ee86:30:192:168:30:21:6443] >:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF" :00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:59438->127.0.0.1:6>:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"

The listener on the agent is IPv4-only instead of IPv4/IPv6 :( I believe it should be the same as on the server.
ss -nltp | grep 6444
LISTEN 0 4096 127.0.0.1:6444 0.0.0.0:* users:(("k3s-agent",pid=3951,fd=7))

My agent config.yaml is below.

cat /etc/rancher/k3s/config.yaml

kubelet-arg:
- image-gc-high-threshold=55
- image-gc-low-threshold=50
- 'node-ip=::'
node-ip: 2001:470:ee86:30:192:168:30:22,192.168.30.22
pause-image: registry.k8s.io/pause:3.9

@brandond thank you so much for all the inputs. I think I figured out what the issue was with the agent: it was missing [] around the IPv6 address in the server URL. After adding them, it's working.
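i.e. the agent's server URL needs the IPv6 address wrapped in brackets, along the lines of:

server: https://[2001:470:ee86:30:192:168:30:21]:6443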

kubectl get nodes -o wide
NAME     STATUS   ROLES                  AGE     VERSION        INTERNAL-IP                      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k3s-m1   Ready    control-plane,master   4m3s    v1.28.5+k3s1   2001:470:ee86:30:192:168:30:21   <none>        Ubuntu 22.04.3 LTS   5.15.0-91-generic   containerd://1.7.11-k3s2
k3s-w1   Ready    <none>                 3m32s   v1.28.5+k3s1   2001:470:ee86:30:192:168:30:22   <none>        Ubuntu 22.04.3 LTS   5.15.0-91-generic   containerd://1.7.11-k3s2