IPv6 cluster bootstrap failure
abasitt opened this issue · 7 comments
Can someone please explain how the KUBERNETES_SERVICE_HOST value works? When I try to bootstrap an IPv6 cluster, I get the error below.
kubectl logs helm-install-cilium-jd8pw -n kube-system
if [[ ${KUBERNETES_SERVICE_HOST} =~ .:. ]]; then
echo "KUBERNETES_SERVICE_HOST is using IPv6"
CHART="${CHART//%{KUBERNETES_API}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
CHART="${CHART//%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi
+ set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/cilium.tgz.base64
+ CHART_PATH=/tmp/cilium.tgz
+ [[ ! -f /chart/cilium.tgz.base64 ]]
+ return
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m_\v\3 ]]
+ [[ cilium/cilium == stable/* ]]
+ [[ -n https://helm.cilium.io/ ]]
+ [[ -f /auth/username ]]
+ helm_v3 repo add cilium https://helm.cilium.io/
"cilium" has been added to your repositories
+ helm_v3 repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "cilium" chart repository
Update Complete. ⎈Happy Helming!⎈
+ helm_update install --namespace kube-system --version 1.14.5
+ [[ helm_v3 == \h\e\l\m_\v\3 ]]
++ helm_v3 ls --all -f '^cilium$' --namespace kube-system --output json
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
Error: Kubernetes cluster unreachable: Get "https://127.0.0.1:6444/version": dial tcp 127.0.0.1:6444: connect: connection refused
+ LINE=
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-01_HelmChart.yaml'
+ [[ install = \d\e\l\e\t\e ]]
+ [[ '' =~ ^(|null)$ ]]
+ [[ '' =~ ^(|null)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --namespace kube-system --version 1.14.5 cilium cilium/cilium --values /config/values-01_HelmChart.yaml
Error: INSTALLATION FAILED: Kubernetes cluster unreachable: Get "https://127.0.0.1:6444/version": dial tcp 127.0.0.1:6444: connect: connection refused
I believe it should be trying to connect to ::1 instead.
ss -nlpt | grep 6444
LISTEN 0 4096 [::1]:6444 [::]:* users:(("k3s-server",pid=4874,fd=16))
It seems like KUBERNETES_SERVICE_HOST is hard-coded here? Is there a way I can tell the helm controller to use ::1 instead of 127.0.0.1?
BTW, it works when I try a dual-stack cluster with the combination {ipv4},{ipv6}, but not the other way around.
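For context, the placeholder substitution from the install script quoted at the top can be sketched as a standalone function. This is a minimal sketch, not the actual klipper-helm script; the chart URL and the function name are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch of the substitution logic above: replace the %{KUBERNETES_API}%
# placeholder, bracketing the host whenever it contains a colon
# (i.e. looks like an IPv6 literal).
substitute_api() {
  local chart="$1" host="$2" port="$3"
  if [[ ${host} =~ .:. ]]; then
    # IPv6 literals must be wrapped in [] inside URLs
    printf '%s\n' "${chart//%\{KUBERNETES_API\}%/[${host}]:${port}}"
  else
    printf '%s\n' "${chart//%\{KUBERNETES_API\}%/${host}:${port}}"
  fi
}

substitute_api 'https://%{KUBERNETES_API}%/static/charts/demo.tgz' '::1' 6444
# -> https://[::1]:6444/static/charts/demo.tgz
substitute_api 'https://%{KUBERNETES_API}%/static/charts/demo.tgz' '127.0.0.1' 6444
# -> https://127.0.0.1:6444/static/charts/demo.tgz
```

So the script itself handles an IPv6 KUBERNETES_SERVICE_HOST correctly; the question is where the 127.0.0.1 value comes from.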
KUBERNETES_SERVICE_HOST is set by Kubernetes itself, to the address of the kubernetes service: kubectl get service -n default kubernetes -o wide
See the upstream docs for more information on service discovery environment variables:
https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables
Are you trying to use the helm controller to deploy cilium on k3s? Can you provide the actual content of the HelmChart resource that you are deploying?
Thank you @brandond.
kubectl get svc -owide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kubernetes ClusterIP 2001:470:ee86:4:10:43:0:1 <none> 443/TCP 10h <none>
I don't see the helm controller picking the correct address, then. KUBERNETES_SERVICE_HOST is somehow auto-filled and it's always 127.0.0.1, which is why I was wondering whether this value is hard-coded.
Yes, I am trying to use the helm controller to deploy cilium. The chart is here.
The oyaml of the pod that is stuck in the CrashLoopBackOff.
kubectl get pod helm-install-cilium-jd8pw -n kube-system -oyaml
apiVersion: v1
kind: Pod
metadata:
annotations:
helmcharts.helm.cattle.io/configHash: SHA256=58B430BF3EAC3F550E3E698901F5FE90E5813E65687253155DC57F03E6909F55
creationTimestamp: "2024-01-19T13:38:57Z"
finalizers:
- batch.kubernetes.io/job-tracking
generateName: helm-install-cilium-
labels:
batch.kubernetes.io/controller-uid: 9bb57e45-b3c6-4c21-b5dd-29b61b3f0118
batch.kubernetes.io/job-name: helm-install-cilium
controller-uid: 9bb57e45-b3c6-4c21-b5dd-29b61b3f0118
helmcharts.helm.cattle.io/chart: cilium
job-name: helm-install-cilium
name: helm-install-cilium-jd8pw
namespace: kube-system
ownerReferences:
- apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: Job
name: helm-install-cilium
uid: 9bb57e45-b3c6-4c21-b5dd-29b61b3f0118
resourceVersion: "17775"
uid: cd31640c-7bca-48b6-8e76-cac1e4a2a2f2
spec:
containers:
- args:
- install
- --namespace
- kube-system
- --version
- 1.14.5
env:
- name: NAME
value: cilium
- name: VERSION
value: 1.14.5
- name: REPO
value: https://helm.cilium.io/
- name: HELM_DRIVER
value: secret
- name: CHART_NAMESPACE
value: kube-system
- name: CHART
value: cilium/cilium
- name: HELM_VERSION
- name: TARGET_NAMESPACE
value: kube-system
- name: AUTH_PASS_CREDENTIALS
value: "false"
- name: KUBERNETES_SERVICE_HOST
value: 127.0.0.1
- name: KUBERNETES_SERVICE_PORT
value: "6444"
- name: BOOTSTRAP
value: "true"
- name: NO_PROXY
value: .svc,.cluster.local,2001:470:ee86:1000::/56,10.42.0.0/16,2001:470:ee86:4:10:43::/112,10.43.0.0/16
- name: FAILURE_POLICY
value: reinstall
image: rancher/klipper-helm:v0.8.2-build20230815
imagePullPolicy: IfNotPresent
name: helm
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /config
name: values
- mountPath: /chart
name: content
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-6mhth
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostNetwork: true
nodeName: k3s-m1
nodeSelector:
kubernetes.io/os: linux
node-role.kubernetes.io/control-plane: "true"
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: OnFailure
schedulerName: default-scheduler
securityContext: {}
serviceAccount: helm-cilium
serviceAccountName: helm-cilium
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: node.kubernetes.io/not-ready
- effect: NoSchedule
key: node.cloudprovider.kubernetes.io/uninitialized
operator: Equal
value: "true"
- key: CriticalAddonsOnly
operator: Exists
- effect: NoExecute
key: node-role.kubernetes.io/etcd
operator: Exists
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 20
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 20
volumes:
- name: values
secret:
defaultMode: 420
secretName: chart-values-cilium
- configMap:
defaultMode: 420
name: chart-content-cilium
name: content
- name: kube-api-access-6mhth
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2024-01-19T13:38:57Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2024-01-20T00:16:58Z"
message: 'containers with unready status: [helm]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2024-01-20T00:16:58Z"
message: 'containers with unready status: [helm]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2024-01-19T13:38:57Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://73c16c515f12a8ac89af117e4766478576bca73e4540216b0956820273c6b6e6
image: docker.io/rancher/klipper-helm:v0.8.2-build20230815
imageID: docker.io/rancher/klipper-helm@sha256:b0b0c4f73f2391697edb52adffe4fc490de1c8590606024515bb906b2813554a
lastState:
terminated:
containerID: containerd://73c16c515f12a8ac89af117e4766478576bca73e4540216b0956820273c6b6e6
exitCode: 1
finishedAt: "2024-01-20T00:16:58Z"
message: |
Installing helm_v3 chart
reason: Error
startedAt: "2024-01-20T00:16:56Z"
name: helm
ready: false
restartCount: 129
started: false
state:
waiting:
message: back-off 5m0s restarting failed container=helm pod=helm-install-cilium-jd8pw_kube-system(cd31640c-7bca-48b6-8e76-cac1e4a2a2f2)
reason: CrashLoopBackOff
hostIP: 2001:470:ee86:30:192:168:30:21
phase: Running
podIP: 2001:470:ee86:30:192:168:30:21
podIPs:
- ip: 2001:470:ee86:30:192:168:30:21
qosClass: BestEffort
startTime: "2024-01-19T13:38:57Z"
For bootstrap charts, KUBERNETES_SERVICE_HOST is filled in by the helm controller, and always points at the IPv4 loopback address:
helm-controller/pkg/controllers/chart/chart.go
Lines 556 to 557 in f9103f6
I know that this works on IPv6-only nodes, since the apiserver binds to a dual-stack wildcard address that is accessible via the IPv4 loopback even when the node does not have a valid IPv4 interface address.
Does your node for some reason not have an IPv4 loopback configured, or have you done something else to modify the apiserver configuration? Are you using the helm controller as part of k3s, rke2, or on some other cluster type? The helm controller is almost exclusively used with k3s and rke2, if you are running it standalone on some other cluster, the assumptions that it makes about apiserver availability when installing bootstrap charts may not be valid.
The oyaml of the pod that is stuck in the CrashLoopBackOff.
No, I was asking for the yaml of the HelmChart resource that you are using to deploy cilium.
I do see this though, which indicates that you are trying to install this as a bootstrap chart, which requires execution on a control-plane node with the apiserver available on port 6443:
- name: BOOTSTRAP
value: "true"
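For reference, a HelmChart resource that produces a bootstrap install job like the one above might look like this. This is a hypothetical sketch, not the reporter's actual manifest; `bootstrap: true` is the field that makes the controller pin KUBERNETES_SERVICE_HOST to the local loopback:

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: cilium
  namespace: kube-system
spec:
  repo: https://helm.cilium.io/
  chart: cilium
  version: 1.14.5
  targetNamespace: kube-system
  bootstrap: true          # run before the CNI is up, against the local apiserver
  valuesContent: |-
    ipv6:
      enabled: true
```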
I do have both IPv4 and IPv6 addresses on the node. As mentioned earlier, dual-stack works with the combination {ipv4},{ipv6}.
I didn't try IPv6 single-stack, but it's interesting to hear that it works.
I tried a few things: adding bind-address and advertise-address solved half the problem. The difference I see is that the listener is now on both IPv4 and IPv6.
ss -nltp | grep 6444
LISTEN 0 4096 *:6444 *:* users:(("k3s-server",pid=6615,fd=19))
And my master node status:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k3s-m1 Ready control-plane,master 24m v1.28.5+k3s1 2001:470:ee86:30:192:168:30:21 <none> Ubuntu 22.04.3 LTS 5.15.0-91-generic containerd://1.7.11-k3s2
The other half of the problem is with the agent. I don't think it is related to this issue, but I would love to hear some suggestions.
systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2024-01-20 19:09:34 +08; 4s ago
       Docs: https://k3s.io
    Process: 3947 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, sta>
    Process: 3949 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 3950 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 3951 (k3s-agent)
      Tasks: 9
     Memory: 21.5M
        CPU: 113ms
     CGroup: /system.slice/k3s.service
             └─3951 "/usr/local/bin/k3s agent" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">
Jan 20 19:09:34 k3s-w1 systemd[1]: Starting Lightweight Kubernetes...
Jan 20 19:09:34 k3s-w1 sh[3947]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Jan 20 19:09:34 k3s-w1 sh[3948]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jan 20 19:09:34 k3s-w1 systemd[1]: Started Lightweight Kubernetes.
Jan 20 19:09:34 k3s-w1 k3s[3951]: time="2024-01-20T19:09:34+08:00" level=info msg="Starting k3s agent v1.28.5+k3s1 (5b2d127>
Jan 20 19:09:34 k3s-w1 k3s[3951]: time="2024-01-20T19:09:34+08:00" level=info msg="Adding server to load balancer k3s-agent>
Jan 20 19:09:34 k3s-w1 k3s[3951]: time="2024-01-20T19:09:34+08:00" level=info msg="Running load balancer k3s-agent-load-bal>
Jan 20 19:09:34 k3s-w1 k3s[3951]: time="2024-01-20T19:09:34+08:00" level=error msg="failed to get CA certs: Get \"https://1>
Jan 20 19:09:36 k3s-w1 k3s[3951]: time="2024-01-20T19:09:36+08:00" level=error msg="failed to get CA certs: Get \"https://1>
Jan 20 19:09:38 k3s-w1 k3s[3951]: time="2024-01-20T19:09:38+08:00" level=error msg="failed to get CA certs: Get \"https://1>
The same log lines in full:
:00" level=info msg="Starting k3s agent v1.28.5+k3s1 (5b2d1271)"
:00" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 2001:470:ee86:30:192:168:30:21:6443"
:00" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [2001:470:ee86:30:192:168:30:21:6443] >
:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:59438->127.0.0.1:6>
:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": EOF"
The listener on the agent is IPv4-only instead of IPv4/IPv6 :( I believe it should be the same as on the server.
ss -nltp | grep 6444
LISTEN 0 4096 127.0.0.1:6444 0.0.0.0:* users:(("k3s-agent",pid=3951,fd=7))
My agent config.yaml is below.
cat /etc/rancher/k3s/config.yaml
kubelet-arg:
- image-gc-high-threshold=55
- image-gc-low-threshold=50
- 'node-ip=::'
node-ip: 2001:470:ee86:30:192:168:30:22,192.168.30.22
pause-image: registry.k8s.io/pause:3.9
@brandond thank you so much for all the inputs. I think I figured out the issue with the agent: it was missing [] around the IPv6 address in the server URL. After adding that, it's working.
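For anyone hitting the same thing, a sketch of the fix in the agent's config (addresses are the examples from this thread; the token-file path is illustrative). An IPv6 literal in a URL must be bracketed, otherwise its colons are parsed as host/port separators:

```yaml
# /etc/rancher/k3s/config.yaml on the agent (illustrative sketch)
# Wrong: server: https://2001:470:ee86:30:192:168:30:21:6443
server: https://[2001:470:ee86:30:192:168:30:21]:6443
token-file: /etc/rancher/k3s/token
node-ip: 2001:470:ee86:30:192:168:30:22,192.168.30.22
```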
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k3s-m1 Ready control-plane,master 4m3s v1.28.5+k3s1 2001:470:ee86:30:192:168:30:21 <none> Ubuntu 22.04.3 LTS 5.15.0-91-generic containerd://1.7.11-k3s2
k3s-w1 Ready <none> 3m32s v1.28.5+k3s1 2001:470:ee86:30:192:168:30:22 <none> Ubuntu 22.04.3 LTS 5.15.0-91-generic containerd://1.7.11-k3s2