`calico-kube-controllers` fails to reach `10.96.0.1:443` on startup
Bolodya1997 opened this issue · 3 comments
Environment
- Calico/VPP version: v0.16.0-calicov3.20.0
- Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:56:19Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
- Deployment type: bare-metal on equinix.metal, host node OS is Ubuntu 20.04 LTS
- Network configuration:
control plane node
---
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether e4:43:4b:6d:60:40 brd ff:ff:ff:ff:ff:ff
4: eno3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether e4:43:4b:6d:60:40 brd ff:ff:ff:ff:ff:ff
5: eno4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether e4:43:4b:6d:60:43 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether e4:43:4b:6d:60:40 brd ff:ff:ff:ff:ff:ff
inet 147.75.38.85/31 brd 255.255.255.255 scope global bond0
valid_lft forever preferred_lft forever
inet 10.99.35.131/31 brd 255.255.255.255 scope global bond0:0
valid_lft forever preferred_lft forever
inet6 2604:1380:0:2c00::3/127 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e643:4bff:fe6d:6040/64 scope link
valid_lft forever preferred_lft forever
7: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:ee:ab:db:8d brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
8: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN group default qlen 1000
link/ether e4:43:4b:6d:60:41 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.1/30 brd 10.0.0.3 scope global eno2
valid_lft forever preferred_lft forever
inet6 fe80::e643:4bff:fe6d:6041/64 scope link
valid_lft forever preferred_lft forever
worker node
---
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether e4:43:4b:5f:6d:50 brd ff:ff:ff:ff:ff:ff
4: eno3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether e4:43:4b:5f:6d:50 brd ff:ff:ff:ff:ff:ff
5: eno4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether e4:43:4b:5f:6d:53 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether e4:43:4b:5f:6d:50 brd ff:ff:ff:ff:ff:ff
inet 147.75.75.133/31 brd 255.255.255.255 scope global bond0
valid_lft forever preferred_lft forever
inet 10.99.35.129/31 brd 255.255.255.255 scope global bond0:0
valid_lft forever preferred_lft forever
inet6 2604:1380:0:2c00::1/127 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e643:4bff:fe5f:6d50/64 scope link
valid_lft forever preferred_lft forever
7: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:15:76:3c:b8 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
8: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN group default qlen 1000
link/ether e4:43:4b:5f:6d:51 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.2/30 brd 10.0.0.3 scope global eno2
valid_lft forever preferred_lft forever
inet6 fe80::e643:4bff:fe5f:6d51/64 scope link
valid_lft forever preferred_lft forever
Control plane node `eno2` and worker node `eno2` are in the same untagged VLAN.
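Before installing Calico/VPP, connectivity between the two `eno2` addresses over that VLAN can be confirmed with something like the following (a minimal sketch, not part of the original report):
$ ping -c 3 -I eno2 10.0.0.2   # from the control plane node
$ ping -c 3 -I eno2 10.0.0.1   # from the worker node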
Issue description
After applying `calico-vpp-nohuge.yaml` with `vpp_dataplane_interface: eno2` additionally configured, `calico-kube-controllers` ends up in `CrashLoopBackOff` with the following error in its logs:
$ kubectl -n kube-system logs calico-kube-controllers-58497c65d5-xm54w
2021-08-31 08:12:45.578 [INFO][1] main.go 94: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0831 08:12:45.580376 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2021-08-31 08:12:45.581 [INFO][1] main.go 115: Ensuring Calico datastore is initialized
2021-08-31 08:12:55.581 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-08-31 08:12:55.581 [FATAL][1] main.go 120: Failed to initialize Calico datastore error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
To Reproduce
Steps to reproduce the behavior:
- Set up 2 n2.xlarge.x86 servers with Ubuntu 20.04 LTS on https://metal.equinix.com/ - or just set up any 2 bare-metal servers with Ubuntu 20.04 LTS.
- Configure both of them to have:
  - A public IPv4 address on one interface (here it is `bond0`, with `147.75.38.85/31` for the control plane node and `147.75.75.133/31` for the worker node).
  - Local IPv4 addresses in the same subnet on another interface (here it is `eno2`, with `10.0.0.1/30` for the control plane node and `10.0.0.2/30` for the worker node) - on Equinix Metal this can be done with a single VLAN assigned to the corresponding interfaces. A sketch of this address assignment is given after this list.
- Configure docker on both nodes:
#!/bin/bash
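# Set Docker's cgroup driver to systemd so it matches the driver the kubelet expects.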
mkdir -p /etc/docker
echo \
'{
"exec-opts": ["native.cgroupdriver=systemd"]
}' >/etc/docker/daemon.json
- Install environment on both nodes:
#!/bin/sh
KUBERNETES_VERSION=1.21.1-00
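# Add the upstream Kubernetes apt repository, then install docker.io and the pinned kubelet/kubectl/kubeadm packages.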
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
apt-get update
apt-get install -y docker.io
apt-get install -qy kubelet="${KUBERNETES_VERSION}" kubectl="${KUBERNETES_VERSION}" kubeadm="${KUBERNETES_VERSION}"
systemctl daemon-reload
systemctl restart kubelet
swapoff --all
- Start kubernetes cluster on control plane node:
#!/bin/sh
set -e
K8S_DIR=$(dirname "$0")
KUBERNETES_INIT_VERSION=1.21.1
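# 192.168.0.0/16 below is the default Calico pod CIDR; the advertise address is this node's public IP.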
kubeadm init \
--kubernetes-version "${KUBERNETES_INIT_VERSION}" \
--pod-network-cidr=192.168.0.0/16 \
--skip-token-print \
--apiserver-advertise-address=147.75.38.85 # use here your control plane node public IP address
mkdir -p "$HOME"/.kube
sudo cp -f /etc/kubernetes/admin.conf "$HOME"/.kube/config
sudo chown "$(id -u):$(id -g)" "$HOME"/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
kubeadm token create --print-join-command > "${K8S_DIR}/join-cluster.sh"
- Copy the `join-cluster.sh` script from the control plane node to the worker node and run it.
- Set up the control plane node to use `10.0.0.1` as its node IP:
#!/bin/sh
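# Append --node-ip=10.0.0.1 to KUBELET_KUBEADM_ARGS in kubeadm-flags.env so the kubelet registers with the private address.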
sed -Ei 's/(.*)"/\1 --node-ip=10\.0\.0\.1"/g' /var/lib/kubelet/kubeadm-flags.env
systemctl restart kubelet
- Set up the worker node to use `10.0.0.2` as its node IP:
#!/bin/sh
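# Same as on the control plane node, but registering the worker with --node-ip=10.0.0.2.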
sed -Ei 's/(.*)"/\1 --node-ip=10\.0\.0\.2"/g' /var/lib/kubelet/kubeadm-flags.env
systemctl restart kubelet
- Copy `~/.kube/config` from the control plane node to your own host.
- Edit `calico-vpp-nohuge.yaml` (an editing sketch is given after this list) with:
vpp_dataplane_interface: eno2 # use here the interface name that carries the local IPv4 address on the nodes
- Run `kubectl apply -f calico-vpp-nohuge.yaml` from your own host.
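For the local-IPv4-address step above, a minimal (non-persistent) sketch of the address assignment, assuming plain ip commands rather than netplan:
#!/bin/sh
# On the control plane node:
ip link set eno2 up
ip addr add 10.0.0.1/30 dev eno2
# On the worker node:
ip link set eno2 up
ip addr add 10.0.0.2/30 dev eno2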
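For the `calico-vpp-nohuge.yaml` editing step, one possible way to make the change non-interactively (a sketch; it assumes the manifest has already been downloaded and that vpp_dataplane_interface appears exactly once in it):
#!/bin/sh
# Point the VPP dataplane at the interface that carries the local IPv4 address (eno2 here).
sed -i 's/vpp_dataplane_interface: .*/vpp_dataplane_interface: eno2/' calico-vpp-nohuge.yaml
# Double-check the value before running kubectl apply.
grep vpp_dataplane_interface calico-vpp-nohuge.yaml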
Expected behavior
All Calico pods should start running (possibly with a few restarts).
Additional context
- The same setup with `weave` as the CNI works correctly.
- The control plane node is able to reach `10.96.0.1:443`:
$ nc -vw 2 10.96.0.1 443
Connection to 10.96.0.1 443 port [tcp/https] succeeded!
- The worker node is able to reach `10.96.0.1:443`:
$ nc -vw 2 10.96.0.1 443
Connection to 10.96.0.1 443 port [tcp/https] succeeded!
- None of the pods with `hostNetwork: false` in `kube-system` is able to reach `10.96.0.1:443`:
$ kubectl -n kube-system exec alpine -- nc -vw 2 10.96.0.1 443
nc: 10.96.0.1 (10.96.0.1:443): Operation timed out
command terminated with exit code 1
- `kubectl get` output:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master-1 Ready control-plane,master 177m v1.21.1 10.0.0.1 <none> Ubuntu 20.04.3 LTS 5.4.0-81-generic docker://20.10.7
worker Ready <none> 176m v1.21.1 10.0.0.2 <none> Ubuntu 20.04.3 LTS 5.4.0-81-generic docker://20.10.7
$ kubectl -n calico-vpp-dataplane get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-vpp-node-bt5md 2/2 Running 0 174m 10.0.0.2 worker <none> <none>
calico-vpp-node-ssbr4 2/2 Running 0 174m 10.0.0.1 master-1 <none> <none>
$ kubectl -n kube-system get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alpine 1/1 Running 0 12m 192.168.171.68 worker <none> <none>
calico-kube-controllers-58497c65d5-xm54w 0/1 CrashLoopBackOff 39 3h2m 192.168.171.67 worker <none> <none>
calico-node-4hws2 1/1 Running 0 3h2m 10.0.0.2 worker <none> <none>
calico-node-xrv7h 1/1 Running 0 3h2m 10.0.0.1 master-1 <none> <none>
coredns-558bd4d5db-dj9qx 0/1 Running 0 3h6m 192.168.171.65 worker <none> <none>
coredns-558bd4d5db-xsqdf 0/1 Running 0 3h6m 192.168.171.66 worker <none> <none>
etcd-master-1 1/1 Running 0 3h6m 10.0.0.1 master-1 <none> <none>
kube-apiserver-master-1 1/1 Running 0 3h6m 10.0.0.1 master-1 <none> <none>
kube-controller-manager-master-1 1/1 Running 0 3h6m 10.0.0.1 master-1 <none> <none>
kube-proxy-rpdrr 1/1 Running 0 3h6m 10.0.0.2 worker <none> <none>
kube-proxy-s74x9 1/1 Running 0 3h6m 10.0.0.1 master-1 <none> <none>
kube-scheduler-master-1 1/1 Running 0 3h6m 10.0.0.1 master-1 <none> <none>
$ kubectl get svc --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 3h10m
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 3h10m
- Both `coredns` pods fail to become ready, with the following logs:
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.0
linux/amd64, go1.15.3, 054c9ae
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:60256->147.75.207.207:53: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:49716->147.75.207.208:53: i/o timeout
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:59029->147.75.207.208:53: i/o timeout
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:54880->147.75.207.208:53: i/o timeout
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:39632->147.75.207.208:53: i/o timeout
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:38548->147.75.207.208:53: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:60599->147.75.207.207:53: i/o timeout
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:45834->147.75.207.208:53: i/o timeout
I0831 05:59:55.406322 1 trace.go:205] Trace[469339106]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (31-Aug-2021 05:59:25.405) (total time: 30000ms):
Trace[469339106]: [30.000806943s] [30.000806943s] END
E0831 05:59:55.406381 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0831 05:59:55.406397 1 trace.go:205] Trace[436340495]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (31-Aug-2021 05:59:25.405) (total time: 30000ms):
Trace[436340495]: [30.000800448s] [30.000800448s] END
E0831 05:59:55.406443 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0831 05:59:55.406501 1 trace.go:205] Trace[774965466]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (31-Aug-2021 05:59:25.405) (total time: 30000ms):
Trace[774965466]: [30.000768124s] [30.000768124s] END
E0831 05:59:55.406562 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:56910->147.75.207.208:53: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[ERROR] plugin/errors: 2 6061867283059196170.1749449744353646990. HINFO: read udp 192.168.171.65:41558->147.75.207.208:53: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
I0831 06:00:26.444201 1 trace.go:205] Trace[443632888]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (31-Aug-2021 05:59:56.443) (total time: 30000ms):
Trace[443632888]: [30.000714943s] [30.000714943s] END
E0831 06:00:26.444300 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0831 06:00:26.663892 1 trace.go:205] Trace[1496193015]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (31-Aug-2021 05:59:56.663) (total time: 30000ms):
Trace[1496193015]: [30.000666081s] [30.000666081s] END
E0831 06:00:26.663926 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0831 06:00:26.898441 1 trace.go:205] Trace[60780408]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (31-Aug-2021 05:59:56.897) (total time: 30000ms):
Trace[60780408]: [30.00063881s] [30.00063881s] END
E0831 06:00:26.898474 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
I0831 06:00:58.754593 1 trace.go:205] Trace[1304066831]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (31-Aug-2021 06:00:28.752) (total time: 30002ms):
Trace[1304066831]: [30.002491082s] [30.002491082s] END
E0831 06:00:58.754645 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0831 06:00:59.102994 1 trace.go:205] Trace[170625356]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (31-Aug-2021 06:00:29.102) (total time: 30000ms):
Trace[170625356]: [30.000669079s] [30.000669079s] END
E0831 06:00:59.103055 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
...
I am not sure whether this issue belongs here or under projectcalico/calico, so please correct me if I am wrong.
Hi @Bolodya1997 , you're in the right place 🙂
Thanks for the detailed report, this is very helpful! I noticed you used a public IP for the apiserver (`--apiserver-advertise-address=147.75.38.85` in `kubeadm init`). Could you try with the IP address configured on `eno2` on the master (`10.0.0.1`)?
If that doesn't fix the issue, it would be helpful if you could install `calivppctl` and attach the output of `calivppctl export`: https://docs.projectcalico.org/maintenance/troubleshoot/vpp
Thank you!
I was afraid that the k8s API server would listen only on the IP address given in `--apiserver-advertise-address=`, but it actually listens on all IP addresses.
So after changing this property to `10.0.0.1`, the only remaining question is how to make the TLS certificates work for the public IP, but that can be easily solved following this guide - https://blog.scottlowe.org/2019/07/30/adding-a-name-to-kubernetes-api-server-certificate/.
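For anyone who ends up here with the same setup, a minimal sketch of what the adjusted kubeadm init could look like - advertising the API server on the private eno2 address while keeping the public IP usable from outside (--apiserver-cert-extra-sans is one way to achieve what the linked blog post describes; the values below are the ones from this report):
#!/bin/sh
# Advertise the API server on the private (eno2) address used as the node IP ...
# ... and add the public IP as an extra SAN so the TLS certificate stays valid for it.
kubeadm init \
  --kubernetes-version 1.21.1 \
  --pod-network-cidr=192.168.0.0/16 \
  --skip-token-print \
  --apiserver-advertise-address=10.0.0.1 \
  --apiserver-cert-extra-sans=147.75.38.85
On an already-initialized cluster, the linked blog post covers the equivalent change (adding the public IP to the API server certificate's SANs) without re-running kubeadm init.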