kinvolk/kube-spawn

fails to start with a timeout with Kubernetes 1.11

alban opened this issue · 5 comments

alban commented

To Reproduce:

  • Install Fedora 28 from https://cloud.fedoraproject.org/ (GP2 image) on AWS:
    • m4.large
    • Disk: at least 50GiB
    • ssh: ssh -i ~/.ssh/$KEY fedora@$IP
  • Start a kube-spawn Kubernetes cluster on the AWS EC2 instance:
export KUBERNETES_VERSION=v1.9.9 # or other version
export KUBERNETES_VERSION=v1.10.5 # or other version
export KUBERNETES_VERSION=v1.11.0 # or other version
export KUBE_SPAWN_VERSION=master # FIXME

## Workarounds
sudo setenforce 0

## Install dependencies
sudo dnf install -y btrfs-progs git go iptables libselinux-utils polkit qemu-img systemd-container make docker
mkdir go
export GOPATH=$HOME/go
curl -fsSL -O https://github.com/containernetworking/plugins/releases/download/v0.6.0/cni-plugins-amd64-v0.6.0.tgz
sudo mkdir -p /opt/cni/bin
sudo tar -C /opt/cni/bin -xvf cni-plugins-amd64-v0.6.0.tgz
sudo curl -Lo /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${KUBERNETES_VERSION}/bin/linux/amd64/kubectl
sudo chmod +x /usr/local/bin/kubectl
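
## Optional sanity check (not in the original steps): confirm kubectl was installed correctly
kubectl version --client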

## Compile and install
mkdir -p $GOPATH/src/github.com/kinvolk
cd $GOPATH/src/github.com/kinvolk
git clone https://github.com/kinvolk/kube-spawn.git
cd kube-spawn/
git checkout $KUBE_SPAWN_VERSION
make DOCKERIZED=n
sudo make install

## First attempt to use kube-spawn
cd
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3
sudo -E kube-spawn destroy

## Workaround for "no space left on device": https://github.com/kinvolk/kube-spawn/issues/281
sudo umount /var/lib/machines
sudo qemu-img resize -f raw /var/lib/machines.raw $((10*1024*1024*1024))
sudo mount -t btrfs -o loop /var/lib/machines.raw /var/lib/machines
sudo btrfs filesystem resize max /var/lib/machines
sudo btrfs quota disable /var/lib/machines
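
## Optional check (not in the original steps): verify the machines pool actually grew
df -h /var/lib/machines
sudo btrfs filesystem show /var/lib/machines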

## Start kube-spawn
cd
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3

Then the error message:

Download of https://alpha.release.flatcar-linux.net/amd64-usr/current/flatcar_developer_container.bin.bz2 complete.
Created new local image 'flatcar'.
Operation completed successfully.
Exiting.
nf_conntrack module is not loaded: stat /sys/module/nf_conntrack/parameters/hashsize: no such file or directory
Warning: nf_conntrack module is not loaded.
loading nf_conntrack module... 
making iptables FORWARD chain defaults to ACCEPT...
setting iptables rule to allow CNI traffic...
Starting 3 nodes in cluster default ...
Waiting for machine kube-spawn-default-worker-fjxan9 to start up ...
Waiting for machine kube-spawn-default-master-5y7clq to start up ...
Waiting for machine kube-spawn-default-worker-2ujr2f to start up ...
Started kube-spawn-default-worker-2ujr2f
Bootstrapping kube-spawn-default-worker-2ujr2f ...
Started kube-spawn-default-master-5y7clq
Bootstrapping kube-spawn-default-master-5y7clq ...
Cluster "default" started
Failed to start machine kube-spawn-default-worker-fjxan9: timeout waiting for "kube-spawn-default-worker-fjxan9" to start
Note: `kubeadm init` can take several minutes
master-5y7clq I0630 14:22:29.999557     380 feature_gate.go:230] feature gates: &{map[]}
              [init] using Kubernetes version: v1.11.0
              [preflight] running pre-flight checks
              [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
              [WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
              [WARNING FileExisting-crictl]: crictl not found in system path
              I0630 14:22:30.050775     380 kernel_validator.go:81] Validating kernel version
              I0630 14:22:30.051083     380 kernel_validator.go:96] Validating kernel config
              [WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.05.0-ce. Max validated version: 17.03
              [WARNING Hostname]: hostname "kube-spawn-default-master-5y7clq" could not be reached
              [WARNING Hostname]: hostname "kube-spawn-default-master-5y7clq" lookup kube-spawn-default-master-5y7clq on 8.8.8.8:53: no such host
              [preflight/images] Pulling images required for setting up a Kubernetes cluster
              [preflight/images] This might take a minute or two, depending on the speed of your internet connection
              [preflight/images] You can also perform this action in beforehand using 'kubeadm config images pull'
              [kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
              [kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
              [preflight] Activating the kubelet service
              [certificates] Generated ca certificate and key.
              [certificates] Generated apiserver certificate and key.
              [certificates] apiserver serving cert is signed for DNS names [kube-spawn-default-master-5y7clq kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.22.0.3]
              [certificates] Generated apiserver-kubelet-client certificate and key.
              [certificates] Generated sa key and public key.
              [certificates] Generated front-proxy-ca certificate and key.
              [certificates] Generated front-proxy-client certificate and key.
              [certificates] Generated etcd/ca certificate and key.
              [certificates] Generated etcd/server certificate and key.
              [certificates] etcd/server serving cert is signed for DNS names [kube-spawn-default-master-5y7clq localhost] and IPs [127.0.0.1 ::1]
              [certificates] Generated etcd/peer certificate and key.
              [certificates] etcd/peer serving cert is signed for DNS names [kube-spawn-default-master-5y7clq localhost] and IPs [10.22.0.3 127.0.0.1 ::1]
              [certificates] Generated etcd/healthcheck-client certificate and key.
              [certificates] Generated apiserver-etcd-client certificate and key.
              [certificates] valid certificates and keys now exist in "/etc/kubernetes/pki"
              [kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
              [kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
              [kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
              [kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"
              [controlplane] wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
              [controlplane] wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
              [controlplane] wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
              [etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
              [init] waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests"
              [init] this might take a minute or longer if the control plane images have to be pulled
              [apiclient] All control plane components are healthy after 42.001677 seconds
              [uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
              [kubelet] Creating a ConfigMap "kubelet-config-1.11" in namespace kube-system with the configuration for the kubelets in the cluster
              [markmaster] Marking the node kube-spawn-default-master-5y7clq as master by adding the label "node-role.kubernetes.io/master=''"
              [markmaster] Marking the node kube-spawn-default-master-5y7clq as master by adding the taints [node-role.kubernetes.io/master:NoSchedule]
              [patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "kube-spawn-default-master-5y7clq" as an annotation
              [bootstraptoken] using token: 1o71nu.v7s48wncryhbdmm7
              [bootstraptoken] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
              [bootstraptoken] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
              [bootstraptoken] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
              [bootstraptoken] creating the "cluster-info" ConfigMap in the "kube-public" namespace
              [addons] Applied essential addon: CoreDNS
              [addons] Applied essential addon: kube-proxy
              Your Kubernetes master has initialized successfully!
              To start using your cluster, you need to run the following as a regular user:
              mkdir -p $HOME/.kube
              sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
              sudo chown $(id -u):$(id -g) $HOME/.kube/config
              You should now deploy a pod network to the cluster.
              Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
              https://kubernetes.io/docs/concepts/cluster-administration/addons/
              You can now join any number of machines by running the following on each node
              as root:
              kubeadm join 10.22.0.3:6443 --token 1o71nu.v7s48wncryhbdmm7 --discovery-token-ca-cert-hash sha256:c8ac2337adc7ed01725bed7d78605661dc759257fce213838f1cb89486fe263c
              I0630 14:23:47.569329    1140 feature_gate.go:230] feature gates: &{map[]}
              aaaaaa.bbbbbbbbbbbbbbbb
              serviceaccount/weave-net created
              clusterrole.rbac.authorization.k8s.io/weave-net created
              clusterrolebinding.rbac.authorization.k8s.io/weave-net created
              daemonset.extensions/weave-net created
worker-2ujr2f [preflight] running pre-flight checks
              [WARNING RequiredIPVSKernelModulesAvailable]: the IPVS proxier will not be used, because the following required kernel modules are not loaded: [ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh] or no builtin kernel ipvs support: map[ip_vs:{} ip_vs_rr:{} ip_vs_wrr:{} ip_vs_sh:{} nf_conntrack_ipv4:{}]
              you can solve this problem with following methods:
              1. Run 'modprobe -- ' to load missing kernel modules;
              2. Provide the missing builtin kernel ipvs support
              [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
              [WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
              [WARNING FileExisting-crictl]: crictl not found in system path
              I0630 14:23:49.919029     449 kernel_validator.go:81] Validating kernel version
              I0630 14:23:49.919338     449 kernel_validator.go:96] Validating kernel config
              [WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.05.0-ce. Max validated version: 17.03
              [WARNING Hostname]: hostname "kube-spawn-default-worker-2ujr2f" could not be reached
              [WARNING Hostname]: hostname "kube-spawn-default-worker-2ujr2f" lookup kube-spawn-default-worker-2ujr2f on 8.8.8.8:53: no such host
              [discovery] Trying to connect to API Server "10.22.0.3:6443"
              [discovery] Created cluster-info discovery client, requesting info from "https://10.22.0.3:6443"
              [discovery] Failed to connect to API Server "10.22.0.3:6443": token id "aaaaaa" is invalid for this cluster or it has expired. Use "kubeadm token create" on the master node to creating a new valid token
              [discovery] Trying to connect to API Server "10.22.0.3:6443"
              [discovery] Created cluster-info discovery client, requesting info from "https://10.22.0.3:6443"
              [discovery] Cluster info signature and contents are valid and no TLS pinning was specified, will use API Server "10.22.0.3:6443"
              [discovery] Successfully established connection with API Server "10.22.0.3:6443"
              [kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.11" ConfigMap in the kube-system namespace
              [kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
              [kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
              [preflight] Activating the kubelet service
              [tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
              [patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "kube-spawn-default-worker-2ujr2f" as an annotation
              This node has joined the cluster:
              * Certificate signing request was sent to master and a response
              was received.
              * The Kubelet was informed of the new secure connection details.
              Run 'kubectl get nodes' on the master to see this node join the cluster.
Failed to start cluster: provisioning the worker nodes with kubeadm didn't succeed

More debug info:

$ kubectl get nodes
NAME                               STATUS    ROLES     AGE       VERSION
kube-spawn-default-master-5y7clq   Ready     master    1m        v1.11.0
kube-spawn-default-worker-2ujr2f   Ready     <none>    46s       v1.11.0
$ machinectl 
MACHINE                          CLASS     SERVICE        OS      VERSION  ADDRESSES
kube-spawn-default-master-5y7clq container systemd-nspawn flatcar 1814.0.0 10.22.0.3...
kube-spawn-default-worker-2ujr2f container systemd-nspawn flatcar 1814.0.0 10.22.0.2...

2 machines listed.
$ df -h /var/lib/machines
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0       10G  1.7G  7.8G  18% /var/lib/machines

The third machine does not exist anymore?
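
One way to check what the failed machine left behind (plain systemd/machinectl tooling, not kube-spawn-specific; the machine name is taken from the log above):

machinectl list-images
sudo journalctl -b --no-pager | grep kube-spawn-default-worker-fjxan9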

alban commented

After a second attempt, it works.

I get this timeout just as @alban described, except it's reproducible every time.

$ kube-spawn start
Warning: kube-proxy could crash due to insufficient nf_conntrack hashsize.
setting nf_conntrack hashsize to 131072... 
making iptables FORWARD chain defaults to ACCEPT...
new poolSize to be : 5490739200
Starting 3 nodes in cluster default ...
Waiting for machine kube-spawn-default-worker-naz6fc to start up ...
Waiting for machine kube-spawn-default-master-yz3twq to start up ...
Waiting for machine kube-spawn-default-worker-u5fu6n to start up ...
Failed to start machine kube-spawn-default-master-yz3twq: timeout waiting for "kube-spawn-default-master-yz3twq" to start
Failed to start machine kube-spawn-default-worker-naz6fc: timeout waiting for "kube-spawn-default-worker-naz6fc" to start
Failed to start cluster: starting the cluster didn't succeed

Note:

  1. I hit the same timeout regardless of whether I destroy the cluster and start again, or mount a freshly formatted btrfs volume and redo the whole thing.
  2. The first time I launched kube-spawn it was with a manually formatted and mounted btrfs volume; that's when it complained that "machine.raw" was not found. I unmounted and re-ran, so systemd-nspawn did its job and created a machine.raw itself. When I re-spawned the cluster afterwards it obviously no longer complained about the .raw file, but it timed out regardless.
  3. Even though I've been through the troubleshooting.md guide, SELinux has been a pain and I've had to create about a dozen policies and semanage it all. Not the cake I was digging. pfft

For debugging, is there anywhere this thing logs to?
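
For reference, a few generic places that should have relevant output (an assumption on my part: standard systemd tooling plus the state directory mentioned further down, rather than a dedicated kube-spawn log; <machine-name> is a placeholder for one of the node names):

sudo ls -R /var/lib/kube-spawn/clusters/default        # cluster state kube-spawn writes out
sudo journalctl -M <machine-name> -u kubelet           # kubelet journal, only for a node that actually started
sudo machinectl shell <machine-name>                   # drop into a node to poke around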


  • kube-spawn v0.3.0
  • FS:
/dev/loop2     btrfs      40G  1.7G   39G   5% /var/lib/machines

OR 

/dev/sda4      btrfs      56G  1.7G   54G   4% /var/lib/machines
  • systemd-container-238-10.git438ac26.fc28.x86_64
  • qemu-img-2.11.2-4.fc28.x86_64
  • machinectl pool limit of 40G with the loopback mount (also evident in the df output above); see the sketch after this list:
# machinectl show
PoolPath=/var/lib/machines
PoolUsage=1866190848
PoolLimit=42949672960
  • OS: Linux 4.18.17-200.fc28.x86_64 GNU/Linux
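
For reference, the 40G pool limit shown above is ordinary machinectl behaviour and can be adjusted independently of kube-spawn, e.g.:

sudo machinectl set-limit 40G        # overall limit for the /var/lib/machines pool
sudo machinectl show                 # confirm PoolUsage / PoolLimit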

OK, never mind.

all I had to do was:

  1. export KUBERNETES_VERSION=v1.12.0 (I hadn't set this before the create step earlier)
  2. kube-spawn destroy
  3. kube-spawn create (this time, it populated /var/lib/kube-spawn/clusters. It was an empty trail of subdirs earlier.)
  4. kube-spawn start

and it works. jeez
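
For completeness, the working sequence as one snippet (flags assumed to match the reproduction steps at the top of this issue):

export KUBERNETES_VERSION=v1.12.0
sudo -E kube-spawn destroy
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3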

Seems to be related to #325.

> Seems to be related to #325.

Sure, except I didn't destroy it first. I got the timeout from start as per #282 (comment), so to speak right after creating the cluster, and then resolved the issue with #282 (comment).

Apologies if the order in step 2 of the resolution comment created any confusion.

also I can't reproduce it now. :/