kube-spawn fails to start with a timeout on Kubernetes 1.11
alban opened this issue · 5 comments
alban commented
To Reproduce:
- Install Fedora 28 from https://cloud.fedoraproject.org/ (GP2 image) on AWS:
- m4.large
- Disk: at least 50GiB
- ssh:
ssh -i ~/.ssh/$KEY fedora@$IP
- Start a kube-spawn Kubernetes cluster on the AWS EC2 instance:
# Pick one Kubernetes version to test (other versions may work too):
export KUBERNETES_VERSION=v1.9.9
export KUBERNETES_VERSION=v1.10.5
export KUBERNETES_VERSION=v1.11.0
export KUBE_SPAWN_VERSION=master # FIXME
## Workarounds
sudo setenforce 0
## Install dependencies
sudo dnf install -y btrfs-progs git go iptables libselinux-utils polkit qemu-img systemd-container make docker
mkdir go
export GOPATH=$HOME/go
curl -fsSL -O https://github.com/containernetworking/plugins/releases/download/v0.6.0/cni-plugins-amd64-v0.6.0.tgz
sudo mkdir -p /opt/cni/bin
sudo tar -C /opt/cni/bin -xvf cni-plugins-amd64-v0.6.0.tgz
sudo curl -Lo /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${KUBERNETES_VERSION}/bin/linux/amd64/kubectl
sudo chmod +x /usr/local/bin/kubectl
## Compile and install
mkdir -p $GOPATH/src/github.com/kinvolk
cd $GOPATH/src/github.com/kinvolk
git clone https://github.com/kinvolk/kube-spawn.git
cd kube-spawn/
git checkout $KUBE_SPAWN_VERSION
make DOCKERIZED=n
sudo make install
## First attempt to use kube-spawn
cd
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3
sudo -E kube-spawn destroy
## Workaround for "no space left on device": https://github.com/kinvolk/kube-spawn/issues/281
sudo umount /var/lib/machines
sudo qemu-img resize -f raw /var/lib/machines.raw $((10*1024*1024*1024))
sudo mount -t btrfs -o loop /var/lib/machines.raw /var/lib/machines
sudo btrfs filesystem resize max /var/lib/machines
sudo btrfs quota disable /var/lib/machines
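## Optional sanity check, assuming the resize and remount above succeeded: the pool should now report about 10G
df -h /var/lib/machines
sudo btrfs filesystem usage /var/lib/machines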
## Start kube-spawn
cd
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3
Then this error appears:
Download of https://alpha.release.flatcar-linux.net/amd64-usr/current/flatcar_developer_container.bin.bz2 complete.
Created new local image 'flatcar'.
Operation completed successfully.
Exiting.
nf_conntrack module is not loaded: stat /sys/module/nf_conntrack/parameters/hashsize: no such file or directory
Warning: nf_conntrack module is not loaded.
loading nf_conntrack module...
making iptables FORWARD chain defaults to ACCEPT...
setting iptables rule to allow CNI traffic...
Starting 3 nodes in cluster default ...
Waiting for machine kube-spawn-default-worker-fjxan9 to start up ...
Waiting for machine kube-spawn-default-master-5y7clq to start up ...
Waiting for machine kube-spawn-default-worker-2ujr2f to start up ...
Started kube-spawn-default-worker-2ujr2f
Bootstrapping kube-spawn-default-worker-2ujr2f ...
Started kube-spawn-default-master-5y7clq
Bootstrapping kube-spawn-default-master-5y7clq ...
Cluster "default" started
Failed to start machine kube-spawn-default-worker-fjxan9: timeout waiting for "kube-spawn-default-worker-fjxan9" to start
Note: `kubeadm init` can take several minutes
master-5y7clq I0630 14:22:29.999557 380 feature_gate.go:230] feature gates: &{map[]}
[init] using Kubernetes version: v1.11.0
[preflight] running pre-flight checks
[WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
[WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
[WARNING FileExisting-crictl]: crictl not found in system path
I0630 14:22:30.050775 380 kernel_validator.go:81] Validating kernel version
I0630 14:22:30.051083 380 kernel_validator.go:96] Validating kernel config
[WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.05.0-ce. Max validated version: 17.03
[WARNING Hostname]: hostname "kube-spawn-default-master-5y7clq" could not be reached
[WARNING Hostname]: hostname "kube-spawn-default-master-5y7clq" lookup kube-spawn-default-master-5y7clq on 8.8.8.8:53: no such host
[preflight/images] Pulling images required for setting up a Kubernetes cluster
[preflight/images] This might take a minute or two, depending on the speed of your internet connection
[preflight/images] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[preflight] Activating the kubelet service
[certificates] Generated ca certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [kube-spawn-default-master-5y7clq kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.22.0.3]
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated sa key and public key.
[certificates] Generated front-proxy-ca certificate and key.
[certificates] Generated front-proxy-client certificate and key.
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [kube-spawn-default-master-5y7clq localhost] and IPs [127.0.0.1 ::1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [kube-spawn-default-master-5y7clq localhost] and IPs [10.22.0.3 127.0.0.1 ::1]
[certificates] Generated etcd/healthcheck-client certificate and key.
[certificates] Generated apiserver-etcd-client certificate and key.
[certificates] valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"
[controlplane] wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
[controlplane] wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[controlplane] wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
[init] waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests"
[init] this might take a minute or longer if the control plane images have to be pulled
[apiclient] All control plane components are healthy after 42.001677 seconds
[uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.11" in namespace kube-system with the configuration for the kubelets in the cluster
[markmaster] Marking the node kube-spawn-default-master-5y7clq as master by adding the label "node-role.kubernetes.io/master=''"
[markmaster] Marking the node kube-spawn-default-master-5y7clq as master by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "kube-spawn-default-master-5y7clq" as an annotation
[bootstraptoken] using token: 1o71nu.v7s48wncryhbdmm7
[bootstraptoken] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstraptoken] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstraptoken] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstraptoken] creating the "cluster-info" ConfigMap in the "kube-public" namespace
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy
Your Kubernetes master has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
You can now join any number of machines by running the following on each node
as root:
kubeadm join 10.22.0.3:6443 --token 1o71nu.v7s48wncryhbdmm7 --discovery-token-ca-cert-hash sha256:c8ac2337adc7ed01725bed7d78605661dc759257fce213838f1cb89486fe263c
I0630 14:23:47.569329 1140 feature_gate.go:230] feature gates: &{map[]}
aaaaaa.bbbbbbbbbbbbbbbb
serviceaccount/weave-net created
clusterrole.rbac.authorization.k8s.io/weave-net created
clusterrolebinding.rbac.authorization.k8s.io/weave-net created
daemonset.extensions/weave-net created
worker-2ujr2f [preflight] running pre-flight checks
[WARNING RequiredIPVSKernelModulesAvailable]: the IPVS proxier will not be used, because the following required kernel modules are not loaded: [ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh] or no builtin kernel ipvs support: map[ip_vs:{} ip_vs_rr:{} ip_vs_wrr:{} ip_vs_sh:{} nf_conntrack_ipv4:{}]
you can solve this problem with following methods:
1. Run 'modprobe -- ' to load missing kernel modules;
2. Provide the missing builtin kernel ipvs support
[WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
[WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
[WARNING FileExisting-crictl]: crictl not found in system path
I0630 14:23:49.919029 449 kernel_validator.go:81] Validating kernel version
I0630 14:23:49.919338 449 kernel_validator.go:96] Validating kernel config
[WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.05.0-ce. Max validated version: 17.03
[WARNING Hostname]: hostname "kube-spawn-default-worker-2ujr2f" could not be reached
[WARNING Hostname]: hostname "kube-spawn-default-worker-2ujr2f" lookup kube-spawn-default-worker-2ujr2f on 8.8.8.8:53: no such host
[discovery] Trying to connect to API Server "10.22.0.3:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.22.0.3:6443"
[discovery] Failed to connect to API Server "10.22.0.3:6443": token id "aaaaaa" is invalid for this cluster or it has expired. Use "kubeadm token create" on the master node to creating a new valid token
[discovery] Trying to connect to API Server "10.22.0.3:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.22.0.3:6443"
[discovery] Cluster info signature and contents are valid and no TLS pinning was specified, will use API Server "10.22.0.3:6443"
[discovery] Successfully established connection with API Server "10.22.0.3:6443"
[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.11" ConfigMap in the kube-system namespace
[kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[preflight] Activating the kubelet service
[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "kube-spawn-default-worker-2ujr2f" as an annotation
This node has joined the cluster:
* Certificate signing request was sent to master and a response
was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the master to see this node join the cluster.
Failed to start cluster: provisioning the worker nodes with kubeadm didn't succeed
More debug info:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube-spawn-default-master-5y7clq Ready master 1m v1.11.0
kube-spawn-default-worker-2ujr2f Ready <none> 46s v1.11.0
$ machinectl
MACHINE CLASS SERVICE OS VERSION ADDRESSES
kube-spawn-default-master-5y7clq container systemd-nspawn flatcar 1814.0.0 10.22.0.3...
kube-spawn-default-worker-2ujr2f container systemd-nspawn flatcar 1814.0.0 10.22.0.2...
2 machines listed.
$ df -h /var/lib/machines
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 10G 1.7G 7.8G 18% /var/lib/machines
The third machine does not exist anymore?
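A few places that might still hold traces of the vanished worker, assuming it was registered with systemd-machined before the timeout (standard systemd tooling; the machine name is the one from the log above):
machinectl list-images                                                     # is the per-node image still present under /var/lib/machines?
sudo journalctl -u systemd-machined --since "1 hour ago"                   # did machined see the machine start and terminate?
sudo journalctl --since "1 hour ago" | grep kube-spawn-default-worker-fjxan9   # any other trace in the host journal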
alban commented
After a second attempt, it works.
arcolife commented
I get this timeout just as @alban described, except it's reproducible every time.
$ kube-spawn start
Warning: kube-proxy could crash due to insufficient nf_conntrack hashsize.
setting nf_conntrack hashsize to 131072...
making iptables FORWARD chain defaults to ACCEPT...
new poolSize to be : 5490739200
Starting 3 nodes in cluster default ...
Waiting for machine kube-spawn-default-worker-naz6fc to start up ...
Waiting for machine kube-spawn-default-master-yz3twq to start up ...
Waiting for machine kube-spawn-default-worker-u5fu6n to start up ...
Failed to start machine kube-spawn-default-master-yz3twq: timeout waiting for "kube-spawn-default-master-yz3twq" to start
Failed to start machine kube-spawn-default-worker-naz6fc: timeout waiting for "kube-spawn-default-worker-naz6fc" to start
Failed to start cluster: starting the cluster didn't succeed
Note:
- I hit the same timeout regardless of whether I destroy the cluster and start again, or mount a freshly formatted btrfs volume and redo everything.
- The first time I launched kube-spawn was on a manually formatted and mounted btrfs volume, and it complained that "machine.raw" was not found. I unmounted and re-ran; systemd-nspawn then did its job and created a machine.raw. When I re-spawned the cluster afterwards it no longer complained about the .raw file, but it still timed out.
- Even though I've been through the troubleshooting.md guide, SELinux has been a pain: I ended up creating about a dozen policies and managing them all with semanage. Not fun.
For debugging, is there any place this thing logs to? (a few guesses are sketched after the environment details below)
- kube-spawn v0.3.0
- FS:
/dev/loop2 btrfs 40G 1.7G 39G 5% /var/lib/machines
OR
/dev/sda4 btrfs 56G 1.7G 54G 4% /var/lib/machines
systemd-container-238-10.git438ac26.fc28.x86_64
qemu-img-2.11.2-4.fc28.x86_64
- machinectl pool limit is 40G with the loopback mount (also visible in the df output above):
# machinectl show
PoolPath=/var/lib/machines
PoolUsage=1866190848
PoolLimit=42949672960
- OS:
Linux 4.18.17-200.fc28.x86_64 GNU/Linux
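On the logging question above, a guess at where to start, assuming the nodes are ordinary systemd-nspawn machines registered with systemd-machined (as the machinectl output earlier in the thread suggests); the machine name is just an example from this run and may no longer exist after a timeout:
machinectl status kube-spawn-default-master-yz3twq        # unit, leader PID and the most recent log lines for a registered machine
sudo journalctl -M kube-spawn-default-master-yz3twq       # read the journal from inside a running machine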
arcolife commented
OK, never mind. All I had to do was:
- export KUBERNETES_VERSION=v1.12.0 (I hadn't set this before the create step earlier)
- kube-spawn destroy
- kube-spawn create (this time it populated /var/lib/kube-spawn/clusters; earlier it had left only an empty trail of subdirectories)
- kube-spawn start
and it works.
arcolife commented
Seems to be related to #325.
Sure, except I didn't destroy it first. I got the timeout from start right after creating the cluster, as per #282 (comment), and then resolved it with #282 (comment).
Apologies if the ordering in step 2 of the resolution comment caused any confusion.
Also, I can't reproduce it now. :/