microk8s join command failed on 3rd node in a 3 node cluster
gautamgadipudi-hpe opened this issue · 2 comments
Build details
We are using a microk8s FIPS package (version 1.28.13).
This snap package was created manually by cloning the 1.28 branch of microk8s and cherry-picking the FIPS commit a559109, then building the snap package with the commands below:
sed -i 's/^KUBE_VERSION=.*/KUBE_VERSION=v1.28.13/' ./build-scripts/components/kubernetes/version.sh
# change version of golang to go-1.21-fips in snap/snapcraft.yaml
sed -i 's/pause:3.7/pause:3.9/g' build-scripts/images.txt
sed -i 's/pause:3.7/pause:3.9/g' microk8s-resources/default-args/containerd-template.toml
sudo SNAPCRAFT_BUILD_ENVIRONMENT=host snapcraft
Summary
We are trying to set up a 3-node cluster (1 controller and 2 master nodes).
The controller node and the first master node are up, but the second master node failed to join the cluster.
root@sc-os-175-node2:/var/log# kubectl get nodes
NAME STATUS ROLES AGE VERSION
sc-os-175-node1.glcpdev.cloud.hpe.com Ready <none> 120m v1.28.13
sc-os-175-node2.glcpdev.cloud.hpe.com Ready <none> 111m v1.28.13
root@sc-os-175-node3:/var/log# microk8s status
microk8s is not running. Use microk8s.inspect for a deeper inspection.
Below are the microk8s.daemon-cluster-agent errors logged on the 3rd node when it tried to join the cluster:
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: 2024/10/01 01:04:18 Applying /var/snap/microk8s/common/etc/launcher/install.yaml
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: Contacting cluster at sc-os-175-node1.glcpdev.cloud.hpe.com
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: Traceback (most recent call last):
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: File "/snap/microk8s/x1/scripts/wrappers/join.py", line 1008, in <module>
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: join(prog_name="microk8s join")
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: File "/snap/microk8s/x1/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: return self.main(*args, **kwargs)
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: File "/snap/microk8s/x1/usr/lib/python3/dist-packages/click/core.py", line 717, in main
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: rv = self.invoke(ctx)
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: File "/snap/microk8s/x1/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: return ctx.invoke(self.callback, **ctx.params)
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: File "/snap/microk8s/x1/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: return callback(*args, **kwargs)
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: File "/snap/microk8s/x1/scripts/wrappers/join.py", line 999, in join
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: join_dqlite(connection_parts, verify, worker)
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: File "/snap/microk8s/x1/scripts/wrappers/join.py", line 688, in join_dqlite
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: join_dqlite_master_node(info, master_ip)
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: File "/snap/microk8s/x1/scripts/wrappers/join.py", line 838, in join_dqlite_master_node
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: update_dqlite(info["cluster_cert"], info["cluster_key"], info["voters"], hostname_override)
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: File "/snap/microk8s/x1/scripts/wrappers/join.py", line 596, in update_dqlite
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: with open("{}/info.yaml".format(cluster_backup_dir)) as f:
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: FileNotFoundError: [Errno 2] No such file or directory: '/var/snap/microk8s/x1/var/kubernetes/backend.backup/info.yaml'
Oct 1 01:04:18 localhost microk8s.daemon-cluster-agent[19310]: 2024/09/30 18:04:18 Failed to apply configuration file /var/snap/microk8s/common/etc/launcher/install.yaml: failed to apply config part 0: failed to join cluster: failed to execute microk8s join command: command [/snap/microk8s/x1/microk8s-join.wrapper sc-os-175-node1.glcpdev.cloud.hpe.com:25000/e3b0c44298fc1c149afbf4c8996fb924] failed with exit code 1: exit status 1
What Should Happen Instead?
Oct 8 08:03:56 sc-os-175-node3 microk8s.daemon-cluster-agent[17622]: 2024/10/08 08:03:56 Applying /var/snap/microk8s/common/etc/launcher/install.yaml
Oct 8 08:04:13 sc-os-175-node3 microk8s.daemon-cluster-agent[18434]: Contacting cluster at sc-os-175-node1.glcpdev.cloud.hpe.com
Oct 8 08:04:22 sc-os-175-node3 microk8s.daemon-cluster-agent[18434]: Waiting for this node to finish joining the cluster. .. .. .. ..
Oct 8 08:04:22 sc-os-175-node3 microk8s.daemon-cluster-agent[17622]: 2024/10/08 08:04:22 Successfully applied /var/snap/microk8s/common/etc/launcher/install.yaml
Oct 8 08:04:26 sc-os-175-node3 systemd[1]: Stopping Service for snap application microk8s.daemon-cluster-agent...
Oct 8 08:04:26 sc-os-175-node3 systemd[1]: snap.microk8s.daemon-cluster-agent.service: Deactivated successfully.
Oct 8 08:04:26 sc-os-175-node3 systemd[1]: Stopped Service for snap application microk8s.daemon-cluster-agent.
Oct 8 08:04:26 sc-os-175-node3 systemd[1]: snap.microk8s.daemon-cluster-agent.service: Consumed 3.709s CPU time.
Reproduction Steps
This seems to be an intermittent issue. We were able to get the cluster up and running in the next iteration.
Introspection Report
inspection-report-20240930_191036.tar.gz
inspection-report-20240930_191723.tar.gz
inspection-report-20240930_191728.tar.gz
Hey @gautamgadipudi-hpe,
Thank you for reporting this issue. Is there a script used for joining/creating the cluster?
I believe the issue might be related to dqlite not having enough time to initialize and update the info.yaml file, which is used to assemble the list of cluster members when joining. We'll try to see if this is reproducible.
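If the timing theory above holds, one possible workaround is to hold off on the join until dqlite has written info.yaml on the joining node. The following is only a minimal sketch of that idea, not a confirmed fix. The helper name `wait_for_file` is hypothetical, and the path in the usage comment is an assumption inferred from the traceback (which reads info.yaml from backend.backup); it may differ on your installation.

```python
import os
import time


def wait_for_file(path, timeout=60.0, interval=1.0):
    """Poll until `path` exists and is non-empty, or `timeout` seconds elapse.

    Returns True if the file showed up in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A zero-byte file is treated as "not written yet".
            if os.stat(path).st_size > 0:
                return True
        except OSError:
            # Missing file (or transient error): keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage sketch (the path is an assumption, not taken from MicroK8s docs):
# ready = wait_for_file(
#     "/var/snap/microk8s/current/var/kubernetes/backend/info.yaml",
#     timeout=120,
# )
```

A check like this could run on the joining node before the join is triggered. Whether a non-empty info.yaml is a sufficient signal that dqlite has finished initializing is not established in this thread.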
We form the cluster with Ansible scripts that use launch configuration files with tokens. We start microk8s on one node at a time and wait for it to stabilize by checking the output of microk8s status --wait-ready and microk8s kubectl wait --for=condition=Ready node --all.
Please let us know if there are any other recommendations.
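For reference, the per-node readiness gate described in this comment can be sketched roughly as follows. This is an illustration rather than the actual Ansible tasks: the function name `node_is_ready`, the injectable `run` parameter, and the 300-second timeout are assumptions made for the example. Only the two commands themselves come from the comment.

```python
import subprocess

# Commands quoted from the comment above, run in this order.
READINESS_COMMANDS = [
    ["microk8s", "status", "--wait-ready"],
    ["microk8s", "kubectl", "wait", "--for=condition=Ready", "node", "--all"],
]


def node_is_ready(run=subprocess.run, timeout=300):
    """Return True only if every readiness command exits with status 0.

    `run` defaults to subprocess.run and can be swapped out for testing.
    `timeout` (seconds) is applied per command by the runner.
    """
    for cmd in READINESS_COMMANDS:
        try:
            result = run(cmd, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            # Command hung past the timeout, or microk8s is not installed.
            return False
        if result.returncode != 0:
            return False
    return True
```

Note that both checks only confirm that the node and the Kubernetes API report ready. They do not confirm the state of the dqlite backend files, so they may not rule out the race suspected earlier in the thread.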