canonical/microk8s

microk8s join command failed on 3rd node in a 3 node cluster

gautamgadipudi-hpe opened this issue · 2 comments

Build details

We are using a microk8s FIPS package (version 1.28.13).
This snap package was created manually by cloning the 1.28 branch of microk8s and cherry-picking the FIPS commit a559109, then building the snap with the commands below:

sed -i 's/^KUBE_VERSION=.*/KUBE_VERSION=v1.28.13/' ./build-scripts/components/kubernetes/version.sh

# change version of golang to go-1.21-fips in snap/snapcraft.yaml

sed -i 's/pause:3.7/pause:3.9/g' build-scripts/images.txt
sed -i 's/pause:3.7/pause:3.9/g' microk8s-resources/default-args/containerd-template.toml
sudo SNAPCRAFT_BUILD_ENVIRONMENT=host snapcraft

Summary

We are trying to set up a 3-node cluster (1 controller and 2 master nodes).

The controller node and the first master node are up, but the second master node failed to join the cluster.

root@sc-os-175-node2:/var/log# kubectl get nodes
NAME                                    STATUS   ROLES    AGE    VERSION
sc-os-175-node1.glcpdev.cloud.hpe.com   Ready    <none>   120m   v1.28.13
sc-os-175-node2.glcpdev.cloud.hpe.com   Ready    <none>   111m   v1.28.13
root@sc-os-175-node3:/var/log# microk8s status
microk8s is not running. Use microk8s.inspect for a deeper inspection.

Below are the microk8s.daemon-cluster-agent errors on the 3rd node when it tried to join the cluster.

Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: 2024/10/01 01:04:18 Applying /var/snap/microk8s/common/etc/launcher/install.yaml
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: Contacting cluster at sc-os-175-node1.glcpdev.cloud.hpe.com
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: Traceback (most recent call last):
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:   File "/snap/microk8s/x1/scripts/wrappers/join.py", line 1008, in <module>
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:     join(prog_name="microk8s join")
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:   File "/snap/microk8s/x1/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:     return self.main(*args, **kwargs)
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:   File "/snap/microk8s/x1/usr/lib/python3/dist-packages/click/core.py", line 717, in main
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:     rv = self.invoke(ctx)
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:   File "/snap/microk8s/x1/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:     return ctx.invoke(self.callback, **ctx.params)
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:   File "/snap/microk8s/x1/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:     return callback(*args, **kwargs)
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:   File "/snap/microk8s/x1/scripts/wrappers/join.py", line 999, in join
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:     join_dqlite(connection_parts, verify, worker)
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:   File "/snap/microk8s/x1/scripts/wrappers/join.py", line 688, in join_dqlite
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:     join_dqlite_master_node(info, master_ip)
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:   File "/snap/microk8s/x1/scripts/wrappers/join.py", line 838, in join_dqlite_master_node
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:     update_dqlite(info["cluster_cert"], info["cluster_key"], info["voters"], hostname_override)
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:   File "/snap/microk8s/x1/scripts/wrappers/join.py", line 596, in update_dqlite
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]:     with open("{}/info.yaml".format(cluster_backup_dir)) as f:
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[20794]: FileNotFoundError: [Errno 2] No such file or directory: '/var/snap/microk8s/x1/var/kubernetes/backend.backup/info.yaml'
Oct  1 01:04:18 localhost microk8s.daemon-cluster-agent[19310]: 2024/09/30 18:04:18 Failed to apply configuration file /var/snap/microk8s/common/etc/launcher/install.yaml: failed to apply config part 0: failed to join cluster: failed to execute microk8s join command: command [/snap/microk8s/x1/microk8s-join.wrapper sc-os-175-node1.glcpdev.cloud.hpe.com:25000/e3b0c44298fc1c149afbf4c8996fb924] failed with exit code 1: exit status 1

What Should Happen Instead?

Oct  8 08:03:56 sc-os-175-node3 microk8s.daemon-cluster-agent[17622]: 2024/10/08 08:03:56 Applying /var/snap/microk8s/common/etc/launcher/install.yaml
Oct  8 08:04:13 sc-os-175-node3 microk8s.daemon-cluster-agent[18434]: Contacting cluster at sc-os-175-node1.glcpdev.cloud.hpe.com
Oct  8 08:04:22 sc-os-175-node3 microk8s.daemon-cluster-agent[18434]: Waiting for this node to finish joining the cluster. .. .. .. ..
Oct  8 08:04:22 sc-os-175-node3 microk8s.daemon-cluster-agent[17622]: 2024/10/08 08:04:22 Successfully applied /var/snap/microk8s/common/etc/launcher/install.yaml
Oct  8 08:04:26 sc-os-175-node3 systemd[1]: Stopping Service for snap application microk8s.daemon-cluster-agent...
Oct  8 08:04:26 sc-os-175-node3 systemd[1]: snap.microk8s.daemon-cluster-agent.service: Deactivated successfully.
Oct  8 08:04:26 sc-os-175-node3 systemd[1]: Stopped Service for snap application microk8s.daemon-cluster-agent.
Oct  8 08:04:26 sc-os-175-node3 systemd[1]: snap.microk8s.daemon-cluster-agent.service: Consumed 3.709s CPU time.

Reproduction Steps

This seems to be an intermittent issue. We were able to get the cluster up and running on the next iteration.

Introspection Report

inspection-report-20240930_191036.tar.gz
inspection-report-20240930_191723.tar.gz
inspection-report-20240930_191728.tar.gz

Hey @gautamgadipudi-hpe,

Thank you for reporting this issue. Is there a script used for joining/creating the cluster?

I believe the issue might be related to dqlite not having enough time to initialize and update the info.yaml file, which is used to assemble the list of cluster members when joining. We'll try to see if this is reproducible.
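
If that hypothesis holds, one possible workaround on the joining node is to wait for dqlite's info.yaml to appear before triggering the join. The helper below is only a sketch of that idea: `wait_for_file` is a hypothetical name, and the path in the usage note is an assumption derived from the traceback by dropping the `.backup` suffix and using the `current` revision link; it has not been confirmed as the fix.

```python
import os
import time


def wait_for_file(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists and is non-empty, or `timeout` seconds pass.

    Returns True if the file showed up in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A zero-length file is treated as "not written yet".
            if os.path.getsize(path) > 0:
                return True
        except OSError:
            # File (or a parent directory) does not exist yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, a pre-join step could call `wait_for_file("/var/snap/microk8s/current/var/kubernetes/backend/info.yaml", timeout=120)` (assumed path) and only proceed with the join when it returns True.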

Our Ansible scripts use launch configuration files with tokens to form the cluster. We start microk8s on one node at a time and wait for it to stabilize by checking the output of microk8s status --wait-ready and microk8s kubectl wait --for=condition=Ready node --all.
Please let us know if there are any other recommendations.
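
For reference, the per-node readiness gate described above can be sketched roughly as follows. This is an illustration, not the actual Ansible logic: `node_is_stable` and `READY_CHECKS` are hypothetical names, the 300-second timeout is an arbitrary assumption, and only the two commands quoted in the comment are used.

```python
import subprocess

# Readiness commands quoted from the comment above.
READY_CHECKS = [
    ["microk8s", "status", "--wait-ready"],
    ["microk8s", "kubectl", "wait", "--for=condition=Ready", "node", "--all"],
]


def node_is_stable(run=subprocess.run, timeout=300):
    """Return True only if every readiness check exits with code 0.

    `run` is injectable so the logic can be exercised without microk8s.
    """
    for cmd in READY_CHECKS:
        try:
            result = run(cmd, timeout=timeout)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            # Command hung past the timeout or microk8s is not installed.
            return False
        if result.returncode != 0:
            return False
    return True
```

A driver would call `node_is_stable()` on each node after starting microk8s and move on to the next node only when it returns True.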