k0sproject/k0sctl

Multi controller install fails on OL9

Gusymochis opened this issue · 8 comments

When attempting to install k0s via k0sctl with a multi-controller setup, the installation fails. This doesn't happen when only one node is a controller (or controller+worker) and the rest of the nodes are workers. I have tested with node-local load balancing and without load balancing, and the same issue arises in both cases.

System Information:
os-release

NAME="Oracle Linux Server"
VERSION="9.3"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Oracle Linux Server 9.3"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:9:3:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 9"
ORACLE_BUGZILLA_PRODUCT_VERSION=9.3
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=9.3

kernel: Linux fwd-oracle 5.14.0-362.13.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Dec 21 22:34:57 PST 2023 x86_64 x86_64 x86_64 GNU/Linux

k0sctl config:

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s-cluster
spec:
  hosts:
  - ssh:
      address: 192.168.15.216
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-m1
    role: controller+worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  - ssh:
      address: 192.168.14.186
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-m2
    role: controller+worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  - ssh:
      address: 192.168.15.88
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-m3
    role: controller+worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  - ssh:
      address: 192.168.14.252
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-wq
    role: worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  - ssh:
      address: 192.168.15.131
      user: user
      port: 22
      keyPath: /home/user/.ssh/id_ed25519
    hostname: mc-poc-w2
    role: worker
    uploadBinary: true
    k0sBinaryPath: /usr/local/bin/k0s
    files:
    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075
  k0s:
    version: v1.28.5+k0s.0
    dynamicConfig: false
    config:
      spec:
        network:
          calico:
            mode: vxlan
            overlay: always
            vxlanPort: 4789
            vxlanVNI: 4096
            mtu: 0
            wireguard: true
          clusterDomain: cluster.local
          dualStack: {}
          kubeProxy:
            mode: iptables
          podCIDR: 10.244.0.0/16
          provider: calico
          serviceCIDR: 10.96.0.0/12
          nodeLocalLoadBalancing:
            enabled: true
            type: EnvoyProxy
        telemetry:
          enabled: false
      status: {}

logs:
k0sctl.log

Based on this issue: k0sproject/k0s#3337 (comment)

kke commented

This is probably unrelated, but:

    - src: /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
      dstDir: /var/lib/k0s/images/
      perm: 075

That will make the permissions ---rwxr-x; I suppose it's not a problem when running as root/sudo.
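For reference, here's how Go renders those octal modes (a minimal sketch; that 0755 was the intended value is my assumption):

package main

import (
	"fmt"
	"os"
)

func main() {
	// "075" parsed as octal gives the owner no permission bits at all:
	fmt.Println(os.FileMode(0o075)) // ----rwxr-x
	// the presumably intended value:
	fmt.Println(os.FileMode(0o755)) // -rwxr-xr-x
}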

In the logs, I see these:

time="29 Jan 24 17:35 UTC" level=debug msg="retrying, attempt 8 - last error: command failed: client exec: ssh session wait: Process exited with status 7"
time="29 Jan 24 17:35 UTC" level=debug msg="[ssh] 192.168.14.186:22: executing `curl -kso /dev/null --connect-timeout 20 -w \"%{http_code}\" \"https://localhost:6443/version\"`"

Based on that, it seems the second controller is having a tough time joining the cluster. I'd look into the status of k0s on that node for hints on why. Log into that machine and check the logs:

journalctl -u k0scontroller ...

@kke would it be possible/make sense for k0sctl to do something like this automatically when it sees k0s is not coming up as expected?

I ran it again and noticed that it failed a bit faster, while trying to acquire a lock. On top of that, there were no logs on the nodes because it never reached the step where the node is installed or set up.

https://k0sproject.io/licenses/eula
INFO ==> Running phase: Connect to hosts
INFO [ssh] 192.168.15.216:22: connected
INFO [ssh] 192.168.14.252:22: connected
INFO [ssh] 192.168.15.131:22: connected
INFO [ssh] 192.168.14.186:22: connected
INFO [ssh] 192.168.15.88:22: connected
INFO ==> Running phase: Detect host operating systems
INFO [ssh] 192.168.15.131:22: is running Oracle Linux Server 9.3
INFO [ssh] 192.168.15.216:22: is running Oracle Linux Server 9.3
INFO [ssh] 192.168.14.186:22: is running Oracle Linux Server 9.3
INFO [ssh] 192.168.15.88:22: is running Oracle Linux Server 9.3
INFO [ssh] 192.168.14.252:22: is running Oracle Linux Server 9.3
INFO ==> Running phase: Acquire exclusive host lock
INFO ==> Running phase: Prepare hosts
INFO ==> Running phase: Gather host facts
INFO [ssh] 192.168.14.186:22: using mc-poc-m2 from configuration as hostname
INFO [ssh] 192.168.15.216:22: using mc-poc-m1 from configuration as hostname
INFO [ssh] 192.168.14.252:22: using mc-poc-wq from configuration as hostname
INFO [ssh] 192.168.15.131:22: using mc-poc-w2 from configuration as hostname
INFO [ssh] 192.168.15.88:22: using mc-poc-m3 from configuration as hostname
INFO [ssh] 192.168.14.186:22: discovered eth0 as private interface
INFO [ssh] 192.168.15.216:22: discovered eth0 as private interface
INFO [ssh] 192.168.15.131:22: discovered eth0 as private interface
INFO [ssh] 192.168.14.252:22: discovered eth0 as private interface
INFO [ssh] 192.168.15.88:22: discovered eth0 as private interface
INFO ==> Running phase: Validate hosts
INFO ==> Running phase: Gather k0s facts
INFO [ssh] 192.168.15.216:22: found existing configuration
INFO [ssh] 192.168.14.186:22: found existing configuration
INFO [ssh] 192.168.15.88:22: found existing configuration
INFO ==> Running phase: Validate facts
INFO ==> Running phase: Upload files to hosts
INFO [ssh] 192.168.15.131:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.15.216:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.14.186:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.15.88:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.14.252:22: uploading /var/lib/k0s/images/k0s-airgap-bundle-v1.28.5.tar
INFO [ssh] 192.168.15.131:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.14.186:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.15.216:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.14.252:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.15.88:22: file already exists and hasn't been changed, skipping upload
INFO [ssh] 192.168.15.216:22: validating configuration
INFO [ssh] 192.168.14.186:22: validating configuration
INFO [ssh] 192.168.15.88:22: validating configuration
INFO ==> Running phase: Initialize the k0s cluster
INFO [ssh] 192.168.15.216:22: installing k0s controller
INFO * Running clean-up for phase: Acquire exclusive host lock
INFO * Running clean-up for phase: Initialize the k0s cluster
INFO [ssh] 192.168.15.216:22: cleaning up
INFO ==> Apply failed
kke commented

@kke would it be possible/make sense for k0sctl to do something like this automatically when it sees k0s is not coming up as expected?

Hmm, interesting idea, so it would try to dig up some diagnostics logs on failure 🤔 That could be handy.

kke commented
INFO ==> Running phase: Initialize the k0s cluster
INFO [ssh] 192.168.15.216:22: installing k0s controller
INFO * Running clean-up for phase: Acquire exclusive host lock
INFO * Running clean-up for phase: Initialize the k0s cluster
INFO [ssh] 192.168.15.216:22: cleaning up
INFO ==> Apply failed

No error displayed? That's not nice.

The lock file is just for avoiding two instances of k0sctl operating at the same time; maybe it should be quieter about it. The actual problem is somewhere else.
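(For context, this kind of mutual exclusion is conceptually just a lock file created with O_EXCL; a minimal sketch, not k0sctl's actual implementation, with a made-up lock path:)

package main

import (
	"fmt"
	"os"
)

// acquireLock is a hypothetical helper: creating the file with O_EXCL
// fails if another process already holds the lock.
func acquireLock(path string) (release func(), err error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, fmt.Errorf("another instance may be running: %w", err)
	}
	f.Close()
	return func() { os.Remove(path) }, nil
}

func main() {
	release, err := acquireLock("/tmp/k0sctl.lock") // hypothetical path
	if err != nil {
		fmt.Println(err)
		return
	}
	defer release()
	fmt.Println("lock acquired, safe to operate")
}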

New discovery! I copied the install command from the logs and ran it standalone (without escaping) on the server the logs specified, and it causes a nil pointer panic.

log line:

time="31 Jan 24 19:16 UTC" level=debug msg="[ssh] 192.168.15.216:22: executing `sudo -s -- /usr/local/bin/k0s install controller --data-dir=/var/lib/k0s --enable-worker --config \"/etc/k0s/k0s.yaml\" --kubelet-extra-args=\"--hostname-override=mc-poc-m1\"`"

The current content of /etc/k0s/k0s.yaml is the following:

apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
spec:
  api:
    address: 192.168.15.216
    sans:
    - 192.168.15.216
    - 192.168.14.186
    - 192.168.15.88
    - 127.0.0.1
  controllerManager: {}
  extensions: null
  installConfig: null
  konnectivity:
    adminPort: 8133
    agentPort: 8132
  network:
    calico:
      mode: vxlan
      mtu: 0
      overlay: always
      vxlanPort: 4789
      vxlanVNI: 4096
      wireguard: true
    clusterDomain: cluster.local
    dualStack: {}
    kubeProxy:
      mode: iptables
    podCIDR: 10.244.0.0/16
    provider: calico
    serviceCIDR: 10.96.0.0/12
  podSecurityPolicy:
    defaultPolicy: 00-k0s-privileged
  scheduler: {}
  telemetry:
    enabled: false
status: {}

command without escaping:
sudo -s -- /usr/local/bin/k0s install controller --data-dir=/var/lib/k0s --enable-worker --config /etc/k0s/k0s.yaml --kubelet-extra-args="--hostname-override=mc-poc-m1"

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2b2c179]

goroutine 1 [running]:
github.com/k0sproject/k0s/pkg/install.CreateControllerUsers(0x6b?, 0xc0003b7680)
	/go/src/github.com/k0sproject/k0s/pkg/install/users.go:41 +0x39
github.com/k0sproject/k0s/cmd/install.(*command).setup(0xc001345ce0, {0x3935e03, 0xa}, {0xc00139eeb0, 0x5, 0x5}, 0xc000673e40)
	/go/src/github.com/k0sproject/k0s/cmd/install/install.go:68 +0xca
github.com/k0sproject/k0s/cmd/install.installControllerCmd.func1(0xc00127d500, {0x3915737?, 0x5?, 0x5?})
	/go/src/github.com/k0sproject/k0s/cmd/install/controller.go:62 +0x197
github.com/spf13/cobra.(*Command).execute(0xc00127d500, {0xc00139e0a0, 0x5, 0x5})
	/run/k0s-build/go/mod/github.com/spf13/cobra@v1.7.0/command.go:940 +0x862
github.com/spf13/cobra.(*Command).ExecuteC(0xc001268300)
	/run/k0s-build/go/mod/github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	/run/k0s-build/go/mod/github.com/spf13/cobra@v1.7.0/command.go:992
github.com/k0sproject/k0s/cmd.Execute()
	/go/src/github.com/k0sproject/k0s/cmd/root.go:194 +0x1e
main.main()
	/go/src/github.com/k0sproject/k0s/main.go:43 +0x225

I think this is great as we are not running blind anymore.

Great news! I was able to solve the issues. The panic above is due to missing validation while processing the YAML configuration: basically, installConfig: null causes the parsed users list to be nil. A validation here could probably solve this by re-assigning the default users when the value is nil.
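To illustrate the failure mode: an explicit null in YAML leaves the corresponding Go pointer nil, so a later dereference panics unless a defaulting guard runs first. A minimal sketch with hypothetical, simplified struct shapes (the real types live in the k0s source tree):

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Simplified stand-ins for the real k0s config types.
type SystemUser struct {
	EtcdUser string `yaml:"etcdUser"`
}

type InstallSpec struct {
	SystemUsers *SystemUser `yaml:"users"`
}

type ClusterSpec struct {
	InstallConfig *InstallSpec `yaml:"installConfig"`
}

func main() {
	var spec ClusterSpec
	// An explicit `installConfig: null` leaves the pointer nil, so any
	// unguarded spec.InstallConfig.SystemUsers access would panic.
	if err := yaml.Unmarshal([]byte("installConfig: null"), &spec); err != nil {
		panic(err)
	}

	// A defaulting guard like this before use avoids the panic:
	if spec.InstallConfig == nil {
		spec.InstallConfig = &InstallSpec{}
	}
	if spec.InstallConfig.SystemUsers == nil {
		spec.InstallConfig.SystemUsers = &SystemUser{EtcdUser: "etcd"}
	}
	fmt.Println(spec.InstallConfig.SystemUsers.EtcdUser) // etcd
}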

After this I faced another issue with etcd: I noticed that the systemd service was using the OS hostname and not the hostname from the configuration:

etcd --peer-trusted-ca-file=/var/lib/k0s/pki/etcd/ca.crt --peer-key-file=/var/lib/k0s/pki/etcd/peer.key --log-level=info --peer-client-cert-auth=true --enable-pprof=false --name=fwd-oracle

The issue here is that all nodes had the same hostname, and this was breaking etcd. This also explains why it worked with no issues when running one controller + multiple workers. As a validation for k0sctl, I'd suggest a check that no two hosts use the same hostname, since duplicates break etcd.
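A duplicate-hostname check of the suggested kind could look roughly like this (a minimal sketch with made-up names, not k0sctl's actual code):

package main

import "fmt"

// validateUniqueHostnames is a hypothetical helper that rejects host
// lists where two hosts would end up with the same name.
func validateUniqueHostnames(hostnames []string) error {
	seen := map[string]bool{}
	for _, h := range hostnames {
		if seen[h] {
			return fmt.Errorf("multiple hosts share the hostname %q; etcd peer names must be unique", h)
		}
		seen[h] = true
	}
	return nil
}

func main() {
	// All controllers reporting the OS hostname "fwd-oracle" trips the check.
	fmt.Println(validateUniqueHostnames([]string{"fwd-oracle", "fwd-oracle", "fwd-oracle"}))
}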

I'm also using the hostname value in the config, but I'm not sure whether it is used internally by k0s or actually sets the hostname in the OS.

Please feel free to close this issue or keep it open to track the validations.

kke commented

The panic above is due to missing validation while processing the YAML configuration: basically, installConfig: null causes the parsed users list to be nil. A validation here could probably solve this by re-assigning the default users when the value is nil.

That should already be happening here - I haven't figured out yet why it isn't.

As a validation for k0sctl, I'd suggest a check that no two hosts use the same hostname, since duplicates break etcd.

That should be validated already:

https://github.com/k0sproject/k0sctl/blob/main/phase/validate_hosts.go#L54-L60

I'm also using this value in the config but I'm not sure if the value is being used internally in k0s or it is setting the hostname in the OS.

That is only used as --kubelet-extra-args="--hostname-override=<hostname>" when installing k0s (and as the hostname when querying node status). It is not set on the OS.

K0s does not look at that when starting etcd but will always use os.Hostname().
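A tiny sketch of the distinction (illustrative only, with a made-up override value):

package main

import (
	"fmt"
	"os"
)

func main() {
	// The configured hostname only reaches the kubelet as an extra arg:
	override := "mc-poc-m1"
	fmt.Printf("kubelet: --hostname-override=%s\n", override)

	// etcd's --name, by contrast, comes from the OS, so identical
	// machine hostnames collide as etcd peer names.
	osName, _ := os.Hostname()
	fmt.Println("etcd --name:", osName)
}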

It looks like you found two k0s bugs 🥇