rancher/rancher

vSphere cluster: Waiting for SSH to be available (k3os & Fedora CoreOS)

Closed this issue · 17 comments

What kind of request is this (question/bug/enhancement/feature request):
Question

Steps to reproduce (least amount of steps as possible):
Create a vSphere-provisioned cluster with a k3os or Fedora CoreOS node template.

Result:
Cloud-init never gets applied and SSH times out after 60 tries. It looks similar to #24948. The only OS that I got it working with was CentOS 7. The deploy method is: "Deploy from Datacenter". Can anyone clarify this?

Other details that may be helpful:

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.3.6
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Infrastructure Provider (vSphere)
  • Machine type (cloud/VM/metal) and specifications (CPU/memory):
  • Kubernetes version (use kubectl version):
  • Docker version (use docker version): 19.03

@stayfrostnl did you get anywhere with this? I'm having the exact same issue with an Ubuntu 20 template. It creates the VM, gets it online, and then sits at the 'Waiting for SSH to be available' state before giving up.

@chrispage1 Yes and no :) I got it working with Ubuntu 18.04 as well, so maybe it works for Ubuntu 20. Rancher is using the NoCloud datasource when creating these VMs, see https://cloudinit.readthedocs.io/en/latest/topics/datasources/nocloud.html. So in your case you will need to install cloud-init inside your template. I also removed the following folder, because Ubuntu uses it during its initial installation:

/var/lib/cloud/instances

I haven't figured out how to fix this for Fedora and k3os though, but to me it seems that it's not working because of the use of the NoCloud datasource.
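
For reference, the template prep I'm describing boils down to something like this (a rough sketch for an Ubuntu 18.04 template; package names and paths may differ on your image):

# Install cloud-init so the NoCloud seed ISO attached by Rancher can be consumed
sudo apt-get update && sudo apt-get install -y cloud-init
# Clear the per-instance state left behind by the Ubuntu installer, so a clone
# treats the Rancher-provided user-data as a fresh instance on first boot
sudo rm -rf /var/lib/cloud/instances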

@stayfrostnl Glad you got there-ish! I found that removing /var/lib/cloud/instances helped a huge amount. However, on Ubuntu 20 the Docker installation ended up failing.

I'm considering sticking with RancherOS nodes; they are much less traumatic to get running and are, in theory, optimised for the job...

@chrispage1 You could probably fix that by building the template with Packer. For Ubuntu to work we also removed the machine-id because of duplicate IPs; we found this solution on the internet:

# Remove machine-id information to prevent duplicate IPs
rm /etc/machine-id
rm /var/lib/dbus/machine-id
truncate -s 0 /etc/machine-id
ln -s /etc/machine-id /var/lib/dbus/machine-id

But for now we are also sticking to RancherOS.

Same behaviour for RancherOS 1.5.5.
I'm curious - why can't it just use the guestinfo.cloud-init.config.data VM property to pass the cloud-init config?

Related issue/comment - #24948 (comment)

@superseb Maybe you know the reason?

I've been fighting this issue on vSphere 6.7 for a while now. Apparently VMware started using cloud-init for guest customization as well as the old Perl-based method, and it is breaking other use cases for cloud-init. I am trying to trace whether it is related to a specific release of open-vm-tools. So far, having a machine version of 6.5 isn't the magic bullet. I have an Ubuntu 18.04 template that Rancher can deploy, and a freshly built CentOS and Ubuntu that fail to pick up the NoCloud ISO and run with it. There is an existing KB from VMware on this. ;-( https://kb.vmware.com/s/article/54986
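
If it helps anyone, the knob that KB revolves around is (as far as I remember) the disable_vmware_customization flag in /etc/cloud/cloud.cfg; a hedged sketch of checking and toggling it, to be verified against the KB for your open-vm-tools version:

# Show the current setting (it may be absent on some images)
grep -n disable_vmware_customization /etc/cloud/cloud.cfg
# true  = open-vm-tools keeps using the traditional (Perl) customization path
# false = guest customization is handed over to cloud-init
sudo sed -i 's/disable_vmware_customization: false/disable_vmware_customization: true/' /etc/cloud/cloud.cfg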

@AntonSmolkov The reason we aren't using guestinfo.cloud-init.config.data is that it requires specific configuration of the installed cloud-init to support it, versus using NoCloud, which many stock cloud-init installs come with. This means our users and customers who are prepping their own templates/base VMs for VMware can simply yum/apt-get install cloud-init, versus having to mess around with specific config.data support.
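
To illustrate the point, on a stock template the prep is usually just the package itself (a rough sketch; the Debian/Ubuntu case is shown, swap in yum for the RHEL family):

# NoCloud is a built-in datasource of stock cloud-init, so this is normally all
# the template needs for the seed ISO attached by Rancher to be picked up
sudo apt-get install -y cloud-init
# Optional sanity check: see whether the image restricts the datasource list
grep -rn datasource_list /etc/cloud/cloud.cfg /etc/cloud/cloud.cfg.d/ || true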

I can confirm this problem with:

  • rancher 2.4.5
  • ubuntu 18.04
  • vsphere 6.7u3

Thanks to @chrispage1 and @stayfrostnl for the workaround:

I created an executable file /usr/local/bin/cleanup-cloud-init.sh on the VM template used to create nodes:

#!/bin/bash

rm -fr /var/lib/cloud/instances/
rm -fr /var/lib/cloud/instance 
rm -f /etc/machine-id /var/lib/dbus/machine-id
truncate -s 0 /etc/machine-id
ln -s /etc/machine-id /var/lib/dbus/machine-id

This script must be executed prior to powering off the VM template (every time the template is eventually powered on for updates).
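
To avoid forgetting that step, one option is to hook the script into shutdown with a oneshot systemd unit; a rough, untested sketch (unit name and path are my own choice):

# Register the cleanup to run automatically when the template powers off:
# ExecStop fires when the service is stopped, i.e. during shutdown
sudo tee /etc/systemd/system/cleanup-cloud-init.service > /dev/null <<'EOF'
[Unit]
Description=Clean cloud-init and machine-id state when the template powers off

[Service]
Type=oneshot
RemainAfterExit=true
ExecStop=/usr/local/bin/cleanup-cloud-init.sh

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now cleanup-cloud-init.service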

I am having trouble getting an 18.04 template working. It's a brand-new image with SSH installed and the above script run, and it still times out waiting for SSH in Rancher. Do I need to do anything else to get the template ready for Rancher to connect to? I have not configured cloud-init, if that is something that is required.

I'm receiving the same error using K3S. It appears as though the NodeTemplate is passing a username of "docker" with a password of "tcuser". This error shows on the deployed nodes in /var/log/auth.log.

The username and password are displayed in the API when the ssh times out:
"sshPassword": "tcuser",
"sshPort": "22",
"sshUser": "docker",
"sshUserGroup": "staff",

You can view this info via View in API. On the API page, click the NodeTemplateID link.

The only place I can find this user/password combination is in:
https://github.com/rancher/rancher/blob/master/tests/validation/tests/v3_api/test_vmwarevsphere_driver.py

I tried adding the user/password with the staff group to the VM template but it still errors out when deploying the cluster.

I can manually ssh to the target node from a K3S node using the docker/tcuser account. The deployment still fails.

Rancher: 2.4.5
Containerd Version: 1.3.3-k3s2
Kubelet Version: v1.18.8+k3s1
OS: Ubuntu 18.04.5 LTS

@carterminearIMT I did nothing else to make it work

Did you follow all the steps in https://rancher.com/docs/rancher/v2.x/en/cluster-provisioning/rke-clusters/node-pools/vsphere/provisioning-vsphere-clusters/ ?
In particular, enabling disk UUIDs.
I also set the hardware level of the VM to 15, but this was a requirement for the CSI storage provider.
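
For what it's worth, both of those settings can also be applied from the CLI with govc instead of the vSphere UI (a sketch; assumes govc is installed, the GOVC_* environment variables point at your vCenter, and the template is named ubuntu-template):

# Enable disk UUIDs, as required by the Rancher vSphere provisioning docs
govc vm.change -vm ubuntu-template -e disk.enableUUID=TRUE
# Upgrade the virtual hardware to version 15 (needed here for the CSI provider)
govc vm.upgrade -vm ubuntu-template -version 15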

Our cloud-init config is:

#cloud-config
timezone: "Europe/Rome"
packages:
  - ntp
ssh_authorized_keys:
  - ssh-rsa AAA... our-key

But I cannot see anything relevant; the SSH keys should be merged with the one injected by Rancher.
Maybe you can try using a cloud-init config and see if it makes a difference.

@VegasJK The docker/tcuser thing is an old boot2docker username/password: https://github.com/docker/machine/blob/master/drivers/vmwarevsphere/vsphere.go#L42-L44. Those fields in the API come from us leaving those machine CLI args in place. The clone process will use the username and group, but then we generate a key and inject that into the cloud-init here: https://github.com/rancher/machine/blob/master/drivers/vmwarevsphere/cloudinit.go#L250

For anyone having trouble with SSH connectivity: if you have access to the machine, you need to debug cloud-init to see why the key wasn't injected. To do that I usually check the cloud-init logs and add additional logging with logcfg. I've used this to go deep into Python code to fix the driver while it was in development, so you should be able to get some good information there to find out why the key wasn't added properly.
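
As a rough starting point for that kind of debugging (these are the usual Ubuntu paths; adjust for your distro):

# Did cloud-init run at all, and did any stage fail?
cloud-init status --long
# Look for datasource detection and SSH key handling in the logs
sudo grep -iE 'datasource|ssh|authorized' /var/log/cloud-init.log
# Inspect the user-data this boot actually consumed
sudo cat /var/lib/cloud/instance/user-data.txt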

Relevant cloud-init Links:

If you're using Ubuntu, the easiest way to use the clone feature is to clone from a template made from the OVA here: http://cloud-images.ubuntu.com/bionic/current/

If anyone on this thread is using another OS, @David-VTUK has put together this repo (https://github.com/David-VTUK/Rancher-Packer) to build Rancher clone-ready images, and I can't recommend it highly enough.

We walk people through SSH issues a lot in the #vsphere channel in Rancher Users Slack (https://slack.rancher.io/). Come join us there and post what you're seeing, and we can debug in real time, as some of what I'm reading here is situational and we might need a lot more information to fully get to the bottom of it.

Just logged this issue to support Fedora CoreOS. #28846

Any workaround to make it work with Fedora CoreOS?

@Amos-85

It's going to be similar for Fedora, I think, but on Ubuntu you need to fix your datasource list:

/etc/cloud/cloud.cfg.d/99-installer.cfg

needs the NoCloud datasource:

datasource_list: [ "NoCloud", "VMwareGuestInfo" ]

For Fedora, I'd guess there's a similar 99-thing.cfg, or you might need to create one.

By the way, for anyone out there banging your head against the wall: this is for Ubuntu 20 or 18.
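
After changing the datasource list, a quick way to re-test on the template itself before converting/cloning (a sketch; cloud-init clean wipes its state so the next boot runs as a fresh instance):

# Reset cloud-init so the next boot re-runs datasource detection
sudo cloud-init clean --logs
sudo shutdown -h now
# ...then on a clone that has booted, confirm which datasource was actually used
cloud-id
cloud-init status --long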

stale commented

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@m-ferrero's and @stayfrostnl's solution has helped me fix the duplicate IPs for the nodes after Rancher created the cluster. Thank you both.
But I still see "Waiting for SSH to be available...", and then the VMs get killed and new nodes are spun up.

I used the ubuntu-22.10-live-server-amd64 image. Rancher version is 2.6. Kubernetes version is v1.24.4-rancher1-1.

I have done the following on the VM, and I can SSH into it before converting it into a template in vCenter:
sudo apt-get update
sudo apt-get install net-tools openssh-server
sudo systemctl enable ssh
sudo ufw allow ssh
sudo systemctl start ssh