GoogleCloudPlatform/cluster-toolkit

SSH-dependent Packer provisioners not working with `schedmd-v5-slurm` images.

stoksc opened this issue · 5 comments

Describe the bug

When using the image-builder.yaml example, if I supply custom `shell_scripts` arguments to the Packer build (or an Ansible provisioner, or anything else that needs SSH), I get SSH connection failures. If I replace the `schedmd-v5-slurm-22-05-8-hpc-centos-7` image family with something like `ubuntu-2210-amd64`, SSH works fine. If I don't use any SSH-dependent provisioners, the image build also works fine.

Steps to reproduce

Steps to reproduce the behavior:

  1. Generate the image-builder.yaml build files with ghpc.
  2. touch echo.sh and modify packer/custom-image/defaults.auto.pkrvars.hcl to contain `shell_scripts = ["echo.sh"]` (a sketch of the resulting file follows this list).
  3. Follow the docs to build the image.
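
For reference, a minimal sketch of the resulting defaults.auto.pkrvars.hcl. Only `shell_scripts` is the change from the generated defaults; the other variable name shown here is an assumption about what ghpc generates, not an exact value:

```hcl
# defaults.auto.pkrvars.hcl (sketch) -- variable names other than shell_scripts are assumed
source_image_family = "schedmd-v5-slurm-22-05-8-hpc-centos-7" # the failing family; ubuntu-2210-amd64 works
shell_scripts       = ["echo.sh"]                             # any SSH-backed provisioner triggers the failure
```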

Expected behavior

I expect SSH to connect and the shell Packer provisioners to run.

Actual behavior

SSH authentication fails (see the log below), so the provisioners never run.

Version (ghpc --version)

Blueprint

image-builder.yaml

Output and logs

Build 'image-builder-001.googlecompute.toolkit_image' errored after 2 minutes 40 seconds: Packer experienced an authentication error when trying to connect via SSH. This can happen if your username/password are wrong. You may want to double-check your credentials as part of your debugging process. original error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain


Execution environment

  • OS: macOS
  • Shell: zsh
  • go version: go1.20.2 darwin/arm64

Additional context

Before using hpc-toolkit, I was using Packer on its own and could not get Ansible provisioning to work with these slurm images either. I get the same failure from a very minimal Packer + shell_scripts + schedmd-v5-slurm repro, too (and it likewise goes away with plain Ubuntu). Any pointers appreciated. Thanks.

Digging into this, I noticed that when specifying ssh_username=packer in my config, the ssh-keys instance metadata for the packer user does show up in the GCP UI. When I SSHed to the Packer builder VM, the packer user did exist, but sshd couldn't find an authorized_keys file for it (the user had a home dir listed in /etc/passwd, /home/packer, but that directory didn't exist). I tried wait_to_add_ssh_keys in case there was some race, but had no luck. So I decided not to figure the packer-user SSH issue out and just set use_os_login = true instead, and that worked.
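
For concreteness, this is roughly the override that unblocked me (set in defaults.auto.pkrvars.hcl; as far as I can tell it is passed through to the googlecompute builder):

```hcl
# Workaround 1: skip Packer's temporary-key handling and authenticate via OS Login instead.
use_os_login = true
```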

But it bothered me, so I dug through the slurm-gcp repo a bit and saw that its image build uses ssh_clear_authorized_keys = true. Remembering the weirdness about the missing home dir, I tried ssh_username=packer2 (different from the slurm-gcp default), and it worked.
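
Equivalently, as a one-line change in defaults.auto.pkrvars.hcl:

```hcl
# Workaround 2: keep the normal Packer SSH flow but avoid the username already
# baked into the schedmd-v5-slurm image by its own build.
ssh_username = "packer2"   # anything other than "packer" appears to work
```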

I'm pretty deep into 'just try things and see what happens' and pretty far from 'I actually understand Packer', but it feels like there is some bad interaction between chained Packer builds, the way ssh_clear_authorized_keys cleans up after the build user, and how the GCP ssh-keys metadata gets provisioned.
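
To make the suspected interaction concrete, here is a hypothetical sketch of the relevant settings; the option names are real Packer options, but the exact upstream file and mechanism are my guess:

```hcl
# Upstream slurm-gcp image build (sketch): the temporary key is removed at the
# end of the build, but the "packer" account itself survives into the published
# image -- listed in /etc/passwd with a home dir that no longer exists and no
# authorized_keys.
source "googlecompute" "slurm_image" {
  ssh_username              = "packer"
  ssh_clear_authorized_keys = true
}

# Downstream build on top of that image (this issue): reusing ssh_username =
# "packer" hits the stale account, so the key Packer injects via instance
# metadata never becomes usable and authentication fails.
```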

Anyway, changing the default user here and adding a comment in the description explaining why choosing 'packer' is probably a bad idea seems like an improvement. Also, opening an issue in Packer with an even more minimal reproduction might be worthwhile, even if it just becomes a docs ticket. I may do that later.

The SSH issues are interesting and require some internal discussion and investigation.

I am wondering if your use case is blocked on having SSH access. Generally, SSH access can be problematic for several reasons, so the HPC Toolkit is designed to support most use cases without needing SSH by using the startup_script input variable.

For more complex scenarios (staging large amounts of data, running multiple scripts, Ansible playbooks), you can use the custom-image Packer module in concert with the startup-script Terraform module. This stages all data and scripts in a GCS bucket and then provides VM metadata with instructions to download and execute each script, all without needing SSH access. It would entail defining additional runners (around L45 of that example blueprint) for your other scripts and Ansible, and then following the command-line instructions for that example to pass the generated script from the Terraform deployment to the Packer module.
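
A rough sketch of that wiring, expressed directly in Terraform rather than in the blueprint YAML for illustration; the module path, inputs, and output name are assumptions about the startup-script module interface and may differ in your version:

```hcl
# Stage scripts/playbooks in GCS and render a startup script that downloads
# and executes them on the build VM -- no SSH required (sketch only).
module "image_scripts" {
  source          = "github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script"
  project_id      = var.project_id
  deployment_name = var.deployment_name
  region          = var.region

  runners = [
    {
      type        = "shell"
      source      = "echo.sh"
      destination = "echo.sh"
    },
    {
      type        = "ansible-local"
      source      = "playbook.yml"   # hypothetical playbook
      destination = "playbook.yml"
    },
  ]
}

# The rendered script is then passed to the Packer custom-image build, e.g.:
#   packer build -var "startup_script=${module.image_scripts.startup_script}" .
```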

The startup-script module probably solves my issues too, but I have other Packer builds using Ansible over SSH and I didn't want this one to be so special.

As I mentioned in my comment, I've changed the ssh_username to packer2 and unblocked myself. use_os_login also worked. The only thing that doesn't work is using the packer user.

@stoksc, thank you for this report. We have updated the default username in the HPC Toolkit image-builder module to be hpc-toolkit-packer, which sidesteps this issue. I have also alerted the Slurm team to the behavior observed in this issue.

I am going to mark this bug as fixed upon our next release, which will have the updated username. Please follow up if you feel that it has not been addressed.

This fix is now on our develop branch, which is periodically merged into our main branch and tagged with an official release. As each step in the process is taken, the information immediately below the commit description will automatically update with each branch and tag that contains the fix.