SSH-dependent Packer provisioners not working with `schedmd-v5-slurm` images.
stoksc opened this issue · 5 comments
Describe the bug
When using `image-builder.yaml`, if I supply custom `shell_scripts` arguments to `packer build` (or an Ansible conf, or anything that needs SSH), I get SSH connection failures. If I replace the `schedmd-v5-slurm-22-05-8-hpc-centos-7` image family with something like `ubuntu-2210-amd64`, SSH works fine. If I don't use SSH-dependent commands, the image build also works fine.
Steps to reproduce
Steps to reproduce the behavior:
- Generate the `image-builder.yaml` build files with `ghpc`.
- `touch echo.sh` and modify `packer/custom-image/defaults.auto.pkrvars.hcl` to contain `shell_scripts = ["echo.sh"]`.
- Follow the docs to build the image.
Expected behavior
I expect SSH to work and run the `shell` Packer provisioners.
Actual behavior
SSH does not work.
Version (`ghpc --version`)
Blueprint
Output and logs
Build 'image-builder-001.googlecompute.toolkit_image' errored after 2 minutes 40 seconds: Packer experienced an authentication error when trying to connect via SSH. This can happen if your username/password are wrong. You may want to double-check your credentials as part of your debugging process. original error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
Execution environment
- OS: [macOS]
- Shell: [zsh]
- go version: `go version go1.20.2 darwin/arm64`
Additional context
Before using `hpc-toolkit`, I was just using Packer independently but could not get Ansible provisioning + these slurm images to work. I get the same on a very minimal Packer+shell_scripts+schedmd-v5-slurm repo, too (where it also just goes away with plain ubuntu). Any pointers appreciated. Thanks.
Digging into this, I noticed when specifying `ssh_username=packer` in my config that I see GCP instance metadata `ssh-keys` for the `packer` user in the GCP UI. When I ssh to the Packer builder, the `packer` user did exist but sshd couldn't find the authorized_keys file for this user (the user had a home dir listed in `/etc/passwd`, `/home/packer`, but it didn't exist). I gave `wait_to_add_ssh_keys` a try in case there was some race, but had no luck. So, I decided to not figure this packer-ssh stuff out and just use `use_os_login = true` instead, and it worked.
But it bothered me, so I dug through the `slurm-gcp` repo a bit and saw it is using `ssh_clear_authorized_keys = true`, and I remembered the weirdness about having no home dir, so I tried `ssh_username=packer2` (different from the `slurm-gcp` default) and it worked.
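For anyone landing here, the two workarounds above could look roughly like this in the variables file from the reproduction steps. This is a minimal sketch, not the generated file: it assumes the `custom-image` module accepts `ssh_username` and `use_os_login` under exactly these names (as used in this thread), and it omits every other value `ghpc` writes.

```hcl
# packer/custom-image/defaults.auto.pkrvars.hcl (sketch; other generated values omitted)

# Script from the reproduction steps; this is what forces Packer to use SSH.
shell_scripts = ["echo.sh"]

# Workaround 1: any SSH username other than "packer", the name already used
# when the slurm-gcp base image was built. "packer2" is the value tried in
# this thread; nothing about that particular name is special.
ssh_username = "packer2"

# Workaround 2 (alternative to the above): use OS Login rather than
# instance-metadata ssh-keys for the temporary build key.
# use_os_login = true
```

Only one of the two should be needed; both were reported to work independently.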
I'm pretty deep into 'just try things and see what happens' and pretty far from 'I actually understand Packer', but it feels like there is some weirdness with chained Packer builds and the way `ssh_clear_authorized_keys` cleans up the user and how GCP `ssh-keys` metadata gets provisioned.
Anyway, changing the default user here and adding a comment in the description explaining why choosing 'packer' is probably a bad idea seems like an improvement. Also, opening an issue in Packer with an even more minimal reproduction might be nice, even if it just becomes a docs ticket. I may do that later.
The SSH issues are interesting and require some internal discussion and investigation.
I am wondering if your use case is blocked on having SSH access. Generally SSH access can be problematic for several reasons and so the HPC Toolkit is designed to support most use cases without needing SSH by using the `startup_script` input variable.
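As a rough illustration of the SSH-free route, the same variables file could carry the script body directly. This is a sketch under assumptions: it assumes `startup_script` takes the script contents as a string, and the script itself (including the marker file path) is a made-up placeholder rather than anything from the example.

```hcl
# packer/custom-image/defaults.auto.pkrvars.hcl (sketch; other generated values omitted)

# Assumption: startup_script is a string holding the whole script, run by the
# build VM at boot instead of being pushed over SSH by a Packer provisioner.
startup_script = <<-EOT
  #!/bin/bash
  set -e
  # Hypothetical placeholder for real provisioning work.
  echo "provisioned without SSH" > /tmp/provisioned.txt
EOT
```

With this approach `shell_scripts` would be left unset, so no SSH connection is required during the build.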
For more complex scenarios (staging large amounts of data, running multiple scripts, ansible playbooks) you can use the `custom-image` packer module in concert with the `startup-script` terraform module. This will stage all data and scripts in a GCS bucket and then provide vm metadata with instructions to download and execute each script. This will all happen without needing SSH access. This would entail defining additional runners on L45 for your other scripts and ansible and then following the command line instructions for that example to pass the script from the terraform deployment to the packer module.
The `startup-script` module probably solves my issues too, but I have other Packer builds using Ansible over SSH and I didn't want this one to be so special.
As I mentioned in my comment, I've changed the `ssh_username` to `packer2` and unblocked myself. `use_os_login` also worked. The only thing that doesn't is using the user `packer`.
@stoksc, thank you for this report. We have updated the default username in the HPC Toolkit image-builder module to be `hpc-toolkit-packer`, which sidesteps this issue. I have also alerted the Slurm team to the behavior witnessed in this issue.
I am going to mark this bug as fixed upon our next release, which will have the updated username. Please follow up if you feel that it has not been addressed.
This fix is now on our develop branch which is periodically merged into our main branch and tagged with an official release. As each step in the process is taken, the information immediately below the commit description will automatically update with each branch and tag that contains the fix.