awslabs/amazon-eks-ami

Unclean exit of bootstrap.sh on Neuron instances

GyandeepKalita opened this issue · 6 comments

Hi,
So I was trying to create an instance group with the inf2.xlarge instance type in an EKS cluster. According to the AWS docs (here) and the AWS Neuron docs (here), the EKS-optimized accelerated AMIs should support it. I tried creating it using /aws/service/eks/optimized-ami/1.28/amazon-linux-2-gpu/recommended/image_id as the SSM parameter for the AMI.

But the creation of the instance group failed with the following error message in cloudformation stacks:
Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement.

To troubleshoot further, I SSHed into the EC2 instance and found the following errors in cloud-init.log and cloud-init-output.log:

  • cloud-init.log:
May 30 07:02:00 cloud-init[3101]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-001'] with allowed return codes [0] (shell=True, capture=False)
May 30 07:02:13 cloud-init[3101]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
May 30 07:02:13 cloud-init[3101]: util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 913, in runparts
    subp(prefix + [exe_path], capture=False, shell=True)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 2108, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-001']
Exit code: 1
Reason: -
Stdout: -
Stderr: -
May 30 07:02:13 cloud-init[3101]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
May 30 07:02:13 cloud-init[3101]: handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
May 30 07:02:13 cloud-init[3101]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
May 30 07:02:13 cloud-init[3101]: util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/stages.py", line 851, in _run_modules
    freq=freq)
  File "/usr/lib/python2.7/site-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/site-packages/cloudinit/helpers.py", line 187, in run
    results = functor(*args)
  File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.py", line 45, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 920, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 1 attempted commands
May 30 07:02:13 cloud-init[3101]: stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) with frequency once-per-instance
  • cloud-init-output.log:
+ gpu-ami-util has-nvidia-devices
false
+ echo 'no NVIDIA devices are present, nothing to do!'
no NVIDIA devices are present, nothing to do!
+ exit 0
2024-05-30T07:02:10+0000 [eks-bootstrap] INFO: completed GPU bootstrap helper!
Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
2024-05-30T07:02:10+0000 [eks-bootstrap] INFO: nvidia-smi found
Exited with error on line 649
++ /opt/aws/bin/cfn-signal --exit-code 1 --stack inf2-test --resource NodeGroup --region us-west-2
++ ec2-metadata -t
++ awk -F . '{print $2}'

And line 649, where it fails in /etc/eks/bootstrap.sh, is:

nvidia-smi -q > /tmp/nvidia-smi-check
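That line could be guarded behind the same device check the GPU helper already applies (visible in the cloud-init-output.log above). A minimal sketch; `has_nvidia_devices` is a hypothetical stand-in for `gpu-ami-util has-nvidia-devices`, stubbed here to report `false` as it would on a Neuron (inf2) instance:

```shell
#!/usr/bin/env bash
# Hedged sketch: only run the nvidia-smi query when NVIDIA devices are
# actually present, instead of whenever the nvidia-smi binary exists.
# 'has_nvidia_devices' stands in for 'gpu-ami-util has-nvidia-devices'.
has_nvidia_devices() { echo false; }

maybe_nvidia_check() {
  if command -v nvidia-smi >/dev/null 2>&1 && [ "$(has_nvidia_devices)" = "true" ]; then
    # Only reached on instances with NVIDIA devices attached.
    nvidia-smi -q > /tmp/nvidia-smi-check
    echo "ran nvidia-smi check"
  else
    echo "no NVIDIA devices present; skipping nvidia-smi check"
  fi
}

maybe_nvidia_check
```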

Please let me know how I can resolve this.

The bootstrap script shouldn't exit with a non-zero code in this way -- we need to move this nvidia-smi bit into the GPU helper script to resolve that -- but the use of cfn-signal is what's ultimately causing your CFN stack to fail. I'll get a PR out to fix the unclean termination of the bootstrap script; in the meantime, you can change or disable the cfn-signal bit as a workaround 👍
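One way to sketch that workaround in the node group's user data: capture the bootstrap exit status instead of letting the failure propagate into cfn-signal. This is a sketch, not the AMI's actual user data; `cfn_signal` and `run_bootstrap` are stubs standing in for `/opt/aws/bin/cfn-signal` and `/etc/eks/bootstrap.sh` so the snippet runs anywhere:

```shell
#!/usr/bin/env bash
# Hedged sketch of the interim workaround: don't let the known nvidia-smi
# failure in bootstrap.sh turn into a cfn-signal failure for the stack.
cfn_signal() { echo "cfn-signal exit-code=$1"; }  # stub for /opt/aws/bin/cfn-signal
run_bootstrap() { return 1; }                     # stub: bootstrap.sh exiting 1

# Remember the bootstrap status instead of aborting on it.
run_bootstrap && status=0 || status=$?

# Signal success regardless, since the only failure mode here is the
# known nvidia-smi check rather than a real provisioning problem.
cfn_signal 0
echo "bootstrap exited with ${status}, signaled 0"
```

On a real node, the stubs would be replaced by the actual calls, and `cfn-signal --exit-code 0 --stack … --resource … --region …` would use the stack and resource names from the failing template.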

Hey, thanks for the prompt response!
I am hoping to see the issue being fixed soon.

Hi, when can I expect this fix to land in the published AMIs? I'm somewhat blocked on it, as the workaround isn't applicable to my project.

This will land in an AMI build next week. 👍

Thanks!

This has been resolved 👍