kubernetes-sigs/image-builder

flatcar builds failing with "python: no such file or directory"

mboersma opened this issue · 8 comments

What steps did you take and what happened:

In the build-azure-sigs prow job, the two flatcar builds fail with this error:

==> azure-arm.sig-flatcar-gen2: Downloading spec file and debug info
    azure-arm.sig-flatcar-gen2: Downloading Goss specs from, /tmp/goss-spec.yaml and /tmp/debug-goss-spec.yaml to current dir
==> azure-arm.sig-flatcar-gen2: Provisioning with shell script: /tmp/packer-shell2564298006
==> azure-arm.sig-flatcar-gen2: + [[ flatcar-gen2 != \f\l\a\t\c\a\r* ]]
==> azure-arm.sig-flatcar-gen2: + sudo bash -c '/usr/share/oem/python/bin/python /usr/share/oem/bin/waagent -force -deprovision+user && ln -sf ../run/systemd/resolve/resolv.conf /etc/resolv.conf && sync'
==> azure-arm.sig-flatcar-gen2: bash: line 1: /usr/share/oem/python/bin/python: No such file or directory
==> azure-arm.sig-flatcar-gen2: Provisioning step had errors: Running the cleanup provisioner, if present...
==> azure-arm.sig-flatcar-gen2:

What did you expect to happen:

Anything else you would like to add:

Environment:

Project (Image Builder for Cluster API):

Additional info for Image Builder for Cluster API related issues:

  • OS (e.g. from /etc/os-release, or cmd /c ver):
  • Packer Version:
  • Packer Provider:
  • Ansible Version:
  • Cluster-api version (if using):
  • Kubernetes version: (use kubectl version):

/kind bug

/assign

What I can tell so far is this:

  • The paths to both python and waagent have changed in current Flatcar stable builds
  • Even after updating the paths, the +user part of the waagent -force -deprovision+user command seems broken: it exits with a Python stack trace

@invidian does this ring a bell? The command runs as the temporary packer user:

sudo bash -c '/usr/sbin/waagent -force -deprovision+user && ln -sf ../run/systemd/resolve/resolv.conf /etc/resolv.conf && sync'

with this error:

+ sudo bash -c '/usr/sbin/waagent -force -deprovision+user && ln -sf ../run/systemd/resolve/resolv.conf /etc/resolv.conf && sync'
    WARNING! The waagent service will be stopped.
    WARNING! Cached DHCP leases will be deleted.
    WARNING! /etc/resolv.conf will be deleted.
    WARNING! packer account and entire home directory will be deleted.
    WARNING! /etc/machine-id will be removed.
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/agent.py", line 263, in main
    agent.deprovision(force, deluser=True)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/agent.py", line 155, in deprovision
    deprovision_handler.run(force=force, deluser=deluser)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/pa/deprovision/default.py", line 221, in run
    self.do_actions(actions)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/pa/deprovision/default.py", line 241, in do_actions
    action.invoke()
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/pa/deprovision/default.py", line 57, in invoke
    self.func(*self.args, **self.kwargs)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/common/osutil/default.py", line 1342, in del_account
    if self.is_sys_user(username):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/common/osutil/coreoscommon.py", line 29, in is_sys_user
    return super(CoreOSUtil, self).is_sys_user(username)
                 ^^^^^^^^^^
NameError: name 'CoreOSUtil' is not defined
==>
During handling of the above exception, another exception occurred:
==> azure-arm.sig-flatcar:
Traceback (most recent call last):
  File "/usr/sbin/waagent", line 39, in <module>
    agent.main()
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/agent.py", line 283, in main
    textutil.format_exception(e))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/common/utils/textutil.py", line 448, in format_exception
    msg += ''.join(traceback.format_exception(etype=type(exception), value=exception, tb=tb))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: format_exception() got an unexpected keyword argument 'etype'

If I run the same command with just -deprovision (not -deprovision+user), it succeeds. It also succeeds if I run it interactively on a Flatcar VM as the core user: looking at the Python code, it returns early if user=="core".
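A minimal sketch of the failure mode in the first traceback, assuming the override looks roughly like the frame at coreoscommon.py line 29. The class names DefaultOSUtilSketch, BrokenCoreOSCommon and FixedCoreOSCommon are hypothetical stand-ins, not the actual waagent or Flatcar code, and the value returned by the early "core" branch is an assumption. It illustrates why the core user never reaches the broken line while the packer user does.

```python
class DefaultOSUtilSketch:
    """Hypothetical stand-in for the base OS utility class."""

    def is_sys_user(self, username):
        # Assumption: the real check is more involved; a fixed set keeps the sketch small.
        return username in {"root", "daemon"}


class BrokenCoreOSCommon(DefaultOSUtilSketch):
    """Mimics the traceback: super() is given a class name this module never defines."""

    def is_sys_user(self, username):
        if username == "core":
            # Early return (assumed value), so the line below is never evaluated for core.
            return False
        # 'CoreOSUtil' is looked up at call time and does not exist here -> NameError.
        return super(CoreOSUtil, self).is_sys_user(username)


class FixedCoreOSCommon(DefaultOSUtilSketch):
    """One possible repair: zero-argument super() does not depend on a class name."""

    def is_sys_user(self, username):
        if username == "core":
            return False
        return super().is_sys_user(username)


if __name__ == "__main__":
    print(BrokenCoreOSCommon().is_sys_user("core"))   # early return, no error
    print(FixedCoreOSCommon().is_sys_user("packer"))  # falls through to the base class
    try:
        BrokenCoreOSCommon().is_sys_user("packer")
    except NameError as err:
        print("NameError:", err)
```

Whether the real fix uses zero-argument super() or simply names the correct class is not stated in this thread; the sketch only shows the pattern.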

I wonder whether the problem is that this condition is not matching:

==> azure-arm.sig-flatcar-gen2: + [[ flatcar-gen2 != \f\l\a\t\c\a\r* ]]

EDIT: ah wait, it should match, I missed the asterisk...
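For readers puzzling over the same trace line: in bash xtrace output the backslash-escaped letters are literal characters and the unescaped asterisk is a wildcard, so the right-hand side is the glob flatcar*. A rough Python equivalent using fnmatch, which behaves the same as bash for a pattern this simple; the helper name is_flatcar_build and the non-Flatcar build name in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase


def is_flatcar_build(build_name: str) -> bool:
    """Return True when build_name matches the glob 'flatcar*' (case-sensitive)."""
    return fnmatchcase(build_name, "flatcar*")


if __name__ == "__main__":
    # The build name from the log matches, so the `!=` test in the script is false.
    print(is_flatcar_build("flatcar-gen2"))
```

A name such as ubuntu-2204 (hypothetical here) would not match, so the `!=` test would be true for it.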

@jepio @dongsupark since you work closer with Flatcar, any ideas what could be happening or what could have recently changed?

Yeah, several things are going on here:

  • python and waagent moved to the system locations /usr/bin and /usr/sbin
  • Flatcar has a downstream patch that was reworked and has a bug in the deprovision case
  • waagent hits an additional exception while handling the first one, because it wasn't tested with Python newer than 3.10 and Flatcar moved to Python 3.11

I've posted a PR; let's test it to see whether it gets this unblocked.
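On the second exception in the log: it comes from calling traceback.format_exception with the etype= keyword. As far as I know, Python 3.10 renamed that parameter to exc and made it positional-only, so the keyword form raises TypeError on newer interpreters. A minimal sketch of a call that works on both older and newer Python 3 releases by passing the arguments positionally; the helper name format_exception_compat is hypothetical, and this is not necessarily how the posted PR handles it.

```python
import traceback


def format_exception_compat(exception):
    """Format an exception and its traceback without using the 'etype' keyword."""
    tb = exception.__traceback__
    # Positional arguments are accepted both before and after the parameter rename.
    return "".join(traceback.format_exception(type(exception), exception, tb))


if __name__ == "__main__":
    try:
        raise ValueError("boom")
    except ValueError as err:
        print(format_exception_compat(err))
```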