flatcar builds failing with "python: no such file or directory"
mboersma opened this issue · 8 comments
What steps did you take and what happened:
In the build-azure-sigs prow job, the two flatcar builds fail with this error:
==> azure-arm.sig-flatcar-gen2: Downloading spec file and debug info
    azure-arm.sig-flatcar-gen2: Downloading Goss specs from, /tmp/goss-spec.yaml and /tmp/debug-goss-spec.yaml to current dir
==> azure-arm.sig-flatcar-gen2: Provisioning with shell script: /tmp/packer-shell2564298006
==> azure-arm.sig-flatcar-gen2: + [[ flatcar-gen2 != \f\l\a\t\c\a\r* ]]
==> azure-arm.sig-flatcar-gen2: + sudo bash -c '/usr/share/oem/python/bin/python /usr/share/oem/bin/waagent -force -deprovision+user && ln -sf ../run/systemd/resolve/resolv.conf /etc/resolv.conf && sync'
==> azure-arm.sig-flatcar-gen2: bash: line 1: /usr/share/oem/python/bin/python: No such file or directory
==> azure-arm.sig-flatcar-gen2: Provisioning step had errors: Running the cleanup provisioner, if present...
==> azure-arm.sig-flatcar-gen2:
What did you expect to happen:
Anything else you would like to add:
Environment:
Project (Image Builder for Cluster API:
Additional info for Image Builder for Cluster API related issues:
- OS (e.g. from /etc/os-release, or cmd /c ver):
- Packer Version:
- Packer Provider:
- Ansible Version:
- Cluster-api version (if using):
- Kubernetes version: (use kubectl version):
/kind bug
/assign
See https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_image-builder/1391/pull-azure-sigs/1755096352955568128 as an example of the failure.
What I can tell so far is this:
- The paths to both python and waagent have changed in current Flatcar stable builds
- Even after updating the paths, the +user part of the waagent -force -deprovision+user command seems broken, exiting with a Python stack trace
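One way to tolerate the path change is to probe for the binary instead of hard-coding a single location. The sketch below is only an illustration, not what the provisioning script does: `first_existing` and `WAAGENT_CANDIDATES` are hypothetical names, and the two candidate paths are the waagent locations that appear in the logs in this thread.

```python
import os
from typing import Iterable, Optional


def first_existing(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate path that exists as a regular file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Candidate locations taken from the logs in this issue (hypothetical ordering):
# the newer system location first, then the older OEM location.
WAAGENT_CANDIDATES = ["/usr/sbin/waagent", "/usr/share/oem/bin/waagent"]

if __name__ == "__main__":
    # Prints the resolved path, or None when neither location exists on this machine.
    print(first_existing(WAAGENT_CANDIDATES))
```

Probing keeps the same script usable on images built before and after the move, at the cost of hiding a missing binary until the result is checked.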
@invidian does this ring a bell? The command runs as the temporary packer user:
sudo bash -c '/usr/sbin/waagent -force -deprovision+user && ln -sf ../run/systemd/resolve/resolv.conf /etc/resolv.conf && sync'
with this error:
+ sudo bash -c '/usr/sbin/waagent -force -deprovision+user && ln -sf ../run/systemd/resolve/resolv.conf /etc/resolv.conf && sync'
WARNING! The waagent service will be stopped.
WARNING! Cached DHCP leases will be deleted.
WARNING! /etc/resolv.conf will be deleted.
WARNING! packer account and entire home directory will be deleted.
WARNING! /etc/machine-id will be removed.
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/agent.py", line 263, in main
    agent.deprovision(force, deluser=True)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/agent.py", line 155, in deprovision
    deprovision_handler.run(force=force, deluser=deluser)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/pa/deprovision/default.py", line 221, in run
    self.do_actions(actions)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/pa/deprovision/default.py", line 241, in do_actions
    action.invoke()
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/pa/deprovision/default.py", line 57, in invoke
    self.func(*self.args, **self.kwargs)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/common/osutil/default.py", line 1342, in del_account
    if self.is_sys_user(username):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/common/osutil/coreoscommon.py", line 29, in is_sys_user
    return super(CoreOSUtil, self).is_sys_user(username)
                 ^^^^^^^^^^
NameError: name 'CoreOSUtil' is not defined
==>
During handling of the above exception, another exception occurred:
==> azure-arm.sig-flatcar:
Traceback (most recent call last):
  File "/usr/sbin/waagent", line 39, in <module>
    agent.main()
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/agent.py", line 283, in main
    textutil.format_exception(e))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/common/utils/textutil.py", line 448, in format_exception
    msg += ''.join(traceback.format_exception(etype=type(exception), value=exception, tb=tb))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: format_exception() got an unexpected keyword argument 'etype'
If I run the same command with just -deprovision (not -deprovision+user), it succeeds. Or if I run it interactively on a Flatcar VM (as the core user) it succeeds. Looking at the Python code, it returns early if user=="core".
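The NameError in the first traceback is consistent with that observation: a class name passed to the two-argument form of super() is only looked up when the line executes, so an early return for "core" would hide a stale name. A minimal sketch of that pattern follows, under assumptions: the class names `DefaultOSUtil`, `BrokenUtil`, `FixedUtil` and `RenamedAwayUtil` are hypothetical, the early return value is a guess, and this is not the actual Flatcar patch.

```python
class DefaultOSUtil:
    def is_sys_user(self, username):
        # Placeholder rule for this sketch only; not the agent's real logic.
        return username == "root"


class BrokenUtil(DefaultOSUtil):
    def is_sys_user(self, username):
        if username == "core":
            # Assumed return value; the thread only says the code returns early.
            return True
        # RenamedAwayUtil is not defined anywhere in this module, so this line
        # raises NameError, but only when a non-"core" user reaches it.
        return super(RenamedAwayUtil, self).is_sys_user(username)  # noqa: F821


class FixedUtil(DefaultOSUtil):
    def is_sys_user(self, username):
        if username == "core":
            return True
        # Zero-argument super() does not repeat the class name, so it cannot go stale.
        return super().is_sys_user(username)
```

With these definitions, `BrokenUtil().is_sys_user("core")` returns normally while `BrokenUtil().is_sys_user("packer")` raises NameError, which matches the difference between the interactive run as core and the Packer run as packer.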
I wonder if this is the problem with this condition not matching:
==> azure-arm.sig-flatcar-gen2: + [[ flatcar-gen2 != \f\l\a\t\c\a\r* ]]
EDIT: ah wait, it should match, I missed the asterisk...
@jepio @dongsupark since you work closer with Flatcar, any ideas what could be happening or what could have recently changed?
Yeah, several things are going on here:
- python and waagent moved to the system locations /usr/bin and /usr/sbin
- Flatcar has a downstream patch that was reworked and has a bug in the deprovision case
- waagent hits an additional exception while handling the first one, because it wasn't tested with >python3.10 and Flatcar moved to python3.11
I've posted a PR, let's test it to see if it gets this unblocked.
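For the second exception, the TypeError comes from calling traceback.format_exception with the etype keyword, which Python 3.10 and later no longer accept (the first parameter became positional-only). A small sketch of the difference, assuming nothing about how the PR or upstream waagent addresses it; both function names below are hypothetical:

```python
import traceback


def format_exception_keyword(exc):
    # Mirrors the keyword-style call shown in the traceback above.
    # Raises TypeError on Python 3.10 and later.
    return "".join(
        traceback.format_exception(etype=type(exc), value=exc, tb=exc.__traceback__)
    )


def format_exception_positional(exc):
    # Passing the three arguments positionally is accepted by both older
    # and newer Python versions.
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


if __name__ == "__main__":
    try:
        raise NameError("name 'CoreOSUtil' is not defined")
    except NameError as caught:
        print(format_exception_positional(caught))
```

The positional form is used here only because it avoids a version check; a fix could equally branch on sys.version_info.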