kubernetes-sigs/image-builder

Cloud-init fails for ubuntu 20.04 base AMI and Cloud-init version '23.3.1-0ubuntu1~20.04.1'

supershal opened this issue · 9 comments

What steps did you take and what happened:

The latest cloud-init version, 23.3.1-0ubuntu1~20.04.1, which ships with the base AMI for Ubuntu 20.04, is unable to run the boothook (https://cloudinit.readthedocs.io/en/latest/explanation/format.html#cloud-boothook) provided by CAPA: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/0bf78b04b305a77aec37a68c107102231faa7a16/pkg/cloud/services/secretsmanager/secret_fetch_script.go#L20
As a result, the CAPA VMs do not initialize as expected.

Steps to reproduce:

  1. Create an AMI using image-builder:
     make build-ami-ubuntu-2004
  2. Create a CAPA cluster using the AMI created in step 1, following the instructions at: https://cluster-api-aws.sigs.k8s.io/getting-started.html
  3. Check the logs at /var/log/cloud-init-output.log

What did you expect to happen:
Cloud-init runs successfully on the VM.

Anything else you would like to add:
Log from cloud-init.

[2023-10-24 18:53:21] 2023-10-24 18:53:21,892 - util.py[WARNING]: failed stage init
[2023-10-24 18:53:21] failed run of stage init
[2023-10-24 18:53:21] ------------------------------------------------------------
[2023-10-24 18:53:21] Traceback (most recent call last):
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 78, in read_file_or_url
[2023-10-24 18:53:21]     with open(file_path, "rb") as fp:
[2023-10-24 18:53:21] FileNotFoundError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] The above exception was the direct cause of the following exception:
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] Traceback (most recent call last):
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 238, in _do_include
[2023-10-24 18:53:21]     resp = read_file_or_url(
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 84, in read_file_or_url
[2023-10-24 18:53:21]     raise UrlError(cause=e, code=code, headers=None, url=url) from e
[2023-10-24 18:53:21] cloudinit.url_helper.UrlError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] The above exception was the direct cause of the following exception:
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] Traceback (most recent call last):
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 766, in status_wrapper
[2023-10-24 18:53:21]     ret = functor(name, args)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 453, in main_init
[2023-10-24 18:53:21]     init.update()
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 484, in update
[2023-10-24 18:53:21]     self._store_processeddata(self.datasource.get_userdata(), "userdata")
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 599, in get_userdata
[2023-10-24 18:53:21]     self.userdata = self.ud_proc.process(self.get_userdata_raw())
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 88, in process
[2023-10-24 18:53:21]     self._process_msg(convert_string(blob), accumulating_msg)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 159, in _process_msg
[2023-10-24 18:53:21]     self._do_include(payload, append_msg)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 264, in _do_include
[2023-10-24 18:53:21]     _handle_error(message, urle)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 72, in _handle_error
[2023-10-24 18:53:21]     raise RuntimeError(error_message) from source_exception
[2023-10-24 18:53:21] RuntimeError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt' for url: file:///etc/secret-userdata.txt
[2023-10-24 18:53:21] ------------------------------------------------------------
[2023-10-24 18:53:40] Cloud-init v. 23.3.1-0ubuntu1~20.04.1 running 'modules:config' at Tue, 24 Oct 2023 18:53:37 +0000. Up 42.69 seconds.
[2023-10-24 18:53:40] Cloud-init v. 23.3.1-0ubuntu1~20.04.1 running 'modules:final' at Tue, 24 Oct 2023 18:53:40 +0000. Up 46.25 seconds.
[2023-10-24 18:53:40] Cloud-init v. 23.3.1-0ubuntu1~20.04.1 finished at Tue, 24 Oct 2023 18:53:40 +0000. Datasource DataSourceEc2Local.  Up 46.42 second
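
When checking several instances, a small helper can pull the failed stage and the final exception out of a log like the one above. This is only a minimal sketch and not part of cloud-init or image-builder: the function name summarize_failures and both regular expressions are assumptions derived from the line format shown in this log.

```python
import re

# Matches lines such as "[<timestamp>] failed run of stage init".
STAGE_FAILURE = re.compile(r"failed run of stage (\S+)")
# Matches un-indented exception lines such as "[<timestamp>] RuntimeError: ...".
EXCEPTION_LINE = re.compile(r"^\[[^\]]*\] ([\w.]+Error: .+)$")


def summarize_failures(log_text):
    """Return (failed_stages, last_error) found in cloud-init-output.log text.

    failed_stages is a list of stage names in the order they were reported;
    last_error is the last exception line seen, or None if there was none.
    """
    failed_stages = []
    last_error = None
    for line in log_text.splitlines():
        stage = STAGE_FAILURE.search(line)
        if stage:
            failed_stages.append(stage.group(1))
        exc = EXCEPTION_LINE.match(line)
        if exc:
            last_error = exc.group(1)
    return failed_stages, last_error
```

Passing the contents of /var/log/cloud-init-output.log to summarize_failures should report the init stage and the RuntimeError about /etc/secret-userdata.txt for the log in this issue.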

Environment:

Project (Image Builder for Cluster API):

Additional info for Image Builder for Cluster API related issues:

  • OS (e.g. from /etc/os-release, or cmd /c ver): ubuntu-20.04
  • Packer Version:
  • Packer Provider:
  • Ansible Version:
  • Cluster-api version (if using):
  • Kubernetes version: (use kubectl version):

/kind bug

We were able to downgrade cloud-init to 23.2.1-0ubuntu0~20.04.2 and create a cluster successfully; see mesosphere/konvoy-image-builder#938.
cc: @voor @cnmcavoy

We are still not sure of the root cause, or of which change in cloud-init resulted in this issue.

I was able to provide the following override file to image-builder and build an AMI that runs the CAPA cloud-init script successfully.
pin-cloud-init-override.json:

{
    "ansible_extra_vars": "pinned_debs=\"cloud-init=23.1.2-0ubuntu0~20.04.2\""
}

I built the image using the following image-builder Makefile target:
make build-ami-ubuntu-2004 PACKER_VAR_FILES=pin-cloud-init-override.json
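
The override file can also be generated rather than written by hand, which avoids mistakes in the nested quoting. The following is a minimal sketch: the helper name pinned_debs_override is hypothetical, and it only reproduces the single-package form of ansible_extra_vars shown above.

```python
import json


def pinned_debs_override(package, version):
    """Build the Packer var-file content that pins one Debian package.

    The result mirrors pin-cloud-init-override.json from this issue: a single
    "ansible_extra_vars" key whose value sets pinned_debs to "<package>=<version>".
    """
    return {"ansible_extra_vars": f'pinned_debs="{package}={version}"'}


if __name__ == "__main__":
    override = pinned_debs_override("cloud-init", "23.1.2-0ubuntu0~20.04.2")
    # json.dumps escapes the inner double quotes, as in the hand-written file.
    print(json.dumps(override, indent=4))
```

Redirecting the printed JSON to pin-cloud-init-override.json gives a file that can be passed via PACKER_VAR_FILES as in the make command above.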

We now have to investigate which changes in 23.3.1-0ubuntu1~20.04.1 broke the CAPA cloud-init script.

voor commented

Moving over some comments from slack so they're not lost in the sands of time:

  • AWS mirrors do not seem to keep all versions of cloud-init consistently, so we needed to download the Debian package from elsewhere and host it.
  • Pinning the version seems to resolve the issue.
  • This might be related to #406, which historically caused issues with CAPA.

- name: Downgrade cloud init.
  apt:
    deb: http://launchpadlibrarian.net/679992659/cloud-init_23.2.2-0ubuntu0~20.04.1_all.deb
    state: present
    force: true

- name: Pin cloud init to prevent version issues.
  dpkg_selections:
    name: "{{ item }}"
    selection: hold
  loop:
    - cloud-init

For image-builder users who have hit this bug and are reading this issue:

We believe the root cause to be in cloud-init, and would like to fix it there (see canonical/cloud-init#4572). We prefer this to the alternative, which is to "pin" an older, known-good cloud-init version in image-builder itself.

For now, if you use image-builder to create an Ubuntu 20.04 AMI, please use the workaround described in #1333 (comment).

This might be related to #406 which historically caused issues with CAPA.

@supershal and I found that the feature override mechanism used in #406 does not work in the recent versions of cloud-init in Ubuntu 20.04. This mechanism was removed from cloud-init in canonical/cloud-init#4228.

Patching cloud-init is the officially documented mechanism now:

Currently used upstream values for feature flags are set in cloudinit/features.py. Overrides to these values should be patched directly (e.g., via quilt patch) by downstreams.

I guess modifying the cloud-init python module to set ERROR_ON_USER_DATA_FAILURE = False is something image-builder can do for now. But once Ubuntu 20.04 is EOL, the feature flag itself will be removed.
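
To illustrate what the flag changes, here is a simplified model of the behavior seen in the traceback. It is not cloud-init's actual source: the names read_include and handle_include_error are hypothetical, and the default of True is an assumption chosen to match the failure in this issue. With the flag enabled, a missing include file raises RuntimeError and the stage fails; with it disabled, the error is logged as a warning and processing continues.

```python
import logging

LOG = logging.getLogger("user_data_sketch")

# Stand-in for the flag in cloudinit/features.py (assumed default for this sketch).
ERROR_ON_USER_DATA_FAILURE = True


def handle_include_error(message, source_exception, error_on_failure):
    """Raise when the flag is set; otherwise only log a warning."""
    if error_on_failure:
        raise RuntimeError(message) from source_exception
    LOG.warning(message)


def read_include(path, error_on_failure=None):
    """Read an included user-data file, returning None if it cannot be read."""
    if error_on_failure is None:
        error_on_failure = ERROR_ON_USER_DATA_FAILURE
    try:
        with open(path, "rb") as fp:
            return fp.read()
    except OSError as exc:
        handle_include_error(f"{exc} for url: file://{path}", exc, error_on_failure)
        return None
```

In this model, read_include("/etc/secret-userdata.txt", error_on_failure=False) returns None when the file does not exist yet, whereas the default raises, which corresponds to the "failed run of stage init" reported above.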

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
