bottlerocket-os/bottlerocket

`host-ctr` cli crashes when pulling public ECR image

taraspos opened this issue · 11 comments

The host-ctr CLI crashes with a panic when it tries to pull any public ECR image, while private ECR images pull fine.

Image | Can pull?
328549459982.dkr.ecr.us-east-1.amazonaws.com/bottlerocket-control:v0.7.12 | Yes
public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12 | No (panics)

Image I'm using:

bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.19.4 (aws-k8s-1.28)"
PRETTY_NAME="Bottlerocket OS 1.19.4 (aws-k8s-1.28)"
VARIANT_ID=aws-k8s-1.28
VERSION_ID=1.19.4
BUILD_ID=4f0a078e
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"
DOCUMENTATION_URL="https://bottlerocket.dev"

What I expected to happen:
The public ECR image is pulled successfully.

What actually happened:

Running host-ctr run --source public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12 --container-id test results in:

time="2024-06-17T12:25:22Z" level=info msg="Image does not exist, proceeding to pull image from source." ref="public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x557adfa3a83d]

goroutine 1 [running]:
main.withDynamicResolver({0x557ae03822d8?, 0xc0006b1200}, {0x7ffce586eebc, 0x38}, 0x0)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:1150 +0x19d
main.pullImage({0x557ae03822d8, 0xc0006b1200}, {0x7ffce586eebc, 0x38}, 0x38?, {0x0?, 0xc00071d308?}, 0xc00064b1d0)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:1046 +0x39e
main.fetchImage({0x557ae03822d8, 0xc0006b1200}, {0x7ffce586eebc, 0x38}, 0x557adfa46468?, {0x0, 0x0}, 0x0, 0x0?)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:1013 +0x3e7
main.runCtr({0x557adfa8c224, 0x24}, {0x557adfa46468, 0x7}, {0x7ffce586ef04, 0x4}, {0x7ffce586eebc, 0x38}, 0x0, {0x0, ...}, ...)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:299 +0x467
main.App.func1(0xc0004c4000?)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:144 +0x93
github.com/urfave/cli/v2.(*Command).Run(0xc0004c4000, 0xc0004b8d40, {0xc0004bc320, 0x5, 0x5})
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/vendor/github.com/urfave/cli/v2/command.go:279 +0x9dd
github.com/urfave/cli/v2.(*Command).Run(0xc0004c51e0, 0xc0004b8500, {0xc0000401e0, 0x6, 0x6})
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/vendor/github.com/urfave/cli/v2/command.go:272 +0xc2e
github.com/urfave/cli/v2.(*App).RunContext(0xc000156e00, {0x557ae0382268?, 0x557ae10db440}, {0xc0000401e0, 0x6, 0x6})
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/vendor/github.com/urfave/cli/v2/app.go:337 +0x5db
github.com/urfave/cli/v2.(*App).Run(...)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/vendor/github.com/urfave/cli/v2/app.go:311
main.main()
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:60 +0x3f

	// For Amazon ECR Public registries, we should try and fetch credentials before resolving the image reference
	case strings.HasPrefix(ref, "public.ecr.aws/"):
		// ... not if the user has specified their own registry credentials for 'public.ecr.aws'; In that case we use the default resolver.
		if _, found := registryConfig.Credentials["public.ecr.aws"]; found {
			return defaultResolver
		}

How to reproduce the problem:

  1. Connect to Bottlerocket node
  2. enter-admin-container
  3. sudo sheltie
  4. host-ctr run --source public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12 --container-id test

Thanks for the report (and thanks for the very clear reproduction instructions, in particular).

Initial triage says:

  • Yes, this reproduces as advertised on our latest release. Not a big surprise, since this code hasn't changed recently, but worth noting.
  • We may not have encountered this earlier because the default URL for this container (at least on my aws-eks variant node) points to a private repository rather than public.ecr.aws.
  • Given the code that is failing here, there's a clear expectation that this should work, and at the very least, not segfault.

The segfault occurs because the caller passes a nil registryConfig pointer to withDynamicResolver (the third argument, 0x0, in the first stack frame), and the function dereferences it when it checks registryConfig.Credentials. The fix seems simple enough: don't dereference the nil pointer. Thanks again for the report.
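To make the failure mode concrete, here is a minimal sketch of a nil guard around the credentials lookup. It is not the actual host-ctr patch: the RegistryConfig type, the string-valued Credentials map, and the hasUserCredentials and chooseResolver helpers are hypothetical stand-ins, and the resolver is reduced to a label so the example runs on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// RegistryConfig is a hypothetical stand-in for host-ctr's registry
// configuration type. Only the field used in the quoted snippet is modeled.
type RegistryConfig struct {
	Credentials map[string]string
}

// hasUserCredentials reports whether the user supplied credentials for host.
// The nil check is the point of this sketch: without it, a nil cfg panics
// on cfg.Credentials, which matches the nil pointer dereference reported above.
func hasUserCredentials(cfg *RegistryConfig, host string) bool {
	if cfg == nil {
		return false
	}
	_, found := cfg.Credentials[host]
	return found
}

// chooseResolver returns a label for the resolver that would be selected.
// The real function returns a resolver value; a string keeps this runnable.
func chooseResolver(ref string, cfg *RegistryConfig) string {
	const publicECR = "public.ecr.aws"
	if strings.HasPrefix(ref, publicECR+"/") && !hasUserCredentials(cfg, publicECR) {
		return "ecr-public"
	}
	return "default"
}

func main() {
	ref := "public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12"
	// A nil config no longer panics; it is treated as "no user credentials".
	fmt.Println(chooseResolver(ref, nil))
}
```

Treating a nil config as "no user-supplied credentials" keeps the public ECR path working when no registry config is passed, which appears to be the behavior the surrounding code expects.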

A little more context: the host-ctr executable is invoked by systemd services (see the boot-containers@ and host-containers@ services in packages/os). Those service files supply the --registry-config option, so host-ctr does not segfault there. If you wish to use host-ctr outside of those services, you can work around this problem by adding --registry-config /dev/null to your own invocation of host-ctr.
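One plausible reason the /dev/null workaround helps, sketched under the assumption that loading an empty config file yields a non-nil but empty configuration: in Go, reading a field through a nil pointer panics, while looking up a key in a nil map does not. The RegistryConfig type and lookup helper below are hypothetical and do not reproduce host-ctr's own loading code.

```go
package main

import "fmt"

// RegistryConfig is a hypothetical stand-in for host-ctr's config type.
type RegistryConfig struct {
	Credentials map[string]string
}

// lookup reports whether credentials exist for host. A deferred recover
// turns a runtime panic into an error so both cases can be compared.
func lookup(cfg *RegistryConfig, host string) (found bool, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	_, found = cfg.Credentials[host] // panics only when cfg itself is nil
	return found, nil
}

func main() {
	// No --registry-config given: the pointer is nil and the field read panics.
	_, err := lookup(nil, "public.ecr.aws")
	fmt.Println("nil config error:", err)

	// Assumed result of loading an empty file: non-nil struct, nil map.
	// A lookup in a nil map is safe and simply reports "not found".
	found, err := lookup(&RegistryConfig{}, "public.ecr.aws")
	fmt.Println("empty config:", found, err)
}
```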

I have verified that settings.host-containers.control.source can be a public ECR URI. For production, you can set this via user data on your worker instances.

@taraspos did @larvacea's comment resolve your issue?

If you wish to use host-ctr outside of those services, you can work around this problem by adding --registry-config /dev/null to your own invocation of host-ctr.

Hey @yeazelm,
yes, using this workaround prevents host-ctr from crashing

Awesome! Glad to hear this got you unblocked. I'll resolve this issue then.

I'm not sure resolving the issue is the right approach. Even though the panic in the CLI can be worked around, it should be fixed in the long term.

I'll reopen this then to track fixing the original issue on the panic.

I have a fix progressing through the pipeline. I'll keep this issue updated.