awslabs/amazon-s3-find-and-forget

New deployments fail at container build step

ctd opened this issue · 5 comments

ctd commented

Several people have reported now that the CloudFormation stack deployment fails when deploying the main S3 Find and Forget solution stack.

The error appears as a failure to create the WaitForContainerBuild resource in the DeployStack. On closer inspection of the associated CodeBuild job, the failure is caused by rate limiting when pulling the base image from Dockerhub:

Step 1/23 : ARG src_path=backend/ecs_tasks/delete_files
Step 2/23 : ARG layers_path=backend/lambda_layers
Step 3/23 : FROM python:3.7-slim as base
toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading:
https://www.docker.com/increase-rate-limit
[Container] 2020/11/24 18:02:26 Command did not exit successfully docker build --tag "$IMAGE_URI" -f
backend/ecs_tasks/delete_files/Dockerfile . exit status 1
[Container] 2020/11/24 18:02:26 Phase complete: BUILD State: FAILED

By its nature, this error cannot be reproduced reliably, but it happens often enough that it should be expected when deploying at this time.
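For triage, the signature in the log above can be checked programmatically. This is a minimal sketch, assuming the CodeBuild log has already been fetched as plain text; the function name `hit_pull_rate_limit` is hypothetical and not part of the solution.

```python
# Marker that Docker prints when the registry refuses a pull due to rate limiting
# (taken from the build log quoted in this issue).
RATE_LIMIT_MARKER = "toomanyrequests:"


def hit_pull_rate_limit(build_log: str) -> bool:
    """Return True if any line of the build log starts with the rate-limit marker."""
    return any(
        line.lstrip().startswith(RATE_LIMIT_MARKER)
        for line in build_log.splitlines()
    )


if __name__ == "__main__":
    sample = (
        "Step 3/23 : FROM python:3.7-slim as base\n"
        "toomanyrequests: You have reached your pull rate limit.\n"
    )
    print(hit_pull_rate_limit(sample))
```

A build that failed for this reason is a candidate for a plain retry; other failures need a closer look.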

More info about the Dockerhub rate limit: https://aws.amazon.com/blogs/containers/advice-for-customers-dealing-with-docker-hub-rate-limits-and-a-coming-soon-announcement/

All versions are affected. We are working on a resolution. There are no current workarounds, apart from locating and retrying a failed build for the Backend (CodePipeline > Pipelines > S3F2-DeployStack > Backend > Retry failed build) if the WaitForContainerBuild custom resource takes a long time to stabilize during a deployment.

After the ChangeSet gets to the FAILED state, I don't see anything in the CodePipeline and CodeBuild dashboards; they're completely blank. How can I get an option to re-run there?

ctd commented

There are a few options to address this that come to mind.

Mirror the base image ourselves

This would be my choice at this time.

Instead of CodeBuild pulling the base image from Dockerhub, we mirror a copy that the build pulls from. We can use ECR for this, or we could host a tarball that we retrieve and load using docker load. ECR makes more sense and would be my preference. The only downside is that anyone retrieving the image will need working AWS credentials with the ecr:GetAuthorizationToken permission, which isn't a problem for deployments, as we can provision that for the IAM role used by CodeBuild.

Main downside to this is we will need to build a process to regularly mirror the latest image from Dockerhub.
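To illustrate what pulling from a mirror could mean for the Dockerfile, here is a minimal sketch that rewrites `FROM` lines to point at a mirror registry. The registry URI and the `use_mirror` helper are assumptions for illustration only; the thread does not say how the fix is implemented, and `FROM` lines that carry flags such as `--platform` are left untouched by this sketch.

```python
import re

# Hypothetical mirror location; the thread does not name one.
MIRROR_REGISTRY = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/mirror"

# Matches "FROM <image> [AS <stage>]" lines.
FROM_RE = re.compile(r"^(FROM\s+)(?P<image>\S+)(?P<rest>.*)$", re.IGNORECASE)
AS_RE = re.compile(r"\s+as\s+(\S+)", re.IGNORECASE)


def use_mirror(dockerfile_text: str, mirror: str = MIRROR_REGISTRY) -> str:
    """Prefix Docker Hub official images in FROM lines with the mirror registry.

    Images that already contain a "/" (a registry host or namespace), the
    special "scratch" image, and references to earlier build stages are kept.
    """
    stages = set()
    out = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if match:
            image = match.group("image")
            if "/" not in image and image != "scratch" and image not in stages:
                line = f"{match.group(1)}{mirror}/{image}{match.group('rest')}"
            stage = AS_RE.search(match.group("rest"))
            if stage:
                stages.add(stage.group(1))
        out.append(line)
    trailing = "\n" if dockerfile_text.endswith("\n") else ""
    return "\n".join(out) + trailing


if __name__ == "__main__":
    print(use_mirror("FROM python:3.7-slim as base\nFROM base as builder\n"))
```

Only the base image reference changes; the rest of the build is unaffected, which keeps forks free to point at a registry of their own.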

Build and host the solution image

This is similar to the previous option, in that we'd be hosting the container image for all deployments to retrieve from -- but instead of mirroring the base, we build the final image and deployments just pull this.

Upside: we can provide better assurances of a working image and minimise deployment-time errors.

Downsides: it makes it more difficult to fork/customise the solution and deploy your own version, and more work is required (we want to turn around a resolution to this issue as quickly as possible).

Provide a mechanism to specify Dockerhub login credentials

Dockerhub doesn't apply limits to some authenticated accounts. We can give customers the option to use their account credentials at build-time.

Downsides: this puts a lot of onus on the customer to care about Dockerhub (not core to this solution), and we'd have to handle credentials (strong preference not to do this).

After the ChangeSet gets to the FAILED state, I don't see anything in the CodePipeline and CodeBuild dashboards; they're completely blank. How can I get an option to re-run there?

If that is a first deployment, the resources are probably deleted after the failure. I guess you can try re-running in two scenarios:

  1. you are updating an existing stack
  2. you are creating a new stack, but the WaitForContainerBuild Lambda is still waiting (and the CodeBuild build has already failed)

This would be my choice at this time.

+1

We can use ECR for this, or we could host a tarball that we retrieve and load using docker load.

To clarify, given that ECR Public is not available yet, the only option right now would be to use docker load with a hosted tarball, right?

Main downside to this is we will need to build a process to regularly mirror the latest image from Dockerhub.

True, but that would be done just in our GitHub Action during release, right? At least we can monitor and retry (or just authenticate) on our side rather than hoping to do that on the CodePipeline/customer side.
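The "monitor and retry" idea for the release-time mirroring step could look roughly like the sketch below: a generic retry wrapper with exponential backoff. The helper name, the attempt count, and the delays are assumptions chosen for illustration, not values taken from the project.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed.

    The wait doubles after each failure: base_delay, 2 * base_delay, and so on.
    The last failure is re-raised so the release job still fails visibly.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

In a release job, `action` would wrap whatever pulls the base image, for example a function that runs `docker pull python:3.7-slim` through `subprocess.run(..., check=True)`, so that a rate-limited pull is retried on the maintainers' side rather than during a customer deployment.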

ctd commented

I'll be testing the draft fix today and if there are no complications we may have this released by the end of the day (GMT).