PrefectHQ/prefect

ECS Worker does not scale - Prefect re-registers the same task definition over and over, hitting rate limits


Bug summary

Hi all,

We've been moving to serverless work pools so that each flow runs as an independent ECS Fargate task. This worked well in practice when we tested it with a single flow, so we ported over multiple flows, only for them all to start failing because AWS rate-limits us. What seems to be happening is that each flow run registers a brand-new task definition revision, even when the definition hasn't changed.

This is the error we get:

botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Too many concurrent attempts to create a new revision of the specified family.

And if I run flows, I see things like this:

Retrieving ECS task definition 'arn:aws:ecs:eu-west-2:663985622336:task-definition/telemetry-cloud-raw-to-source:1'...

Registering ECS task definition...

Task definition request{
  "cpu": "16384",
  "family": "telemetry-cloud-raw-to-source",
  "memory": "65536",
  "executionRoleArn": "arn:aws:iam::663985622336:role/prod-ecs-task-execution-role",
  "containerDefinitions": [
    {
      "image": "596302374988.dkr.ecr.eu-west-2.amazonaws.com/nimbus/prefect-flows-datalake:prod",
      "name": "prefect",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/prefect-prod",
          "awslogs-region": "eu-west-2",
          "awslogs-stream-prefix": "prefect-prod"
        }
      }
    }
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc"
}

Using ECS task definition 'arn:aws:ecs:eu-west-2:663985622336:task-definition/telemetry-cloud-raw-to-source:2'...

The next time the flow runs, even with an identical task definition, it retrieves revision 2 and registers revision 3.
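To illustrate the behaviour I'd expect instead, here is a minimal sketch of a check that skips registration when the requested definition already matches the retrieved revision. This is not Prefect's actual comparison logic; `is_subset` and `needs_new_revision` are hypothetical names, and the extra fields on the registered revision (`revision`, `status`, `essential`) are my assumption about the kind of server-populated values that a naive equality check would trip over.

```python
from copy import deepcopy
from typing import Any


def is_subset(requested: Any, registered: Any) -> bool:
    """Return True if every value in `requested` also appears in `registered`.

    Keys that exist only in `registered` (for example, fields the service
    fills in after registration) are ignored. Lists must match in length
    and are compared element by element, in order.
    """
    if isinstance(requested, dict):
        if not isinstance(registered, dict):
            return False
        return all(
            key in registered and is_subset(value, registered[key])
            for key, value in requested.items()
        )
    if isinstance(requested, list):
        if not isinstance(registered, list) or len(requested) != len(registered):
            return False
        return all(is_subset(a, b) for a, b in zip(requested, registered))
    return requested == registered


def needs_new_revision(requested: dict, registered: dict) -> bool:
    """Hypothetical helper: register only if the request actually differs."""
    return not is_subset(requested, registered)


# Trimmed-down request based on the log output above.
requested = {
    "cpu": "16384",
    "family": "telemetry-cloud-raw-to-source",
    "memory": "65536",
    "containerDefinitions": [
        {
            "image": "596302374988.dkr.ecr.eu-west-2.amazonaws.com/nimbus/prefect-flows-datalake:prod",
            "name": "prefect",
        }
    ],
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
}

# Assumed shape of the retrieved revision: same content plus extra fields.
registered = deepcopy(requested)
registered["revision"] = 1
registered["status"] = "ACTIVE"
registered["containerDefinitions"][0]["essential"] = True

# A request that genuinely differs from the registered revision.
changed = deepcopy(requested)
changed["memory"] = "32768"

print(needs_new_revision(requested, registered))  # False
print(needs_new_revision(changed, registered))    # True
```

With a check along these lines, the second run above would reuse revision 1 rather than registering revision 2, so concurrent flow runs would not compete to create revisions of the same family.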

Googling the issue suggests that this has been a problem for a while, with no workarounds posted yet. Prior lit:

Version info

Version:             2.20.10
API version:         0.8.4
Python version:      3.11.8
Git commit:          4fb64ec3
Built:               Wed, Oct 16, 2024 1:24 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.37.2

Additional context

EDIT: After more googling, I believe this issue is a duplicate of #15865, so I'll close this one out.