ecs-appmesh-task-helper

What this is

This repository defines a small event-driven Python app that orchestrates two lifecycle stages of an AWS ECS task in an AppMesh environment: startup and shutdown. In an ECS service with a load balancer attached, ECS ensures during a task definition update that the new tasks are registered with the load balancer and in service before it drains and stops the old tasks. No native equivalent exists for AppMesh, so this task helper was written to bridge that gap.

Operation

When ECS asks a task in an AppMesh mesh to stop, the task needs to notify its downstreams that it is stopping, so that they stop sending traffic before the task goes away. This prevents the downstreams from receiving 5xx or timeout errors, which could impact service performance. The notification is done by signalling the Envoy in the task to go into healthcheck fail mode, which is controlled by an endpoint on the Envoy admin API.

The task helper is also designed to delay the termination of the application and Envoy containers in the task, so that the downstream Envoys have time to complete their upstream health check cycle. When ECS terminates a task, it executes "docker stop" against each container in the task, in the reverse order of the dependencies defined in the task. "docker stop" sends a SIGTERM signal to the process inside the container. After receiving the SIGTERM, the task helper calls the admin API and then pauses for a configurable drain timeout period. Because the task helper DependsOn the Envoy and application containers, their termination is delayed until the helper has exited.

There is a further problem: a race condition that often sees retiring tasks stopped before new tasks are in service during a task definition update. To counter this, the task helper waits for a configurable delay period before calling the admin API, giving new tasks plenty of time to come into service before the retiring tasks are drained and stopped. This ensures an overlap between the new and retiring tasks, so that updates can be non-disruptive.
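The shutdown sequence described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names are hypothetical, the `ENVOY_ADMIN_URL` variable is invented for the example, the Envoy admin interface is assumed to listen on `127.0.0.1:9901`, and the two drain settings are assumed to be in seconds.

```python
import os
import signal
import threading
import time
import urllib.request

# Assumption: Envoy's admin interface is reachable on localhost:9901.
# ENVOY_ADMIN_URL is a hypothetical variable used only in this sketch.
ENVOY_ADMIN_URL = os.environ.get("ENVOY_ADMIN_URL", "http://127.0.0.1:9901")
# Assumption: both values are interpreted as seconds.
DRAIN_DELAY = float(os.environ.get("DRAIN_DELAY", "40"))
DRAIN_TIMEOUT = float(os.environ.get("DRAIN_TIMEOUT", "40"))

stop_requested = threading.Event()


def request_stop(signum, frame):
    """SIGTERM handler: only record that ECS has asked the task to stop."""
    stop_requested.set()


def fail_envoy_healthcheck(admin_url):
    """POST to Envoy's healthcheck/fail admin endpoint; return the HTTP status."""
    req = urllib.request.Request(
        f"{admin_url}/healthcheck/fail", data=b"", method="POST"
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


def drain(delay, timeout, fail_healthcheck, sleep=time.sleep):
    """Wait for new tasks, fail the health check, then wait for downstreams."""
    sleep(delay)        # overlap period so new tasks can come into service
    fail_healthcheck()  # tell downstream Envoys to stop routing to this task
    sleep(timeout)      # let downstream health check cycles complete


def main():
    signal.signal(signal.SIGTERM, request_stop)
    stop_requested.wait()  # idle until "docker stop" delivers SIGTERM
    drain(
        DRAIN_DELAY,
        DRAIN_TIMEOUT,
        lambda: fail_envoy_healthcheck(ENVOY_ADMIN_URL),
    )
```

In a container, `main()` would be called from the entrypoint. Passing `sleep` and `fail_healthcheck` in as parameters keeps the ordering logic in `drain` testable without a running Envoy.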

How to build this

An image containing the application can be generated by running `make`, or more explicitly `make build`, in the project folder.

The makefile will run the following tasks through batect, using Dockerfile.batect as the runner:

- lint: runs `black` to format the code, then `flake8` and `hadolint` to lint the Python source and the Dockerfile.
- test: runs `bandit` security checks, `mypy` type checks and finally `pytest`.
- build: runs poetry to export the requirements.txt required by the root Dockerfile before building locally.

For CI, Jenkins builds directly from the Dockerfile and publishes automatically.

How to use this

The container image should then be included in the container definitions for your ECS task, with DependsOn dependencies defined on your application and envoy containers. The task helper container should have a StopTimeout of 120s so that it is not killed before it finishes draining. Two optional environment variables can be passed to the container to adjust its operation:

  • DRAIN_DELAY: The time period to wait after the SIGTERM but before calling the healthcheck/fail admin endpoint. Defaults to 40
  • DRAIN_TIMEOUT: The time period to wait after calling the healthcheck/fail admin endpoint. Defaults to 40
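As an illustration of the settings above, the helper's entry in the task's container definitions might look like the following sketch. The container names (`task-helper`, `envoy`, `app`), the image reference, the `START` dependency condition and the `helper_container` function are all placeholders chosen for this example. The sanity check assumes the drain values are in seconds and must fit inside the stop timeout, since ECS force-kills a container once its stop timeout expires.

```python
import json

HELPER_STOP_TIMEOUT = 120  # seconds, as recommended above


def helper_container(image, drain_delay=40, drain_timeout=40):
    """Build an ECS container definition for the task helper (illustrative)."""
    # Assumption: both drain values are seconds and must complete before
    # ECS force-kills the helper at the end of its stop timeout.
    if drain_delay + drain_timeout >= HELPER_STOP_TIMEOUT:
        raise ValueError("DRAIN_DELAY + DRAIN_TIMEOUT must be below stopTimeout")
    return {
        "name": "task-helper",  # placeholder name
        "image": image,
        "stopTimeout": HELPER_STOP_TIMEOUT,
        # The helper depends on both containers, so it is stopped first
        # and the other two are only stopped after it exits.
        "dependsOn": [
            {"containerName": "envoy", "condition": "START"},
            {"containerName": "app", "condition": "START"},
        ],
        "environment": [
            {"name": "DRAIN_DELAY", "value": str(drain_delay)},
            {"name": "DRAIN_TIMEOUT", "value": str(drain_timeout)},
        ],
    }


# Placeholder image reference; substitute the image you built or pulled.
print(json.dumps(helper_container("example/ecs-appmesh-task-helper:latest"), indent=2))
```

The resulting JSON object can be merged into the `containerDefinitions` list of a task definition alongside the application and Envoy containers.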

License

This code is open source software licensed under the Apache 2.0 License.