aws-samples/amazon-ecs-firelens-examples

FireLens health check recommendation

bowliang opened this issue · 0 comments

As mentioned in https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/health-check, the TCP Input Health Check is no longer recommended. We also don't want our ECS tasks to depend tightly on CloudWatch, and we want to minimize the chance of losing logs. That leaves the Simple Uptime Health Check as the only option.

However, as you point out, "It is a very shallow health check, Fluent Bit could be completely failing to send logs but as long as its still responsive on the monitoring interface, it will be marked as healthy."

Do you have a recommended way to monitor whether Fluent Bit is actually failing to send logs? We're OK with missing logs sporadically, but we want to be able to detect it and avoid losing logs for an extended period.

Some ideas I have:

  1. Can we use the Simple Uptime Health Check as the container health check command in ECS and, at the same time, run another container that calls the deep health check command against Fluent Bit? If Fluent Bit fails to respond, the additional container would emit a metric.
  2. Can we monitor the size of the logs arriving in CloudWatch and use that to decide whether we're losing logs?
  3. Otherwise, is there another approach you recommend for detecting the case where the Simple Uptime Health Check still passes but Fluent Bit cannot send logs?
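
To make idea 1 concrete, here is a rough sketch of what the additional container could run. It is only an illustration, not something taken from the examples repo. It assumes Fluent Bit's built-in HTTP server is enabled on 127.0.0.1:2020 and that `/api/v1/metrics` returns per-output counters named `errors`, `retries_failed`, and `dropped_records`; those field names should be checked against the Fluent Bit version in use. `fetch_metrics`, `failing_outputs`, `run`, and the `emit` hook are hypothetical names I made up for this sketch.

```python
import json
import time
import urllib.request

# Assumption: HTTP_Server is On in the Fluent Bit [SERVICE] section and the
# sidecar can reach it on this address (same task network namespace).
METRICS_URL = "http://127.0.0.1:2020/api/v1/metrics"

# Assumption: these are the per-output failure counters exposed by the
# metrics endpoint; missing counters are treated as 0.
FAILURE_COUNTERS = ("errors", "retries_failed", "dropped_records")


def fetch_metrics(url=METRICS_URL, timeout=5):
    """Return the parsed JSON metrics document from Fluent Bit."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def failing_outputs(previous, current):
    """Compare two metrics snapshots and report outputs whose failure
    counters grew. Returns {output_name: {counter: increase}}.

    Counters that went down (for example after a Fluent Bit restart)
    are ignored rather than reported.
    """
    failing = {}
    prev_outputs = previous.get("output", {})
    for name, stats in current.get("output", {}).items():
        before = prev_outputs.get(name, {})
        grown = {}
        for counter in FAILURE_COUNTERS:
            delta = stats.get(counter, 0) - before.get(counter, 0)
            if delta > 0:
                grown[counter] = delta
        if grown:
            failing[name] = grown
    return failing


def run(poll_seconds=60, emit=print):
    """Poll forever and hand findings to `emit` (a hypothetical hook that
    could publish a metric or write a structured log line)."""
    previous = None
    while True:
        try:
            current = fetch_metrics()
        except (OSError, ValueError) as exc:
            # Fluent Bit is unreachable or returned something unparseable.
            emit({"fluent_bit_unreachable": str(exc)})
        else:
            if previous is not None:
                for name, grown in failing_outputs(previous, current).items():
                    emit({"output": name, "increase": grown})
            previous = current
        time.sleep(poll_seconds)
```

The sidecar would call `run()` as its entrypoint, with `emit` swapped for whatever publishes the metric. Comparing deltas rather than absolute values means an occasional failed retry shows up as a small blip, while a sustained outage shows up as a counter that keeps growing on every poll, which matches the "sporadic loss is fine, long loss is not" requirement.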