aws/aws-for-fluent-bit

Version 2.32.2.20241003: Cannot allocate memory and gets killed with code 139

Closed this issue · 12 comments

Describe the question/issue

After pulling the latest tag of the Fluent Bit container image, the log router sidecar started getting killed with exit code 139. Since it runs in ECS as an essential container, all the tasks die with it.

Configuration

{
            "name": "log_router",
            "image": "amazon/aws-for-fluent-bit",
            "cpu": 0,
            "portMappings": [],
            "essential": true,
            "environment": [],
            "mountPoints": [],
            "volumesFrom": [],
            "user": "0",
            "systemControls": [],
            "firelensConfiguration": {
                "type": "fluentbit",
                "options": {
                    "enable-ecs-log-metadata": "true"
                }
            }
        }
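
Note that with no tag on the image, ECS resolves it to latest on every task launch, which is how this regression gets picked up automatically. A minimal sketch of the same definition pinned to an explicit release tag instead (the specific tag is illustrative; the thread below settles on 2.32.2.20240820 as a known-good version):

```json
{
    "name": "log_router",
    "image": "amazon/aws-for-fluent-bit:2.32.2.20240820",
    "cpu": 0,
    "essential": true,
    "user": "0",
    "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
            "enable-ecs-log-metadata": "true"
        }
    }
}
```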

Fluent Bit Log Output

Timestamp (UTC-03:00) | Message | Container
-- | -- | --
October 07, 2024 at 19:21 (UTC-3:00) | [2024/10/07 22:21:03] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:49] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:49] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:26] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:26] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [sp] stream processor started | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [fluent bit] version=1.9.10, commit=760956f50c, pid=1 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [cmetrics] version=0.3.7 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [output:null:null.0] worker #0 started | log_router
October 07, 2024 at 19:20 (UTC-3:00) | Fluent Bit v1.9.10 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | * Copyright (C) 2015-2022 The Fluent Bit Authors | log_router
October 07, 2024 at 19:20 (UTC-3:00) | * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd | log_router
October 07, 2024 at 19:20 (UTC-3:00) | * https://fluentbit.io | log_router
October 07, 2024 at 19:20 (UTC-3:00) | AWS for Fluent Bit Container Image Version 2.32.2.20241003 | log_router

Fluent Bit Version Info

Which AWS for Fluent Bit Versions have you tried?*
latest - 2.32.2.20241003

Which versions have you seen the issue in? Are there any versions where you do not see the issue?
2.32.2.20241003

If you are experiencing a bug, please consider upgrading to the newest release: https://github.com/aws/aws-for-fluent-bit/releases

Or, try downgrading to the latest stable version: https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FOR_FLUENT_BIT_STABLE_VERSION

Cluster Details

  • what is the networking setup?
    awsvpc mode
  • do you use App Mesh or a service mesh?
    No
  • do you use VPC endpoints in a network-restricted VPC?
    No
  • Is throttling from the destination part of the problem? Please note that occasional transient network connection errors are often caused by exceeding limits. For example, CW API can block/drop Fluent Bit connections when throttling is triggered.
    No
  • ECS or EKS
    ECS
  • Fargate or EC2
    Fargate
  • Daemon or Sidecar deployment for Fluent Bit
    Sidecar

Application Details

We only push ERROR logs into Datadog

Steps to reproduce issue

When using the latest image tag, the container dies with exit code 139 after a few minutes. When using the stable tag, the container works without problems.

Related Issues

Are there any related/similar aws/aws-for-fluent-bit or fluent/fluent-bit GitHub issues?

No related issues

@juan-rmd we're experiencing the same. Thanks for raising

We are seeing something very similar on our end as of earlier today. We have removed the log router as an essential container and are currently monitoring whether our other tasks remain up.

I experienced the same today and reverted to the version v2.32.2.20240820

bump. Same issue here. Caused a small prod outage

> I experienced the same today and reverted to the version v2.32.2.20240820

We ended up doing the same to resolve this.

Same output, captured from a CloudWatch log group:

```
2024-10-08T09:48:06.219Z Fluent Bit v1.9.10
2024-10-08T09:48:06.219Z * Copyright (C) 2015-2022 The Fluent Bit Authors
2024-10-08T09:48:06.219Z * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
2024-10-08T09:48:06.219Z * https://fluentbit.io
2024-10-08T09:48:06.243Z [2024/10/08 09:48:06] [ info] [fluent bit] version=1.9.10, commit=760956f50c, pid=1
2024-10-08T09:48:06.243Z [2024/10/08 09:48:06] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
2024-10-08T09:48:06.243Z [2024/10/08 09:48:06] [ info] [cmetrics] version=0.3.7
2024-10-08T09:48:06.243Z [2024/10/08 09:48:06] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
2024-10-08T09:48:06.243Z [2024/10/08 09:48:06] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
2024-10-08T09:48:06.244Z [2024/10/08 09:48:06] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
2024-10-08T09:48:06.244Z [2024/10/08 09:48:06] [ info] [output:syslog:syslog.0] setup done for 127.0.0.1:1514 (TLS=off)
2024-10-08T09:48:06.245Z [2024/10/08 09:48:06] [ info] [output:null:null.1] worker #0 started
2024-10-08T09:48:06.264Z [2024/10/08 09:48:06] [ info] [sp] stream processor started
2024-10-08T09:48:07.351Z [2024/10/08 09:48:07] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:07.360Z [2024/10/08 09:48:07] [ warn] [engine] failed to flush chunk '1-1728380887.36789101.flb', retry in 11 seconds: task_id=0, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:07.375Z [2024/10/08 09:48:07] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
2024-10-08T09:48:07.376Z [2024/10/08 09:48:07] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
2024-10-08T09:48:08.355Z [2024/10/08 09:48:08] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:08.356Z [2024/10/08 09:48:08] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
2024-10-08T09:48:08.356Z [2024/10/08 09:48:08] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
2024-10-08T09:48:08.356Z [2024/10/08 09:48:08] [ warn] [engine] failed to flush chunk '1-1728380887.398592519.flb', retry in 9 seconds: task_id=1, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:17.350Z [2024/10/08 09:48:17] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:17.350Z [2024/10/08 09:48:17] [ warn] [engine] failed to flush chunk '1-1728380887.398592519.flb', retry in 16 seconds: task_id=1, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:18.349Z [2024/10/08 09:48:18] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:18.349Z [2024/10/08 09:48:18] [ warn] [engine] failed to flush chunk '1-1728380887.36789101.flb', retry in 18 seconds: task_id=0, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:33.351Z [2024/10/08 09:48:33] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:33.351Z [2024/10/08 09:48:33] [ warn] [engine] failed to flush chunk '1-1728380887.398592519.flb', retry in 20 seconds: task_id=1, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:36.350Z [2024/10/08 09:48:36] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:36.350Z [2024/10/08 09:48:36] [ warn] [engine] failed to flush chunk '1-1728380887.36789101.flb', retry in 29 seconds: task_id=0, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:39.348Z [2024/10/08 09:48:39] [engine] caught signal (SIGSEGV)
```

Same issue here, but deploying via CDK; attempting to revert to an older version now.

Same issue here. Reverting to a prior version.

[2024/10/08 17:46:57] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory

But we have plenty of memory.

Thanks for reporting.

It's taking prod down intermittently, and Nginx throws 502s when the task starts shutting down before the ALB has a chance to mark it unhealthy.

Switching to 2.32.2.20240820 fixed the issue for us.

That's the last time I rely on "latest". We updated our Terraform to support a variable for selecting a version, so we can test in dev before an unknown like this gets pushed to live.
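
For reference, a minimal sketch of what that can look like when the container definitions are rendered from a JSON template via Terraform's templatefile function; the variable and file names here are hypothetical:

```json
[
  {
    "name": "log_router",
    "image": "amazon/aws-for-fluent-bit:${fluentbit_version}",
    "essential": true,
    "firelensConfiguration": {
      "type": "fluentbit",
      "options": { "enable-ecs-log-metadata": "true" }
    }
  }
]
```

Rendering it with something like `templatefile("log_router.json.tpl", { fluentbit_version = var.fluentbit_version })` makes the version bump an explicit, reviewable change rather than an implicit pull of latest.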

Same issue here, thanks for the info. We reverted to 2.32.2.20240820 as well.

Thanks for calling this out; could you let us know if the new latest image (2.32.2.20241008) still has this issue?

Fixed with the 20241008 patch