Version 2.32.2.20241003: Cannot allocate memory, container gets killed with code 139
Closed · 12 comments
Describe the question/issue
After pulling the latest tag of the Fluent Bit container, the log router sidecar started getting killed with exit code 139. It runs in ECS as an essential container, so all the tasks die with it.
Configuration
{
    "name": "log_router",
    "image": "amazon/aws-for-fluent-bit",
    "cpu": 0,
    "portMappings": [],
    "essential": true,
    "environment": [],
    "mountPoints": [],
    "volumesFrom": [],
    "user": "0",
    "systemControls": [],
    "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
            "enable-ecs-log-metadata": "true"
        }
    }
}
Fluent Bit Log Output
Timestamp (UTC-03:00) | Message | Container
-- | -- | --
October 07, 2024 at 19:21 (UTC-3:00) | [2024/10/07 22:21:03] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:49] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:49] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:26] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:26] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [sp] stream processor started | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [fluent bit] version=1.9.10, commit=760956f50c, pid=1 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [cmetrics] version=0.3.7 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | [2024/10/07 22:20:11] [ info] [output:null:null.0] worker #0 started | log_router
October 07, 2024 at 19:20 (UTC-3:00) | Fluent Bit v1.9.10 | log_router
October 07, 2024 at 19:20 (UTC-3:00) | * Copyright (C) 2015-2022 The Fluent Bit Authors | log_router
October 07, 2024 at 19:20 (UTC-3:00) | * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd | log_router
October 07, 2024 at 19:20 (UTC-3:00) | * https://fluentbit.io | log_router
October 07, 2024 at 19:20 (UTC-3:00) | AWS for Fluent Bit Container Image Version 2.32.2.20241003 | log_router
Fluent Bit Version Info
Which AWS for Fluent Bit Versions have you tried?*
latest - 2.32.2.20241003
Which versions have you seen the issue in? Are there any versions where you do not see the issue?
2.32.2.20241003
If you are experiencing a bug, please consider upgrading to the newest release: https://github.com/aws/aws-for-fluent-bit/releases
Or, try downgrading to the latest stable version: https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FOR_FLUENT_BIT_STABLE_VERSION
Cluster Details
- What is the networking setup? awsvpc mode
- Do you use App Mesh or a service mesh? No
- Do you use VPC endpoints in a network restricted VPC? No
- Is throttling from the destination part of the problem? No (Note: occasional transient network connection errors are often caused by exceeding limits; for example, the CW API can block/drop Fluent Bit connections when throttling is triggered.)
- ECS or EKS? ECS
- Fargate or EC2? Fargate
- Daemon or Sidecar deployment for Fluent Bit? Sidecar
Application Details
We only push ERROR logs into Datadog
Steps to reproduce issue
When using the latest image tag, the container dies with exit code 139 after a few minutes; when using the stable tag, the container works without problems.
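As a stopgap until a fixed image lands, the sidecar can be pinned to an explicit version tag instead of relying on the implicit latest (an untagged image reference resolves to latest, which is how the crashing build gets picked up automatically). Below is a minimal sketch of the relevant part of the container definition, assuming 2.32.2.20240820 (the previous release) is a tag you have validated; substitute whatever tag you trust.

```json
{
    "name": "log_router",
    "image": "amazon/aws-for-fluent-bit:2.32.2.20240820",
    "essential": true,
    "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
            "enable-ecs-log-metadata": "true"
        }
    }
}
```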
Related Issues
Are there any related/similar aws/aws-for-fluent-bit or fluent/fluent-bit GitHub issues?
No related issues
We are seeing something very similar on our end as of earlier today. We have removed the log router as an essential container and are monitoring whether our other tasks remain up.
I experienced the same today and reverted to version 2.32.2.20240820.
bump. Same issue here. Caused a small prod outage
> I experienced the same today and reverted to version 2.32.2.20240820.

We ended up doing the same to resolve this.
Same output from a CloudWatch log group (hence the garbled lines):
2024-10-08T09:48:06.219Z
Fluent Bit v1.9.10
2024-10-08T09:48:06.219Z
* Copyright (C) 2015-2022 The Fluent Bit Authors
2024-10-08T09:48:06.219Z
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
2024-10-08T09:48:06.219Z
* https://fluentbit.io
2024-10-08T09:48:06.243Z
[2024/10/08 09:48:06] [ info] [fluent bit] version=1.9.10, commit=760956f50c, pid=1
2024-10-08T09:48:06.243Z
[2024/10/08 09:48:06] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
2024-10-08T09:48:06.243Z
[2024/10/08 09:48:06] [ info] [cmetrics] version=0.3.7
2024-10-08T09:48:06.243Z
[2024/10/08 09:48:06] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
2024-10-08T09:48:06.243Z
[2024/10/08 09:48:06] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
2024-10-08T09:48:06.244Z
[2024/10/08 09:48:06] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
2024-10-08T09:48:06.244Z
[2024/10/08 09:48:06] [ info] [output:syslog:syslog.0] setup done for 127.0.0.1:1514 (TLS=off)
2024-10-08T09:48:06.245Z
[2024/10/08 09:48:06] [ info] [output:null:null.1] worker #0 started
2024-10-08T09:48:06.264Z
[2024/10/08 09:48:06] [ info] [sp] stream processor started
2024-10-08T09:48:07.351Z
[2024/10/08 09:48:07] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:07.360Z
[2024/10/08 09:48:07] [ warn] [engine] failed to flush chunk '1-1728380887.36789101.flb', retry in 11 seconds: task_id=0, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:07.375Z
[2024/10/08 09:48:07] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
[2024/10/08 09:48:07] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
2024-10-08T09:48:07.376Z
[2024/10/08 09:48:07] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
2024-10-08T09:48:08.355Z
[2024/10/08 09:48:08] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:08.356Z
[2024/10/08 09:48:08] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
2024-10-08T09:48:08.356Z
[2024/10/08 09:48:08] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device
2024-10-08T09:48:08.356Z
[2024/10/08 09:48:08] [ warn] [engine] failed to flush chunk '1-1728380887.398592519.flb', retry in 9 seconds: task_id=1, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:17.350Z
[2024/10/08 09:48:17] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:17.350Z
[2024/10/08 09:48:17] [ warn] [engine] failed to flush chunk '1-1728380887.398592519.flb', retry in 16 seconds: task_id=1, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:18.349Z
[2024/10/08 09:48:18] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:18.349Z
[2024/10/08 09:48:18] [ warn] [engine] failed to flush chunk '1-1728380887.36789101.flb', retry in 18 seconds: task_id=0, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:33.351Z
[2024/10/08 09:48:33] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:33.351Z
[2024/10/08 09:48:33] [ warn] [engine] failed to flush chunk '1-1728380887.398592519.flb', retry in 20 seconds: task_id=1, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:36.350Z
[2024/10/08 09:48:36] [error] [output:syslog:syslog.0] no upstream connections available
2024-10-08T09:48:36.350Z
[2024/10/08 09:48:36] [ warn] [engine] failed to flush chunk '1-1728380887.36789101.flb', retry in 29 seconds: task_id=0, input=forward.1 > output=syslog.0 (out_id=0)
2024-10-08T09:48:39.348Z
[2024/10/08 09:48:39] [engine] caught signal (SIGSEGV)
Same issue here, but deploying via CDK; attempting to revert to an older version now.
Same issue here. Reverting to a prior version.
[2024/10/08 17:46:57] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
But we have plenty of memory.
Thanks for reporting.
It's taking prod down intermittently, and Nginx throws 502s when it starts shutting down before the ALB has a chance to mark it unhealthy.
Switching to 2.32.2.20240820 fixed the issue for us.
That's the last time I rely on "latest". I updated our Terraform to support a variable for selecting a version, so we can test in dev before an unknown like this gets pushed to live.
Same issue here, thanks for the info. We reverted to 2.32.2.20240820 as well.
Thanks for calling this out; could you let us know if the new latest image (2.32.2.20241008) still has this issue?
Fixed with the 20241008 patch