Loki Output to Grafana Cloud - Consistent SIGSEGV
conbon opened this issue · 2 comments
Describe the question/issue
I am running several containers in ECS Fargate with the logConfiguration set to "awsfirelens".
I am also running a fluent-bit container (built on the aws-for-fluent-bit image) with an overridden, templated configuration, mainly so that we can set Mem_Buf_Limit.
Logs come through in Grafana Cloud as normal, but as I increase the load I very quickly get a SIGSEGV in the CloudWatch logs for the fluent-bit container, and the whole task exits.
Configuration
Dockerfile:
ARG UPSTREAM_IMAGE_TAG
FROM amazon/aws-for-fluent-bit:${UPSTREAM_IMAGE_TAG}
ADD fluent-bit.conf /fluent-bit/alt/fluent-bit.conf
CMD ["/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/alt/fluent-bit.conf"]
fluent-bit.conf:
[SERVICE]
Grace 5
Flush 5
# This is the required input to receive container stdout & stderr logs
# with FireLens
[INPUT]
Name forward
unix_path /var/run/fluent.sock
# default memory buffer only for logs collected by this input
storage.type memory
# Total Max Memory Usage <= 2 * SUM(Each input Mem_Buf_Limit)
Mem_Buf_Limit ${mem_buf_limit}
[Output]
Name loki
Match *
tls on
tls.verify on
host ${loki_host}
port ${loki_port}
http_user ${loki_user}
http_passwd ${loki_passwd}
labels ${loki_labels}
label_keys $container_name
line_format key_value
remove_keys ecs_cluster, ecs_task_definition, container_id
Partial ECS Task Definition (there are more containers present):
{
"name": "redis",
"image": "redis:6.2.13-alpine",
"repositoryCredentials": {
"credentialsParameter": "xxx"
},
"cpu": 0,
"portMappings": [],
"essential": true,
"command": [
"redis-server",
"--port",
"6379",
"--protected-mode",
"no",
"--tcp-backlog",
"128",
"--loglevel",
"notice",
"--save",
"",
"--maxclients",
"6144",
"--maxmemory",
"256mb"
],
"environment": [],
"mountPoints": [],
"volumesFrom": [],
"linuxParameters": {
"capabilities": {
"add": [],
"drop": []
},
"devices": [],
"initProcessEnabled": true,
"tmpfs": []
},
"readonlyRootFilesystem": false,
"ulimits": [
{
"name": "nofile",
"softLimit": 8192,
"hardLimit": 8192
}
],
"logConfiguration": {
"logDriver": "awsfirelens"
},
"healthCheck": {
"command": [
"CMD-SHELL",
"redis-cli ping | grep -Eq '^PONG\\s*$' || exit 1"
],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 0
}
},
{
"name": "fluent-bit",
"image": "fluent-bit:pr-stable",
"repositoryCredentials": {
"credentialsParameter": "xxx"
},
"cpu": 0,
"memory": 75,
"portMappings": [],
"essential": true,
"environment": [
{
"name": "FLB_LOG_LEVEL",
"value": "debug"
},
{
"name": "mem_buf_limit",
"value": "30MB"
},
{
"name": "loki_host",
"value": "logs-prod-eu-west-0.grafana.net"
},
{
"name": "loki_port",
"value": "443"
},
{
"name": "loki_user",
"value": "xxx"
},
{
"name": "loki_passwd",
"value": "xxx"
},
{
"name": "loki_labels",
"value": "env=dev,network=test"
}
],
"mountPoints": [],
"volumesFrom": [],
"linuxParameters": {
"capabilities": {
"add": [],
"drop": []
},
"devices": [],
"initProcessEnabled": true,
"tmpfs": []
},
"user": "0",
"readonlyRootFilesystem": false,
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-create-group": "true",
"awslogs-group": "/log-router",
"awslogs-region": "eu-west-1",
"awslogs-stream-prefix": "ecs"
}
},
"firelensConfiguration": {
"type": "fluentbit",
"options": {
"enable-ecs-log-metadata": "true",
"config-file-type": "file",
"config-file-value": "/fluent-bit/alt/fluent-bit.conf"
}
}
},
Fluent Bit Log Output
[2023/11/20 16:02:21] [debug] [task] created task=0x7efcd9a41c50 id=2 OK
[2023/11/20 16:02:21] [debug] [task] created task=0x7efcd9a427b0 id=17 OK
[2023/11/20 16:02:21] [engine] caught signal (SIGSEGV)
#0 0x49e372 in atomic_load_p() at lib/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62
#1 0x49e372 in extent_arena_get() at lib/jemalloc-5.2.1/include/jemalloc/internal/extent_inlines.h:51
#2 0x49e372 in je_large_dalloc() at lib/jemalloc-5.2.1/src/large.c:361
#3 0x45f6da in arena_dalloc_large() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:281
#4 0x45f6da in arena_dalloc() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:323
#5 0x45f6da in idalloctm() at lib/jemalloc-5.2.1/include/jemalloc/internal/jemalloc_internal_inlines_c.h:118
#6 0x45f6da in ifree() at lib/jemalloc-5.2.1/src/jemalloc.c:2589
#7 0x45f6da in je_free_default() at lib/jemalloc-5.2.1/src/jemalloc.c:2799
#8 0x4dbd22 in flb_free() at include/fluent-bit/flb_mem.h:120
#9 0x4dd014 in flb_sds_destroy() at src/flb_sds.c:470
#10 0x5b41ea in pack_record() at plugins/out_loki/loki.c:992
#11 0x5b4659 in loki_compose_payload() at plugins/out_loki/loki.c:1140
#12 0x5b4738 in cb_loki_flush() at plugins/out_loki/loki.c:1167
#13 0x4e2ef7 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:522
#14 0xa4fea6 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#15 0xffffffffffffffff in ???() at ???:0
Fluent Bit Version Info
Which AWS for Fluent Bit versions have you tried?
I have tried a whole list of versions:
- 2.23.0
- 2.32.0
- stable
- latest
- more
Cluster Details
- Fargate with out-of-the-box (OOTB) service discovery
- 10 containers per task (including fluent-bit)
Application Details
We are trying to keep the fluent-bit container under a hard Docker memory limit of 100MB and therefore need to configure Mem_Buf_Limit.
We have currently set this to 30MB, based on guidance stating:
Total Max Memory Usage <= 2 * SUM(Each input Mem_Buf_Limit)
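As a quick sanity check of that rule of thumb, here is a minimal Python sketch of the arithmetic for the configuration above. It assumes the single forward input with a 30MB Mem_Buf_Limit and the "memory": 75 setting from the task definition, and it treats both values as plain megabytes; the function name is hypothetical and the bound is only the quoted estimate, not a guarantee.

```python
def max_memory_mb(mem_buf_limits_mb):
    """Estimated upper bound from the quoted rule: 2 * SUM(each input Mem_Buf_Limit)."""
    return 2 * sum(mem_buf_limits_mb)


# One [INPUT] section (forward) with Mem_Buf_Limit set to 30MB.
input_limits_mb = [30]

# "memory": 75 from the fluent-bit container definition (assumed comparable units).
container_memory_mb = 75

bound_mb = max_memory_mb(input_limits_mb)
print(f"estimated max usage: {bound_mb} MB")
print(f"fits in container limit: {bound_mb <= container_memory_mb}")
```

With these numbers the estimate is 60MB, which fits under the configured limit, so the buffer limit alone does not obviously explain the crash.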
We seem to have the same issue on our end: the log router fails when we try to route the data to multiple destinations, one being CloudWatch and the other our OSS Loki implementation.
This behavior is fixed in a more up-to-date version of Fluent Bit, past v2.0.7 I believe, see fluent/fluent-bit@a93117c
The latest version of AWS for Fluent Bit (2.32.0) only includes Fluent Bit @ 1.9.10
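To make the version comparison in the two comments above concrete, here is a small Python sketch. It assumes the fix landed in releases strictly later than v2.0.7, as the comment tentatively suggests; the helper names are hypothetical, and the exact fixed release should be confirmed against the linked commit.

```python
def parse_version(text):
    """Turn a dotted version such as '1.9.10' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Assumption taken from the comment above: fixed in releases after v2.0.7.
LAST_AFFECTED = parse_version("2.0.7")


def likely_includes_fix(fluent_bit_version):
    """True if the given Fluent Bit version is newer than the assumed last affected one."""
    return parse_version(fluent_bit_version) > LAST_AFFECTED


# AWS for Fluent Bit 2.32.0 bundles Fluent Bit 1.9.10, per the comment above.
print(likely_includes_fix("1.9.10"))
```

Comparing tuples of integers avoids the string-comparison trap where "1.9.10" would sort after "1.10.0".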