aws/aws-for-fluent-bit

Loki Output to Grafana Cloud - Consistent SIGSEGV

conbon opened this issue · 2 comments

conbon commented

Describe the question/issue

I am running several containers in ECS Fargate with the logConfiguration logDriver set to "awsfirelens".
I am also running a fluent-bit container (built from the aws-for-fluent-bit image) with an overridden/templated configuration, mainly so that we can set Mem_Buf_Limit.

I can see logs coming through in Grafana Cloud as normal, but as soon as I increase the load, I very quickly get a SIGSEGV in the CloudWatch logs for the fluent-bit container and the whole task exits.

Configuration

Dockerfile:

ARG UPSTREAM_IMAGE_TAG
FROM amazon/aws-for-fluent-bit:${UPSTREAM_IMAGE_TAG}
ADD fluent-bit.conf /fluent-bit/alt/fluent-bit.conf
CMD ["/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/alt/fluent-bit.conf"]

fluent-bit.conf:

[SERVICE]
    Grace 5
    Flush 5

# This is the required input to receive container stdout & stderr logs
# with FireLens
[INPUT]
    Name forward
    unix_path /var/run/fluent.sock
    # default memory buffer only for logs collected by this input
    storage.type memory
    # Total Max Memory Usage <= 2 * SUM(Each input Mem_Buf_Limit)
    Mem_Buf_Limit ${mem_buf_limit}

[Output]
    Name loki
    Match *
    tls         on
    tls.verify  on
    host ${loki_host}
    port ${loki_port}
    http_user ${loki_user}
    http_passwd ${loki_passwd}
    labels ${loki_labels}
    label_keys  $container_name
    line_format key_value
    remove_keys ecs_cluster, ecs_task_definition, container_id

Partial ECS Task Definition (there are more containers present):

{
            "name": "redis",
            "image": "redis:6.2.13-alpine",
            "repositoryCredentials": {
                "credentialsParameter": "xxx"
            },
            "cpu": 0,
            "portMappings": [],
            "essential": true,
            "command": [
                "redis-server",
                "--port",
                "6379",
                "--protected-mode",
                "no",
                "--tcp-backlog",
                "128",
                "--loglevel",
                "notice",
                "--save",
                "",
                "--maxclients",
                "6144",
                "--maxmemory",
                "256mb"
            ],
            "environment": [],
            "mountPoints": [],
            "volumesFrom": [],
            "linuxParameters": {
                "capabilities": {
                    "add": [],
                    "drop": []
                },
                "devices": [],
                "initProcessEnabled": true,
                "tmpfs": []
            },
            "readonlyRootFilesystem": false,
            "ulimits": [
                {
                    "name": "nofile",
                    "softLimit": 8192,
                    "hardLimit": 8192
                }
            ],
            "logConfiguration": {
                "logDriver": "awsfirelens"
            },
            "healthCheck": {
                "command": [
                    "CMD-SHELL",
                    "redis-cli ping | grep -Eq '^PONG\\s*$' || exit 1"
                ],
                "interval": 30,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 0
            }
        },
        {
            "name": "fluent-bit",
            "image": "fluent-bit:pr-stable",
            "repositoryCredentials": {
                "credentialsParameter": "xxx"
            },
            "cpu": 0,
            "memory": 75,
            "portMappings": [],
            "essential": true,
            "environment": [
                {
                    "name": "FLB_LOG_LEVEL",
                    "value": "debug"
                },
                {
                    "name": "mem_buf_limit",
                    "value": "30MB"
                },
                {
                    "name": "loki_host",
                    "value": "logs-prod-eu-west-0.grafana.net"
                },
                {
                    "name": "loki_port",
                    "value": "443"
                },
                {
                    "name": "loki_user",
                    "value": "xxx"
                },
                {
                    "name": "loki_passwd",
                    "value": "xxx"
                },
                {
                    "name": "loki_labels",
                    "value": "env=dev,network=test"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "linuxParameters": {
                "capabilities": {
                    "add": [],
                    "drop": []
                },
                "devices": [],
                "initProcessEnabled": true,
                "tmpfs": []
            },
            "user": "0",
            "readonlyRootFilesystem": false,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/log-router",
                    "awslogs-region": "eu-west-1",
                    "awslogs-stream-prefix": "ecs"
                }
            },
            "firelensConfiguration": {
                "type": "fluentbit",
                "options": {
                    "enable-ecs-log-metadata": "true",
                    "config-file-type": "file",
                    "config-file-value": "/fluent-bit/alt/fluent-bit.conf"
                }
            }
        },

Fluent Bit Log Output

20 November 2023 at 16:02 (UTC)	#13 0x4e2ef7 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:522	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#14 0xa4fea6 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#15 0xffffffffffffffff in ???() at ???:0	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#3 0x45f6da in arena_dalloc_large() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:281	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#4 0x45f6da in arena_dalloc() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:323	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#5 0x45f6da in idalloctm() at lib/jemalloc-5.2.1/include/jemalloc/internal/jemalloc_internal_inlines_c.h:118	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#6 0x45f6da in ifree() at lib/jemalloc-5.2.1/src/jemalloc.c:2589	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#7 0x45f6da in je_free_default() at lib/jemalloc-5.2.1/src/jemalloc.c:2799	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#8 0x4dbd22 in flb_free() at include/fluent-bit/flb_mem.h:120	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#9 0x4dd014 in flb_sds_destroy() at src/flb_sds.c:470	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#10 0x5b41ea in pack_record() at plugins/out_loki/loki.c:992	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#11 0x5b4659 in loki_compose_payload() at plugins/out_loki/loki.c:1140	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#12 0x5b4738 in cb_loki_flush() at plugins/out_loki/loki.c:1167	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#0 0x49e372 in atomic_load_p() at lib/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#1 0x49e372 in extent_arena_get() at lib/jemalloc-5.2.1/include/jemalloc/internal/extent_inlines.h:51	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	#2 0x49e372 in je_large_dalloc() at lib/jemalloc-5.2.1/src/large.c:361	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	[2023/11/20 16:02:21] [debug] [task] created task=0x7efcd9a41c50 id=2 OK	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	[2023/11/20 16:02:21] [debug] [task] created task=0x7efcd9a427b0 id=17 OK	adf33da5fc3c4937979427eff533b933	fluent-bit
20 November 2023 at 16:02 (UTC)	[2023/11/20 16:02:21] [engine] caught signal (SIGSEGV)
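
The frames above are in the order CloudWatch displayed them, not in call order. A minimal sketch for sorting them by frame number, assuming the `#N 0xADDR in func() at file:line` layout seen in this trace (the regular expression and the helper name `sort_frames` are illustrative, not part of Fluent Bit; only a subset of the frames is pasted here, with the timestamp and task ID columns omitted):

```python
import re

# A subset of the frame lines from the trace above, still out of order.
RAW_FRAMES = """\
#13 0x4e2ef7 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:522
#3 0x45f6da in arena_dalloc_large() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:281
#8 0x4dbd22 in flb_free() at include/fluent-bit/flb_mem.h:120
#9 0x4dd014 in flb_sds_destroy() at src/flb_sds.c:470
#10 0x5b41ea in pack_record() at plugins/out_loki/loki.c:992
#0 0x49e372 in atomic_load_p() at lib/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62
#2 0x49e372 in je_large_dalloc() at lib/jemalloc-5.2.1/src/large.c:361
"""

# Assumed layout: "#<n> 0x<addr> in <func>() at <file>:<line>"
FRAME_RE = re.compile(r"#(\d+)\s+(0x[0-9a-f]+)\s+in\s+(\S+?)\(\)\s+at\s+(\S+)")


def sort_frames(text):
    """Return (frame_number, function, location) tuples sorted by frame number."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((int(match.group(1)), match.group(3), match.group(4)))
    return sorted(frames)


ordered = sort_frames(RAW_FRAMES)
for number, func, location in ordered:
    print(f"#{number:<2} {func} at {location}")
```

Read in call order, the crash happens inside jemalloc's free path (`je_large_dalloc`), reached from `flb_sds_destroy()` called by `pack_record()` in `plugins/out_loki/loki.c`.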

Fluent Bit Version Info

Which AWS for Fluent Bit Versions have you tried?

I have tried a whole list of versions:

  • 2.23.0
  • 2.32.0
  • stable
  • latest
  • and more

Cluster Details

  • Fargate with out-of-the-box (OOTB) service discovery
  • 10 containers per task (including fluent-bit)

Application Details

We are attempting to keep the fluent-bit container under a hard Docker memory limit of 100MB and therefore need to configure Mem_Buf_Limit.
We have currently set this to 30MB, based on guidance indicating:
Total Max Memory Usage <= 2 * SUM(Each input Mem_Buf_Limit)
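
As a rough check of that rule of thumb against the values in this task definition, here is a small sketch. It assumes the single forward input shown above, treats MB and MiB as interchangeable for estimation purposes, and uses a hypothetical helper name, `max_buffer_memory_mb`:

```python
def max_buffer_memory_mb(mem_buf_limits_mb):
    """Upper bound from the rule of thumb: 2 * SUM(each input's Mem_Buf_Limit)."""
    return 2 * sum(mem_buf_limits_mb)


# Values taken from the configuration in this issue.
inputs_mb = [30]          # one forward input, mem_buf_limit = 30MB
container_limit_mb = 75   # "memory": 75 on the fluent-bit container

bound = max_buffer_memory_mb(inputs_mb)
headroom = container_limit_mb - bound

print(f"estimated max buffer memory: {bound} MB")
print(f"headroom under container limit: {headroom} MB")
```

With these numbers the estimate is 60 MB, which leaves about 15 MB under the 75 container limit for everything else Fluent Bit allocates.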

We seem to have the same issue on our end: the log router fails when we try to route data to multiple destinations, one being CloudWatch and the other our OSS Loki implementation.

This behavior is fixed in a more recent version of Fluent Bit (after v2.0.7, I believe); see fluent/fluent-bit@a93117c

The latest version of AWS for Fluent Bit (2.32.0) only includes Fluent Bit 1.9.10
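
To make the version gap concrete, here is a minimal sketch that compares the two version strings numerically. The helper `parse_version` is illustrative, and the v2.0.7 threshold is only the approximate one quoted above, not a confirmed release boundary:

```python
def parse_version(text):
    """Turn '1.9.10' or 'v2.0.7' into a tuple of ints for numeric comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


bundled = parse_version("1.9.10")    # Fluent Bit bundled in AWS for Fluent Bit 2.32.0
fix_after = parse_version("v2.0.7")  # approximate threshold from the comment above

if bundled > fix_after:
    print("bundled Fluent Bit is newer than the quoted threshold")
else:
    print("bundled Fluent Bit predates the quoted threshold")
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would rank "1.9.10" below "1.9.2".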