aws/aws-for-fluent-bit

2023 High Impact Issues Notice/Catalogue Ticket

PettitWesley opened this issue · 4 comments

AWS for Fluent Bit Q1 2023 High Impact Issues Notice for Customers

The AWS for Fluent Bit team is aware of several high impact issues, listed under Known Issues below, which the entire team is actively working to fix. We apologize for the inconvenience these have caused. Keep checking the docs and the releases page for updates:
https://github.com/aws/aws-for-fluent-bit/releases

Purpose

This doc is customer messaging for AWS for Fluent Bit users who need to know whether they are impacted by these issues and how to mitigate or resolve them. It will continue to be updated with customer guidance.

This doc answers two questions:

  1. How do I know if I am impacted?
  2. How do I mitigate?

The purpose of this doc is not to troubleshoot or explain how the code caused these issues. Links will be added to other docs explaining that.

Known Issues

  1. CloudWatch Hang: [Resolved in 2.29.0+]
  2. Duplicate Tag Match SIGSEGV Issue: [Resolved in 2.31.2]
  3. Keepalive and Scheduler SIGSEGV: [Resolved in 2.31.3]
  4. S3 SIGSEGV related to asynchronous networking: [Resolved in 2.31.0+]
  5. S3 SIGSEGV with preserve_data_ordering option [under investigation]

Please check the section of this doc for each issue for an up-to-date description and list of known mitigations.

See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.

GitHub Tracking Issue

#542

June 1st - Stable Version Upgraded to 2.31.11

As of June 1st, we have upgraded our stable version to 2.31.11. We now recommend this version or higher for all users.

Our stable version is marked here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FOR_FLUENT_BIT_STABLE_VERSION

Note on Out of Memory or OOMKill

While the CloudWatch Hang Issue did cause out of memory for some customers, this is not always the case. Furthermore, there are many possible causes of out of memory. None of the other issues noted in this doc should cause out of memory.

The most common cause of OOMKill is simply running under high throughput and the solution is to follow our guide here and update your settings: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention

CloudWatch Hang Issue

Fluent Bit can hang/freeze when the cloudwatch_logs output plugin is used, causing log loss. This generally only happens at very high throughput.

Sometimes, this will cause an out of memory or OOMKill.

See: Migrating to cloudwatch go plugin from cloudwatch_logs C plugin at the end of this doc.

Versions Impacted

See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.

All versions prior to 2.29.0.

Resolution

This issue is fixed in 2.29.0 and later.

Please note that the Duplicate Tag Match SIGSEGV Issue, explained in this doc, was introduced in 2.29.0.

Mitigations

  1. Upgrade to 2.29.0+ if you do not have duplicate tag match patterns and will not be impacted by the Duplicate Tag Match SIGSEGV Issue outlined in this doc.
  2. Migrate to the cloudwatch go plugin which is not impacted in any version. See: Migrating to cloudwatch go plugin from cloudwatch_logs C plugin at the end of this doc.

Duplicate Tag Match SIGSEGV Issue

Fluent Bit can crash with SIGSEGV when two outputs match the same log tags. For background on tags and matching, see: https://docs.fluentbit.io/manual/concepts/key-concepts

It only occurs if one of the outputs in the duplicate pair is a cloudwatch_logs output.

For example:

[OUTPUT]
    Name cloudwatch_logs
    Alias Output1
    Match ServiceMetrics
    
[OUTPUT]
    Name {some other output}
    Alias Output2
    Match ServiceMetrics

Or for example:

[OUTPUT]
    Name cloudwatch_logs
    Alias Output1
    Match *
    
[OUTPUT]
    Name {some other output}
    Alias Output2
    Match *

Relevant Background information on FireLens Tags

https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#firelens-tag-and-match-pattern-and-generated-config

Versions Impacted

See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.

This issue affects 2.29.0+ with cloudwatch_logs outputs. All current customer reports involve cloudwatch_logs and 2.29.0+.

If you do not use any cloudwatch_logs outputs, or you use version 2.28.4 or lower, the issue will not occur.

Resolved in: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.2

Resolution

This issue is resolved in 2.31.2: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.2

Mitigations

The following mitigations are recommended, ordered from most to least likely to reduce the frequency of the issue:

  1. Remove duplicate Match patterns. Segment your outputs so that each one matches different tags.
  2. Downgrade to 2.28.4 or lower.
  3. Switch from the cloudwatch_logs output to the older cloudwatch output. See: Migrating to cloudwatch go plugin from cloudwatch_logs C plugin at the end of this doc.
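
As a rough sketch of the first mitigation, the duplicate pair from the first example above could be segmented so that each output matches its own tag. The tag ApplicationLogs is a hypothetical placeholder that does not come from this notice, and the sketch assumes your inputs or filters already emit records under two distinct tags. Adapt the patterns to the tags in your own generated config.

```
# Sketch only: ApplicationLogs is a hypothetical tag used for illustration.
# Each output now matches a different tag, so no two outputs share a Match pattern.
[OUTPUT]
    Name cloudwatch_logs
    Alias Output1
    Match ServiceMetrics

[OUTPUT]
    Name {some other output}
    Alias Output2
    Match ApplicationLogs
```

Keep in mind that a wildcard such as Match * matches every tag, so it overlaps with any other pattern. Segmenting a config that uses it generally means replacing the wildcard with more specific patterns.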

Keepalive and Scheduler SIGSEGV

AWS is aware of an issue in the core networking and scheduler logic of Fluent Bit that causes it to crash with SIGSEGV.

Versions Impacted

See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.

All versions are impacted. The issue is much more likely to occur in 2.29.0+, and all known customer reports are for 2.29.0+.

The only way to be certain that you are impacted by this issue is to see a SIGSEGV with a stack trace like the following:

#4  0x00000000004fd80e in __mk_list_del at ...
#5  0x00000000004fd846 in mk_list_del at ...
#6  0x00000000004fe703 in prepare_destroy_conn at ...
#7  0x00000000004fe786 in prepare_destroy_conn_safe at ...

Resolution

This bug has been resolved in 2.31.3: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.3

Mitigations

Disable net.keepalive in your output configuration. This should prevent the issue from occurring:

[OUTPUT]
    ... other settings ...
    net.keepalive Off

https://docs.fluentbit.io/manual/administration/networking

The AWS team believes that this issue impacts all versions. However, all current customer reports are for 2.29.0+, so downgrading may reduce its frequency. If you decide to downgrade, please read the CloudWatch Hang Issue notice in this doc.

S3 SIGSEGV related to asynchronous networking

This issue affects users of the Fluent Bit S3 output who enabled use_put_object: https://docs.fluentbit.io/manual/pipeline/outputs/s3

[OUTPUT]
    Name s3
    use_put_object On

The issue causes Fluent Bit to crash. It is not known to occur frequently.

Versions Impacted

All versions prior to 2.31.0.

See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.

Resolution

Upgrade to 2.31.0 or later.

S3 SIGSEGV with preserve_data_ordering option

Tracked here: #552

Mitigations

This issue is suspected to have been introduced in 2.31.1. Either downgrade, or turn the feature off:

preserve_data_ordering Off

Alternatively, we have released 2.31.4 and 2.31.5, which revert all recent S3 changes. Those changes seem to have either introduced the issue or made it more frequent.
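
To show where the option sits, here is a minimal sketch of an S3 output with the feature turned off. The Match pattern, bucket name, and region are placeholder assumptions for illustration only. Keep your existing values and change only the preserve_data_ordering line.

```
# Sketch only: Match, bucket, and region values are hypothetical placeholders.
[OUTPUT]
    Name                   s3
    Match                  *
    bucket                 my-bucket
    region                 us-east-1
    # Turn the feature off as a mitigation for the SIGSEGV tracked in #552.
    preserve_data_ordering Off
```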

Migrating to cloudwatch go plugin from cloudwatch_logs C plugin

Please see:

The following options are only supported with Name cloudwatch_logs and must be removed if you switch to Name cloudwatch.

Example migration from cloudwatch_logs to cloudwatch:

[OUTPUT]
    Name                cloudwatch_logs
    Match               MyTag
    log_stream_prefix   my-prefix
    log_group_name      my-group
    auto_create_group   true
    auto_retry_requests true
    net.keepalive       Off
    workers             1

After migration, with auto_retry_requests, net.keepalive, and workers removed:

[OUTPUT]
    Name                cloudwatch
    Match               MyTag
    log_stream_prefix   my-prefix
    log_group_name      my-group
    auto_create_group   true

The following options are only supported with Name cloudwatch and must be removed if you switch to Name cloudwatch_logs.

FAQ: Which version am I using and how do I change which version I am using?

The first log statement printed by AWS for Fluent Bit is always the version used:

AWS for Fluent Bit Container Image Version 2.28.4
Fluent Bit v1.9.9
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

Public Images

Public container images for aws-for-fluent-bit can be found on both:

The above are useful for finding the correct version/tag combination when you need to change versions. Additional information on public images and tags can be found at https://github.com/aws/aws-for-fluent-bit#public-images.

Created upstream issues for two of the underlying causes of the keepalive networking crash:

Issue for duplicate tag: fluent/fluent-bit#6849

Issue for what we think is part of the keepalive issue: fluent/fluent-bit#6838

We now recommend 2.31.11