2023 High Impact Issues Notice/Catalogue Ticket
PettitWesley opened this issue · 4 comments
AWS for Fluent Bit Q1 2023 High Impact Issues Notice for Customers
The AWS for Fluent Bit team is aware of five high impact issues, which the entire team is actively working to fix. We apologize for the inconvenience these have caused. Keep checking the docs for updates.
https://github.com/aws/aws-for-fluent-bit/releases
Purpose
This doc is for customers using AWS for Fluent Bit who need to know whether they are impacted by these issues and how to mitigate or resolve them. It will continue to be updated with customer guidance.
This doc answers two questions:
- How do I know if I am impacted?
- How do I mitigate?
The purpose of this doc is not to troubleshoot or explain how the code caused these issues. Links will be added to other docs explaining that.
Known Issues
- CloudWatch Hang: [Resolved in 2.29.0+]
- Duplicate Tag Match SIGSEGV Issue: [Resolved in 2.31.2]
- Keepalive and Scheduler SIGSEGV: [Resolved in 2.31.3]
- S3 SIGSEGV related to asynchronous networking: [Resolved in 2.31.0+]
- S3 SIGSEGV with preserve_data_ordering option [under investigation]
Please check the specific section in this doc for each issue for an up to date description and list of known mitigations.
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
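As a rough aid for the "How do I know if I am impacted?" question, the version ranges in the list above can be written down as a small lookup. This is only a sketch: the function names are hypothetical, it assumes three-part numeric versions such as 2.31.11, and it treats the CloudWatch Hang as affecting everything before 2.29.0 based on the "Resolved in 2.29.0+" entry. The S3 preserve_data_ordering issue is left out because this doc does not list a resolved version for it.

```python
# Version ranges transcribed from the "Known Issues" list in this notice.
# Each entry: (issue name, first affected version or None, first fixed version).
# None means the notice does not give a lower bound.
KNOWN_ISSUES = [
    ("CloudWatch Hang", None, (2, 29, 0)),
    ("Duplicate Tag Match SIGSEGV", (2, 29, 0), (2, 31, 2)),
    ("Keepalive and Scheduler SIGSEGV", None, (2, 31, 3)),
    ("S3 SIGSEGV related to asynchronous networking", None, (2, 31, 0)),
]


def parse_version(text):
    """Turn '2.31.11' (or 'v2.31.11') into (2, 31, 11) for numeric comparison."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def possible_issues(version_text):
    """Hypothetical helper: list the issues whose version range covers this version."""
    version = parse_version(version_text)
    hits = []
    for name, introduced, fixed in KNOWN_ISSUES:
        if introduced is not None and version < introduced:
            continue  # issue not yet introduced in this version
        if version >= fixed:
            continue  # issue already resolved in this version
        hits.append(name)
    return hits
```

Comparing tuples of integers rather than strings matters here, since "2.31.11" sorts before "2.31.3" as text. A version appearing in the result only means it falls inside the range; whether you actually hit the issue still depends on the plugins and options you use, as described in each section.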
GitHub Tracking Issue
June 1st - Stable Version Upgraded to 2.31.11
As of June 1st, we have upgraded our stable version to 2.31.11. We now recommend this version or higher for all users.
Our stable version is marked here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FOR_FLUENT_BIT_STABLE_VERSION
Note on Out of Memory or OOMKill
While the CloudWatch Hang Issue did cause out of memory for some customers, this is not always the case. Furthermore, there are many possible causes of out of memory. None of the other issues noted in this doc should cause out of memory.
The most common cause of OOMKill is simply running under high throughput and the solution is to follow our guide here and update your settings: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention
CloudWatch Hang Issue
Fluent Bit can hang/freeze when the cloudwatch_logs output plugin is used, causing log loss. This generally only happens at very high throughput.
Sometimes, this will cause an out of memory or OOMKill.
See: Migrating to cloudwatch go plugin from cloudwatch_logs C plugin at the end of this doc.
Versions Impacted
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
All versions prior to 2.29.0 (that is, 2.28.4 and lower).
Resolution
This issue was fixed in 2.29.0+
Please note the Duplicate Tag Match SIGSEGV Issue explained in this doc which was introduced in 2.29.0.
Mitigations
- Upgrade to 2.29.0+ if you do not have duplicate tag match patterns and will not be impacted by the Duplicate Tag Match SIGSEGV Issue outlined in this doc.
- Migrate to the cloudwatch Go plugin, which is not impacted in any version. See: Migrating to cloudwatch go plugin from cloudwatch_logs C plugin at the end of this doc.
Duplicate Tag Match SIGSEGV Issue
There is an issue when two outputs match the same log tags: https://docs.fluentbit.io/manual/concepts/key-concepts
It only occurs if one of the outputs in the duplicate pair is a cloudwatch_logs output.
For example:
[OUTPUT]
    Name cloudwatch_logs
    Alias Output1
    Match ServiceMetrics
[OUTPUT]
    Name {some other output}
    Alias Output2
    Match ServiceMetrics
Or for example:
[OUTPUT]
    Name cloudwatch_logs
    Alias Output1
    Match *
[OUTPUT]
    Name {some other output}
    Alias Output2
    Match *
Relevant Background information on FireLens Tags
Versions Impacted
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
This issue affects 2.29.0+ with cloudwatch_logs outputs. All current customer reports involve cloudwatch_logs and 2.29.0+.
If you do not use any cloudwatch_logs outputs, or you use version 2.28.4 or lower, the issue will not occur.
Resolved in: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.2
Resolution
Upgrade to 2.31.2 or higher: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.2
Mitigations
The following mitigations are recommended, in order of the likelihood that we expect them to reduce the frequency of the issue:
- Remove duplicate Match patterns. Segment outputs so that each one matches different tags.
- Downgrade to 2.28.4 or lower
- Switch away from the cloudwatch_logs output to the older cloudwatch output. See: Migrating to cloudwatch go plugin from cloudwatch_logs C plugin at the end of this doc.
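To check a configuration for the pattern described above, a short script can group [OUTPUT] sections by their Match value and flag groups that include cloudwatch_logs. This is a minimal sketch with hypothetical function names, assuming a classic-mode config held in one string. It only catches identical Match strings, so overlapping wildcards (for example * next to a specific tag), Match_Regex, and files pulled in with @INCLUDE all need a manual look.

```python
from collections import defaultdict


def parse_outputs(config_text):
    """Return one dict of lowercased keys per [OUTPUT] section."""
    outputs = []
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            # A new section begins; only [OUTPUT] sections are collected.
            if line[1:-1].strip().upper() == "OUTPUT":
                current = {}
                outputs.append(current)
            else:
                current = None
            continue
        if current is not None:
            parts = line.split(None, 1)
            current[parts[0].lower()] = parts[1].strip() if len(parts) == 2 else ""
    return outputs


def risky_duplicate_matches(config_text):
    """Hypothetical check: identical Match values shared by two or more
    outputs where at least one of them is cloudwatch_logs."""
    names_by_match = defaultdict(list)
    for output in parse_outputs(config_text):
        if "match" in output:
            names_by_match[output["match"]].append(output.get("name", "").lower())
    return {
        pattern: names
        for pattern, names in names_by_match.items()
        if len(names) > 1 and "cloudwatch_logs" in names
    }
```

An empty result does not prove the config is safe; it only means no two outputs share the exact same Match string with cloudwatch_logs involved.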
Keepalive and Scheduler SIGSEGV
AWS is aware of an issue in the core networking and scheduler logic of Fluent Bit that causes it to crash with SIGSEGV.
Versions Impacted
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
All versions are impacted. The issue is much more likely to occur in 2.29.0+, and all known customer reports are for 2.29.0+.
The only way to know for sure if you are impacted by this issue is if you see a SIGSEGV with a stack trace like the following:
#4 0x00000000004fd80e in __mk_list_del at ...
#5 0x00000000004fd846 in mk_list_del at ...
#6 0x00000000004fe703 in prepare_destroy_conn at ...
#7 0x00000000004fe786 in prepare_destroy_conn_safe at ...
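If you have many crash logs to go through, the frames above can be searched for mechanically. The sketch below is an assumption-laden heuristic, not an official detector: the function names are hypothetical, it expects gdb-style frame lines like the ones shown here, and it only checks that both mk_list_del and prepare_destroy_conn appear somewhere in the trace.

```python
import re

# Function names taken from the sample stack trace in this notice.
SIGNATURE_FRAMES = ("mk_list_del", "prepare_destroy_conn")

# Matches frames such as "#6 0x00000000004fe703 in prepare_destroy_conn at ...".
FRAME = re.compile(r"#\d+\s+0x[0-9a-fA-F]+\s+in\s+(\S+)")


def frame_functions(log_text):
    """Extract the function name from every stack frame found in the text."""
    return FRAME.findall(log_text)


def resembles_keepalive_crash(log_text):
    """Hypothetical heuristic: True when every signature function shows up
    in at least one frame. A match is a hint, not a diagnosis."""
    functions = frame_functions(log_text)
    return all(
        any(signature in function for function in functions)
        for signature in SIGNATURE_FRAMES
    )
```

A True result suggests the trace is worth comparing by hand against the one in this section; a False result may just mean the log uses a different frame format.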
Resolution
This bug has been resolved in 2.31.3: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.3
Mitigations
Disable net.keepalive in your output configuration. This should prevent the issue from occurring:
[OUTPUT]
    ... other settings ...
    net.keepalive Off
https://docs.fluentbit.io/manual/administration/networking
The AWS team believes that this issue impacts all versions. However, all current customer reports are for 2.29.0+, so downgrading may reduce its frequency. If you decide to downgrade, please read the notice in this doc about the CloudWatch Hang.
S3 SIGSEGV related to asynchronous networking
This issue affects users of the Fluent Bit S3 output who enabled use_put_object: https://docs.fluentbit.io/manual/pipeline/outputs/s3
[OUTPUT]
    Name s3
    use_put_object On
The issue causes Fluent Bit to crash. It is not known to occur frequently.
Versions Impacted
All versions prior to 2.31.0.
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
Resolution
Upgrade to 2.31.0 or higher.
S3 SIGSEGV with preserve_data_ordering option
Tracked here: #552
Mitigations
This issue is suspected to have been introduced in 2.31.1. Either downgrade or turn the feature off:
preserve_data_ordering Off
Alternatively, we have released 2.31.4 and 2.31.5, which revert all recent S3 changes. Those changes seem to have either introduced the issue or made it more frequent.
Migrating to cloudwatch go plugin from cloudwatch_logs C plugin
Please see:
- https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#aws-go-plugins-vs-aws-core-c-plugins
- https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit
- https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch
The following options are only supported with Name cloudwatch_logs and must be removed if you switch to Name cloudwatch.
- metric_namespace
- metric_dimensions
- auto_retry_requests
- workers
- The networking settings noted here: https://docs.fluentbit.io/manual/administration/networking
  - net.connect_timeout
  - net.connect_timeout_log_error
  - net.dns.mode
  - net.dns.prefer_ipv4
  - net.dns.resolver
  - net.keepalive
  - net.keepalive_idle_timeout
  - net.keepalive_max_recycle
  - net.source_address
- If you use log group or stream name templating, each plugin has some support for this but the features and config option names are entirely different.
  - https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#templating-log-group-and-stream-names
  - https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch#log-stream-and-group-name-templating-using-record_accessor-syntax
  - With cloudwatch you can put $() template variables in the log_group_name and log_stream_name options. You can then use default_log_group_name and default_log_stream_name as fallback names if templating fails.
    - Only cloudwatch supports direct templating of ECS metadata when you run in ECS: $(ecs_task_id), $(ecs_cluster), or $(ecs_task_arn). With cloudwatch_logs you can only inject values from the log JSONs. If you want to use ECS Metadata in your config with cloudwatch_logs, please see: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/init-metadata
  - With cloudwatch_logs, templates go in the log_group_template or log_stream_template options and use a $var syntax (see doc). Fallback names if templating fails go in the log_group_name, log_stream_name, or log_stream_prefix options.
Example migration from cloudwatch_logs to cloudwatch:
[OUTPUT]
    Name cloudwatch_logs
    Match MyTag
    log_stream_prefix my-prefix
    log_group_name my-group
    auto_create_group true
    auto_retry_requests true
    net.keepalive Off
    workers 1
After migration:
[OUTPUT]
    Name cloudwatch
    Match MyTag
    log_stream_prefix my-prefix
    log_group_name my-group
    auto_create_group true
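If you have many outputs to convert, the migration example above can be approximated with a small script. The following is a sketch only, with a hypothetical function name: it drops the options this doc lists as supported only by cloudwatch_logs and renames the plugin. It deliberately refuses to touch blocks that use log_group_template or log_stream_template, because templating differs between the two plugins and has to be rewritten by hand.

```python
# Options listed in this notice as supported only with Name cloudwatch_logs.
CLOUDWATCH_LOGS_ONLY = {
    "metric_namespace", "metric_dimensions", "auto_retry_requests", "workers",
    "net.connect_timeout", "net.connect_timeout_log_error", "net.dns.mode",
    "net.dns.prefer_ipv4", "net.dns.resolver", "net.keepalive",
    "net.keepalive_idle_timeout", "net.keepalive_max_recycle",
    "net.source_address",
}

# Templating options that have no one-to-one equivalent in the Go plugin.
MANUAL_ONLY = {"log_group_template", "log_stream_template"}


def migrate_output_block(block_text):
    """Hypothetical helper: rewrite one cloudwatch_logs [OUTPUT] block
    so that it targets the cloudwatch Go plugin instead."""
    migrated = []
    for raw in block_text.splitlines():
        parts = raw.split(None, 1)
        key = parts[0].lower() if parts else ""
        if key in MANUAL_ONLY:
            raise ValueError(f"{parts[0]} must be migrated by hand")
        if key in CLOUDWATCH_LOGS_ONLY:
            continue  # not supported by the Go plugin, so the line is dropped
        if key == "name" and len(parts) == 2 and parts[1].strip() == "cloudwatch_logs":
            # Keep the original indentation and only swap the plugin name.
            indent = raw[: len(raw) - len(raw.lstrip())]
            migrated.append(f"{indent}{parts[0]} cloudwatch")
            continue
        migrated.append(raw)
    return "\n".join(migrated)
```

Run on the cloudwatch_logs block from the example, this produces the same block as the "After migration" example. Review the result before deploying it; the script knows nothing beyond the option names listed in this doc.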
The following options are only supported with Name cloudwatch and must be removed if you switch to Name cloudwatch_logs.
- default_log_group_name
- default_log_stream_name
- new_log_group_tags
- credentials_endpoint
- If you use log group or stream name templating, each plugin has some support for this but the features and config option names are entirely different.
  - https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#templating-log-group-and-stream-names
  - https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch#log-stream-and-group-name-templating-using-record_accessor-syntax
  - With cloudwatch you can put $() template variables in the log_group_name and log_stream_name options. You can then use default_log_group_name and default_log_stream_name as fallback names if templating fails.
    - Only cloudwatch supports direct templating of ECS metadata when you run in ECS: $(ecs_task_id), $(ecs_cluster), or $(ecs_task_arn). With cloudwatch_logs you can only inject values from the log JSONs. If you want to use ECS Metadata in your config with cloudwatch_logs, please see: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/init-metadata
  - With cloudwatch_logs, templates go in the log_group_template or log_stream_template options and use a $var syntax (see doc). Fallback names if templating fails go in the log_group_name, log_stream_name, or log_stream_prefix options.
FAQ: Which version am I using and how do I change which version I am using?
The first log statement printed by AWS for Fluent Bit is always the version used:
AWS for Fluent Bit Container Image Version 2.28.4
Fluent Bit v1.9.9
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
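When checking many tasks or pods, the version banner shown above can be pulled out of captured log text with a regular expression. This is a minimal sketch with hypothetical names; it assumes the banner wording is exactly as in the sample, and it hard-codes 2.31.11 because that is the stable version stated in this doc as of June 1st. The AWS_FOR_FLUENT_BIT_STABLE_VERSION file remains the source of truth.

```python
import re

# Stable version stated in this notice as of June 1st; this may be out of date.
STABLE_VERSION = (2, 31, 11)

# Banner format copied from the sample log output in this FAQ.
VERSION_LINE = re.compile(
    r"AWS for Fluent Bit Container Image Version (\d+)\.(\d+)\.(\d+)"
)


def image_version(log_text):
    """Return the image version as a tuple of ints, or None if no banner is found."""
    match = VERSION_LINE.search(log_text)
    return tuple(int(group) for group in match.groups()) if match else None


def older_than_stable(log_text):
    """Hypothetical helper: True when the logged version is below STABLE_VERSION."""
    version = image_version(log_text)
    if version is None:
        raise ValueError("version banner not found in log text")
    return version < STABLE_VERSION
```

For the sample output in this section, image_version gives (2, 28, 4), which compares as older than the stable version.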
Public Images
Public container images for aws-for-fluent-bit can be found on both:
- Public ECR - https://gallery.ecr.aws/aws-observability/aws-for-fluent-bit
- Dockerhub - https://hub.docker.com/r/amazon/aws-for-fluent-bit/tags
Use the above to find the correct version/tag combination when you need to change versions. Additional information related to public images and tags can be found at https://github.com/aws/aws-for-fluent-bit#public-images.
- Our stable version is marked here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FOR_FLUENT_BIT_STABLE_VERSION
- Our release notes are here: https://github.com/aws/aws-for-fluent-bit/releases
Created upstream issues for two of the underlying causes of the keepalive networking crash:
Issue for duplicate tag: fluent/fluent-bit#6849
Issue for what we think is part of the keepalive issue: fluent/fluent-bit#6838
We now recommend 2.31.11