Fluentd fails to write to Elasticsearch with warnings and a call stack in the log after installing the OCP logging operators
lihongbj opened this issue · 1 comment
Describe the bug
Fluentd fails to write to Elasticsearch, emitting warnings and a call stack in its log, after installing the OCP logging operators (OpenShift Elasticsearch Operator and Red Hat OpenShift Logging Operator). This happens both with the default `configmap/collector` and with an added Kong input source.
Environment
- OCP: 4.10.37
- Both logging operators: 5.5.4
- Fluentd: registry.redhat.io/openshift-logging/fluentd-rhel8@sha256:842077788b4434800127d63b4cd5d8cfaa1cfd3ca1dfd8439de30c6e8ebda884
- fluent-plugin-elasticsearch: 5.2.2
- Elasticsearch: 6.1.8
- ClusterLogging instance:
```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"logging.openshift.io/v1","kind":"ClusterLogging","metadata":{"annotations":{},"name":"instance","namespace":"openshift-logging"},"spec":{"collection":{"logs":{"fluentd":{},"type":"fluentd"}},"curation":{"curator":{"schedule":"30 3 * * *"},"type":"curator"},"logStore":{"elasticsearch":{"nodeCount":3,"redundancyPolicy":"SingleRedundancy","resources":{"limits":{"memory":"2Gi"},"requests":{"cpu":"200m","memory":"2Gi"}},"storage":{"size":"20G","storageClassName":"rook-cephfs"}},"retentionPolicy":{"application":{"maxAge":"1d"},"audit":{"maxAge":"7d"},"infra":{"maxAge":"7d"}},"type":"elasticsearch"},"managementState":"Managed","visualization":{"kibana":{"replicas":1},"type":"kibana"}}}
  creationTimestamp: "2022-11-22T13:47:15Z"
  generation: 2
  name: instance
  namespace: openshift-logging
  resourceVersion: "62106345"
  uid: a99791f3-0edf-4f66-a81a-ff476a55b978
spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
  curation:
    curator:
      schedule: 30 3 * * *
    type: curator
  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 200m
          memory: 2Gi
      storage:
        size: 20G
        storageClassName: rook-cephfs
    retentionPolicy:
      application:
        maxAge: 1d
      audit:
        maxAge: 7d
      infra:
        maxAge: 7d
    type: elasticsearch
  managementState: Unmanaged
  visualization:
    kibana:
      replicas: 1
    type: kibana
status:
  collection:
    logs:
      fluentdStatus:
        daemonSet: collector
        nodes:
          collector-5qllx: worker4.mycluster.com
          collector-5vxls: worker2.mycluster.com
          ...
          collector-sphfk: worker3.mycluster.com
        pods:
          failed: []
          notReady: []
          ready:
          - collector-26qt4
          - collector-49k6v
          - ...
          - collector-sphfk
  conditions:
  - lastTransitionTime: "2022-11-22T13:47:33Z"
    status: "False"
    type: CollectorDeadEnd
  curation: {}
  logStore:
    elasticsearchStatus:
    - cluster:
        activePrimaryShards: 11
        activeShards: 22
        initializingShards: 0
        numDataNodes: 3
        numNodes: 3
        pendingTasks: 0
        relocatingShards: 0
        status: green
        unassignedShards: 0
      clusterName: elasticsearch
      nodeConditions:
        elasticsearch-cdm-9e96hery-1: []
        elasticsearch-cdm-9e96hery-2: []
        elasticsearch-cdm-9e96hery-3: []
      nodeCount: 3
      pods:
        client:
          failed: []
          notReady: []
          ready:
          - elasticsearch-cdm-9e96hery-1-555c9fbc65-7pvxs
          - elasticsearch-cdm-9e96hery-2-5bc76fb6d4-bqk2d
          - elasticsearch-cdm-9e96hery-3-8656658955-b4kxf
        data:
          failed: []
          notReady: []
          ready:
          - elasticsearch-cdm-9e96hery-1-555c9fbc65-7pvxs
          - elasticsearch-cdm-9e96hery-2-5bc76fb6d4-bqk2d
          - elasticsearch-cdm-9e96hery-3-8656658955-b4kxf
        master:
          failed: []
          notReady: []
          ready:
          - elasticsearch-cdm-9e96hery-1-555c9fbc65-7pvxs
          - elasticsearch-cdm-9e96hery-2-5bc76fb6d4-bqk2d
          - elasticsearch-cdm-9e96hery-3-8656658955-b4kxf
      shardAllocationEnabled: all
  visualization:
    kibanaStatus:
    - deployment: kibana
      pods:
        failed: []
        notReady: []
        ready:
        - kibana-bdcd6f9c8-lcc2q
      replicaSets:
      - kibana-bdcd6f9c8
      replicas: 1
```
Logs
```
2022-11-28 08:02:13 +0000 [warn]: [retry_default] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=20.04924575588666 slow_flush_log_threshold=20.0 plugin_id="retry_default"
2022-11-28 08:02:32 +0000 [warn]: [retry_default] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=36.76959209097549 slow_flush_log_threshold=20.0 plugin_id="retry_default"
2022-11-28 08:02:42 +0000 [warn]: [retry_default] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=27.348451503785327 slow_flush_log_threshold=20.0 plugin_id="retry_default"
2022-11-28 08:06:16 +0000 [warn]: [retry_default] failed to flush the buffer. retry_times=0 next_retry_time=2022-11-28 08:06:18 +0000 chunk="5ee8353c804dadb68de79a7c7db572e2" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch\", :port=>9200, :scheme=>\"https\"}): [502] "
2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:1139:in `rescue in send_bulk'
2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:1101:in `send_bulk'
2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:879:in `block in write'
2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:878:in `each'
2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:878:in `write'
2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1179:in `try_flush'
2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1500:in `flush_thread_run'
2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:499:in `block (2 levels) in start'
2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.14.6/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
```
Expected behavior
Fluentd succeeds in writing logs to Elasticsearch.
Actual behavior
Fluentd fails to write to Elasticsearch, logging warnings and a call stack.
To Reproduce
Steps to reproduce the behavior:
- Install the Rook Ceph storage class and make `rook-cephfs` the default.
- Install the OpenShift Elasticsearch Operator and the Red Hat OpenShift Logging Operator per https://docs.openshift.com/container-platform/4.10/logging/cluster-logging-deploying.html on OCP 4.10.37.
- Create the ClusterLogging instance with `storageClassName` set to `rook-cephfs`.
- Patch `ClusterLogging/instance`, changing its `spec.managementState` to `Unmanaged` (a patch sketch follows this list).
- Add the API gateway Kong as a Fluentd log input source in `configmap/collector`, after `<system>`:
```
fluent.conf: |-
  ## CLO GENERATED CONFIGURATION ###
  # This file is a copy of the fluentd configuration entrypoint
  # which should normally be supplied in a configmap.
  <system>
    log_level "#{ENV['LOG_LEVEL'] || 'warn'}"
  </system>
  <source>
    @type udp
    tag kong
    format json
  </source>
  <match kong>
    @type elasticsearch
    host elasticsearch.openshift-logging.svc.cluster.local
    port 9200
    verify_es_version_at_startup false
    scheme https
    ssl_version TLSv1_2
    index_name kong-000001
    client_key '/var/run/ocp-collector/secrets/collector/tls.key'
    client_cert '/var/run/ocp-collector/secrets/collector/tls.crt'
    ca_file '/var/run/ocp-collector/secrets/collector/ca-bundle.crt'
  </match>
```
- After API calls have flowed through Kong, the Kong API log entries can be found in the Fluentd log, but the Elasticsearch errors shown in the Logs section above follow.
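For the `managementState` patch step above, a minimal sketch of the merge patch; the file name and the exact `oc patch` invocation are illustrative, not taken from the report:

```yaml
# patch-unmanaged.yaml: merge patch for the ClusterLogging instance.
# Apply with:
#   oc patch clusterlogging instance -n openshift-logging \
#     --type merge --patch-file patch-unmanaged.yaml
spec:
  managementState: Unmanaged
```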
Additional context
None.
Your collector shows it received a "502", which is an indicator that there is an issue on the receiving side. You have configured a 3-node ES cluster with the following:
```yaml
logStore:
  elasticsearch:
    nodeCount: 3
    redundancyPolicy: SingleRedundancy
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 200m
        memory: 2Gi
```
This means you gave ES effectively no CPU with which to process logs. Additionally, each instance is configured with a 1G heap based on these settings. We recommend giving it at least 16G, which is our default, or as much as you can up to 63G. I cannot tell how many collectors are writing to this cluster, but it appears that ES is resource-starved. ES is a resource-intensive application; it cannot be expected to index large volumes of logs with no resources.
I advise you to modify the resources, for example along the lines sketched below.
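A minimal sketch of a more generous resource block, assuming the 16G default mentioned above refers to the container memory (from which the heap is derived, as the 2Gi → 1G heap above suggests); the CPU value is an illustrative assumption:

```yaml
logStore:
  elasticsearch:
    resources:
      limits:
        memory: 16Gi   # the stated default; the heap is derived from this
      requests:
        cpu: "1"       # illustrative assumption; far more than the original 200m
        memory: 16Gi
```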