openshift/cluster-logging-operator

Fluentd fails to write to Elasticsearch with warnings and a call stack in the log after installing the OCP logging operators

lihongbj opened this issue · 1 comment

Describe the bug
Fluentd fails to write to Elasticsearch, emitting warnings and a call stack in its log, after installing the OCP logging operators (OpenShift Elasticsearch Operator and Red Hat OpenShift Logging Operator). This happens both with the default configmap/collector and with the added Kong input source.

Environment

  • OCP: 4.10.37
  • both logging operators: 5.5.4
  • fluentd: registry.redhat.io/openshift-logging/fluentd-rhel8@sha256:842077788b4434800127d63b4cd5d8cfaa1cfd3ca1dfd8439de30c6e8ebda884
  • elasticsearch plugin in fluentd: 5.2.2
  • elasticsearch: 6.1.8
  • ClusterLogging instance:
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"logging.openshift.io/v1","kind":"ClusterLogging","metadata":{"annotations":{},"name":"instance","namespace":"openshift-logging"},"spec":{"collection":{"logs":{"fluentd":{},"type":"fluentd"}},"curation":{"curator":{"schedule":"30 3 * * *"},"type":"curator"},"logStore":{"elasticsearch":{"nodeCount":3,"redundancyPolicy":"SingleRedundancy","resources":{"limits":{"memory":"2Gi"},"requests":{"cpu":"200m","memory":"2Gi"}},"storage":{"size":"20G","storageClassName":"rook-cephfs"}},"retentionPolicy":{"application":{"maxAge":"1d"},"audit":{"maxAge":"7d"},"infra":{"maxAge":"7d"}},"type":"elasticsearch"},"managementState":"Managed","visualization":{"kibana":{"replicas":1},"type":"kibana"}}}
  creationTimestamp: "2022-11-22T13:47:15Z"
  generation: 2
  name: instance
  namespace: openshift-logging
  resourceVersion: "62106345"
  uid: a99791f3-0edf-4f66-a81a-ff476a55b978
spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
  curation:
    curator:
      schedule: 30 3 * * *
    type: curator
  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 200m
          memory: 2Gi
      storage:
        size: 20G
        storageClassName: rook-cephfs
    retentionPolicy:
      application:
        maxAge: 1d
      audit:
        maxAge: 7d
      infra:
        maxAge: 7d
    type: elasticsearch
  managementState: Unmanaged
  visualization:
    kibana:
      replicas: 1
    type: kibana
status:
  collection:
    logs:
      fluentdStatus:
        daemonSet: collector
        nodes:
          collector-5qllx: worker4.mycluster.com
          collector-5vxls: worker2.mycluster.com
          ...
          collector-sphfk: worker3.mycluster.com
        pods:
          failed: []
          notReady: []
          ready:
          - collector-26qt4
          - collector-49k6v
          - ...
          - collector-sphfk
  conditions:
  - lastTransitionTime: "2022-11-22T13:47:33Z"
    status: "False"
    type: CollectorDeadEnd
  curation: {}
  logStore:
    elasticsearchStatus:
    - cluster:
        activePrimaryShards: 11
        activeShards: 22
        initializingShards: 0
        numDataNodes: 3
        numNodes: 3
        pendingTasks: 0
        relocatingShards: 0
        status: green
        unassignedShards: 0
      clusterName: elasticsearch
      nodeConditions:
        elasticsearch-cdm-9e96hery-1: []
        elasticsearch-cdm-9e96hery-2: []
        elasticsearch-cdm-9e96hery-3: []
      nodeCount: 3
      pods:
        client:
          failed: []
          notReady: []
          ready:
          - elasticsearch-cdm-9e96hery-1-555c9fbc65-7pvxs
          - elasticsearch-cdm-9e96hery-2-5bc76fb6d4-bqk2d
          - elasticsearch-cdm-9e96hery-3-8656658955-b4kxf
        data:
          failed: []
          notReady: []
          ready:
          - elasticsearch-cdm-9e96hery-1-555c9fbc65-7pvxs
          - elasticsearch-cdm-9e96hery-2-5bc76fb6d4-bqk2d
          - elasticsearch-cdm-9e96hery-3-8656658955-b4kxf
        master:
          failed: []
          notReady: []
          ready:
          - elasticsearch-cdm-9e96hery-1-555c9fbc65-7pvxs
          - elasticsearch-cdm-9e96hery-2-5bc76fb6d4-bqk2d
          - elasticsearch-cdm-9e96hery-3-8656658955-b4kxf
      shardAllocationEnabled: all
  visualization:
    kibanaStatus:
    - deployment: kibana
      pods:
        failed: []
        notReady: []
        ready:
        - kibana-bdcd6f9c8-lcc2q
      replicaSets:
      - kibana-bdcd6f9c8
      replicas: 1

Logs

2022-11-28 08:02:13 +0000 [warn]: [retry_default] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=20.04924575588666 slow_flush_log_threshold=20.0 plugin_id="retry_default"
2022-11-28 08:02:32 +0000 [warn]: [retry_default] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=36.76959209097549 slow_flush_log_threshold=20.0 plugin_id="retry_default"
2022-11-28 08:02:42 +0000 [warn]: [retry_default] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=27.348451503785327 slow_flush_log_threshold=20.0 plugin_id="retry_default"
2022-11-28 08:06:16 +0000 [warn]: [retry_default] failed to flush the buffer. retry_times=0 next_retry_time=2022-11-28 08:06:18 +0000 chunk="5ee8353c804dadb68de79a7c7db572e2" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch\", :port=>9200, :scheme=>\"https\"}): [502] "
  2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:1139:in `rescue in send_bulk'
  2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:1101:in `send_bulk'
  2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:879:in `block in write'
  2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:878:in `each'
  2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.2.2/lib/fluent/plugin/out_elasticsearch.rb:878:in `write'
  2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1179:in `try_flush'
  2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1500:in `flush_thread_run'
  2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:499:in `block (2 levels) in start'
  2022-11-28 08:06:16 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.14.6/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
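The slow-flush warnings above all come from the same buffer (plugin_id="retry_default") and show flushes exceeding the 20-second slow_flush_log_threshold before the 502 appears. A hypothetical triage helper (not part of fluentd; the function name and structure are my own) to pull the elapsed times out of such log lines:

```python
import re

# Matches fluentd's "buffer flush took longer time than
# slow_flush_log_threshold" warning lines, as seen in the log above.
SLOW_FLUSH = re.compile(
    r"elapsed_time=(?P<elapsed>[\d.]+)\s+"
    r"slow_flush_log_threshold=(?P<threshold>[\d.]+)"
)

def slow_flushes(lines):
    """Yield (elapsed, threshold) for each slow-flush warning line."""
    for line in lines:
        m = SLOW_FLUSH.search(line)
        if m:
            yield float(m.group("elapsed")), float(m.group("threshold"))

if __name__ == "__main__":
    sample = [
        '2022-11-28 08:02:32 +0000 [warn]: [retry_default] buffer flush '
        'took longer time than slow_flush_log_threshold: '
        'elapsed_time=36.76959209097549 slow_flush_log_threshold=20.0 '
        'plugin_id="retry_default"',
    ]
    for elapsed, threshold in slow_flushes(sample):
        print(f"flush exceeded threshold by {elapsed - threshold:.1f}s")
```

If most flushes sit well above the threshold, the bottleneck is almost always the receiving side (Elasticsearch), which matches the 502 diagnosis below.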

Expected behavior
Fluentd succeeds in writing logs to Elasticsearch.

Actual behavior
Fluentd fails to write to Elasticsearch, logging warnings and a call stack.

To Reproduce
Steps to reproduce the behavior:

  1. install the Rook Ceph storage class and make rook-cephfs the default,
  2. install the OpenShift Elasticsearch Operator and the Red Hat OpenShift Logging Operator per https://docs.openshift.com/container-platform/4.10/logging/cluster-logging-deploying.html on OCP 4.10.37,
  3. create a ClusterLogging instance with storageClassName set to rook-cephfs,
  4. patch ClusterLogging/instance, changing its spec.managementState to Unmanaged,
  5. add the API gateway Kong as an additional fluentd log input source in configmap/collector, after <system>:
  fluent.conf: |-
    ## CLO GENERATED CONFIGURATION ###
    # This file is a copy of the fluentd configuration entrypoint
    # which should normally be supplied in a configmap.

    <system>
      log_level "#{ENV['LOG_LEVEL'] || 'warn'}"
    </system>
    <source>
      @type udp
      tag kong
      format json
    </source>
    <match kong>
      @type elasticsearch
      host elasticsearch.openshift-logging.svc.cluster.local
      port 9200
      verify_es_version_at_startup false
      scheme https
      ssl_version TLSv1_2
      index_name kong-000001
      client_key '/var/run/ocp-collector/secrets/collector/tls.key'
      client_cert '/var/run/ocp-collector/secrets/collector/tls.crt'
      ca_file '/var/run/ocp-collector/secrets/collector/ca-bundle.crt'
    </match>
  6. after API calls flow through Kong, the Kong API logs appear in the fluentd log, but the Elasticsearch errors shown in the log above follow.
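One way to exercise the "kong" udp source in step 5 without involving Kong itself is to send a JSON record to it directly. Note the quoted `<source>` block appears to omit `port` and `bind`, which fluentd's in_udp input expects; port 5160 below is an assumption, not taken from the issue, and the helper itself is a hypothetical sketch:

```python
import json
import socket

def send_kong_record(host: str, port: int, record: dict) -> int:
    """Send one JSON-encoded record to a fluentd udp input.

    Returns the number of bytes sent. Port 5160 (used in the demo
    below) is an assumed value; match whatever `port` the <source>
    block actually declares.
    """
    payload = json.dumps(record).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))

if __name__ == "__main__":
    # Fire-and-forget: UDP gives no delivery confirmation, so check the
    # fluentd pod's log to see whether the record was parsed and matched.
    send_kong_record("127.0.0.1", 5160,
                     {"service": "kong", "status": 200, "latency_ms": 12})
```

If the record shows up in the collector's log but the elasticsearch output then fails, the udp input and `format json` parsing are fine and the problem is downstream, consistent with the 502 in the log above.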

Additional context
None.

Your collector shows it received a "502", which indicates an issue on the receiving side. It shows you configured a 3-node ES cluster with the following:

  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 200m
          memory: 2Gi

This means you gave ES almost no CPU (only a 200m request) with which to process logs. Additionally, each instance is configured with 1G of heap based on these settings. We recommend giving it at least 16G, which is our default, or as much as you can up to 63G. I cannot tell how many collectors are writing to this cluster, but it appears to me that ES is resource-starved. ES is a resource-intensive application; it cannot be expected to index large volumes of logs with no resources.

I advise you to modify the resources.
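As an illustration only, the logStore block could be raised along these lines (16Gi matches the default mentioned above; the CPU figure is an assumption, and the instance must be set back to Managed, or the Elasticsearch resources edited directly, for the change to be reconciled):

```yaml
# Illustrative fragment of the ClusterLogging spec with larger ES resources.
logStore:
  elasticsearch:
    nodeCount: 3
    redundancyPolicy: SingleRedundancy
    resources:
      limits:
        memory: 16Gi
      requests:
        cpu: "1"        # assumed value; size to your actual log volume
        memory: 16Gi    # operator derives the ES heap from this
```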