openshift/cluster-logging-operator

"cluster-logging-operator" pod keeps restarting with "fatal error: concurrent map read and map write"


Describe the bug
Hello. We're facing an issue where the "cluster-logging-operator" pod has restarted 100 times in the past 6 months, always with the same error: "fatal error: concurrent map read and map write". Our openshift-logging deployment is configured with a ClusterLogging instance and a ClusterLogForwarder that forwards logs to three Kafka brokers.
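
For context, "fatal error: concurrent map read and map write" is the Go runtime's built-in detection of a plain map being accessed from multiple goroutines without synchronization. Unlike a panic, it is not recoverable: the process aborts, which Kubernetes surfaces as a pod restart. Below is a minimal sketch of the failure mode and the conventional fix (a guarded map); this is illustrative only, not the operator's actual code.

package main

import "sync"

// unsafeCache illustrates the bug class: a plain map shared across
// goroutines. When one goroutine reads while another writes, the Go
// runtime aborts with "fatal error: concurrent map read and map write".
var unsafeCache = map[string]string{}

// safeCache is the conventional fix: guard the map with a sync.RWMutex
// so concurrent readers are allowed but writes are exclusive.
type safeCache struct {
	mu sync.RWMutex
	m  map[string]string
}

func (c *safeCache) Get(k string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[k]
	return v, ok
}

func (c *safeCache) Set(k, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k] = v
}

func main() {
	// Hammer the unguarded map from two goroutines; with enough
	// iterations the runtime reliably aborts with the fatal error
	// from this issue's title.
	done := make(chan struct{})
	go func() {
		for i := 0; i < 1_000_000; i++ {
			unsafeCache["key"] = "value" // concurrent write
		}
		close(done)
	}()
	for i := 0; i < 1_000_000; i++ {
		_ = unsafeCache["key"] // concurrent read
	}
	<-done
}
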

Environment

  • Versions of OpenShift, Cluster Logging and any other relevant components
Client Version: 4.7.3  
Server Version: 4.10.53  
Kubernetes Version: v1.23.12+8a6bfe4
oc get deployment.apps/cluster-logging-operator -o yaml | grep version
operatorframework.io/properties: '{"properties":[{"type":"olm.gvk","value":{"group":"logging.openshift.io","kind":"ClusterLogForwarder","version":"v1"}},{"type":"olm.gvk","value":{"group":"logging.openshift.io","kind":"ClusterLogging","version":"v1"}},{"type":"olm.maxOpenShiftVersion","value":4.12},{"type":"olm.package","value":{"packageName":"cluster-logging","version":"5.5.9"}}]}'
  • ClusterLogging instance
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  annotations:
    clusterlogging.openshift.io/logforwardingtechpreview: enabled
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"logging.openshift.io/v1","kind":"ClusterLogging","metadata":{"annotations":{"clusterlogging.openshift.io/logforwardingtechpreview":"enabled"},"name":"instance","namespace":"openshift-logging"},"spec":{"collection":{"logs":{"fluentd":{},"type":"fluentd"}},"managementState":"Unmanaged"}}
  creationTimestamp: "2021-07-27T14:40:14Z"
  generation: 5
  managedFields:
  - apiVersion: logging.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:clusterlogging.openshift.io/logforwardingtechpreview: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:collection:
          .: {}
          f:logs:
            .: {}
            f:fluentd: {}
            f:type: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-07-27T14:40:53Z"
  - apiVersion: logging.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:collection:
          f:logs:
            f:fluentd:
              f:resources: {}
      f:status:
        .: {}
        f:clusterConditions: {}
        f:collection:
          .: {}
          f:logs:
            .: {}
            f:fluentdStatus:
              .: {}
              f:daemonSet: {}
              f:nodes:
                .: {}
                f:fluentd-2hcrj: {}
                f:fluentd-2kbxm: {}
                f:fluentd-4dg7r: {}
                f:fluentd-4v7qs: {}
                f:fluentd-5jmhk: {}
                f:fluentd-84kkk: {}
                f:fluentd-8dp6m: {}
                f:fluentd-8wncg: {}
                f:fluentd-8wv7k: {}
                f:fluentd-8xrwk: {}
                f:fluentd-47jbr: {}
                f:fluentd-cp8gm: {}
                f:fluentd-f57pt: {}
                f:fluentd-gl8bb: {}
                f:fluentd-gsgm9: {}
                f:fluentd-hmkm9: {}
                f:fluentd-jjjpv: {}
                f:fluentd-lbn4k: {}
                f:fluentd-lkxvh: {}
                f:fluentd-mvq7m: {}
                f:fluentd-n7q9b: {}
                f:fluentd-p7n7x: {}
                f:fluentd-pbjh9: {}
                f:fluentd-rnzn6: {}
                f:fluentd-rrntm: {}
                f:fluentd-s925v: {}
                f:fluentd-t5hsx: {}
                f:fluentd-xg7gq: {}
                f:fluentd-xkhmj: {}
                f:fluentd-xmpht: {}
              f:pods:
                .: {}
                f:failed: {}
                f:notReady: {}
                f:ready: {}
        f:curation: {}
        f:logStore: {}
        f:visualization: {}
    manager: cluster-logging-operator
    operation: Update
    time: "2021-07-27T14:47:12Z"
  - apiVersion: logging.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:managementState: {}
    manager: Mozilla
    operation: Update
    time: "2021-07-27T14:47:37Z"
  name: instance
  namespace: openshift-logging
  resourceVersion: "12835895"
  uid: d29a1c1d-2c74-4e49-928e-62ba89487d84
spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
  managementState: Unmanaged
status:
  collection:
    logs:
      fluentdStatus:
        daemonSet: fluentd
        nodes:
          fluentd-2hcrj: ocp-master-1.internal-url.org
          fluentd-2kbxm: ocp-master-5.internal-url.org
          fluentd-4v7qs: ocp-worker-4.internal-url.org
          fluentd-5jmhk: ocp-worker-2.internal-url.org
          fluentd-47jbr: ocp-worker-12.internal-url.org
          fluentd-4dg7r: ocp-worker-21.internal-url.org
          fluentd-84kkk: ocp-worker-11.internal-url.org
          fluentd-8dp6m: ocp-worker-1.internal-url.org
          fluentd-8wncg: ocp-worker-17.internal-url.org
          fluentd-8wv7k: ocp-worker-16.internal-url.org
          fluentd-8xrwk: ocp-worker-8.internal-url.org
          fluentd-cp8gm: ocp-worker-10.internal-url.org
          fluentd-f57pt: ocp-worker-18.internal-url.org
          fluentd-gl8bb: ocp-worker-23.internal-url.org
          fluentd-gsgm9: ocp-master-4.internal-url.org
          fluentd-hmkm9: ocp-worker-15.internal-url.org
          fluentd-jjjpv: ocp-worker-22.internal-url.org
          fluentd-lbn4k: ocp-master-3.internal-url.org
          fluentd-lkxvh: ocp-worker-19.internal-url.org
          fluentd-mvq7m: ocp-worker-5.internal-url.org
          fluentd-n7q9b: ocp-worker-3.internal-url.org
          fluentd-p7n7x: ocp-worker-25.internal-url.org
          fluentd-pbjh9: ocp-worker-13.internal-url.org
          fluentd-rnzn6: ocp-master-2.internal-url.org
          fluentd-rrntm: ocp-worker-6.internal-url.org
          fluentd-s925v: ocp-worker-24.internal-url.org
          fluentd-t5hsx: ocp-worker-7.internal-url.org
          fluentd-xg7gq: ocp-worker-14.internal-url.org
          fluentd-xkhmj: ocp-worker-20.internal-url.org
          fluentd-xmpht: ocp-worker-9.internal-url.org
        pods:
          failed: []
          notReady: []
          ready:
          - fluentd-2hcrj
          - fluentd-2kbxm
          - fluentd-47jbr
          - fluentd-4dg7r
          - fluentd-4v7qs
          - fluentd-5jmhk
          - fluentd-84kkk
          - fluentd-8dp6m
          - fluentd-8wncg
          - fluentd-8wv7k
          - fluentd-8xrwk
          - fluentd-cp8gm
          - fluentd-f57pt
          - fluentd-gl8bb
          - fluentd-gsgm9
          - fluentd-hmkm9
          - fluentd-jjjpv
          - fluentd-lbn4k
          - fluentd-lkxvh
          - fluentd-mvq7m
          - fluentd-n7q9b
          - fluentd-p7n7x
          - fluentd-pbjh9
          - fluentd-rnzn6
          - fluentd-rrntm
          - fluentd-s925v
          - fluentd-t5hsx
          - fluentd-xg7gq
          - fluentd-xkhmj
          - fluentd-xmpht
  curation: {}
  logStore: {}
  visualization: {}

Logs
cluster-logging-operator.log

Expected behavior
cluster-logging-operator pod does not crash/restart

Actual behavior
The pod crashes and restarts after an unpredictable amount of time.

To Reproduce
Cannot consistently reproduce; the pod crashes seemingly at random after a variable amount of time. A deterministic way to surface this class of bug is sketched below.
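
Even when the crash won't reproduce on demand, data races like this are caught deterministically by Go's race detector (go test -race), which reports the first unsynchronized access instead of waiting for the runtime's fatal-error check to fire under production load. A hypothetical test sketch (the map and goroutines here are illustrative, not the operator's code):

package main

import (
	"sync"
	"testing"
)

// TestConcurrentMapAccess fails under `go test -race` the moment one
// goroutine touches the map while another is writing, without needing
// the production workload or timing that triggers the fatal error.
func TestConcurrentMapAccess(t *testing.T) {
	m := map[string]int{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			m["hits"]++ // write
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = m["hits"] // read
		}
	}()
	wg.Wait()
}
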

Additional context
Happy to provide additional info if necessary. Thank you.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Closing as the operator version is EOL.