openshift/cluster-logging-operator

Bad upgrade to 5.5

davidkarlsen opened this issue · 10 comments

Describe the bug
After the upgrade, the operator logs:

{"_ts":"2022-08-19T11:48:36.79799272Z","_level":"0","_component":"cluster-logging-operator","_message":"starting up...","go_arch":"amd64","go_os":"linux","go_version":"go1.17.12","operator_version":"5.5"}
I0819 11:48:38.911587 1 request.go:665] Waited for 1.042147194s due to client-side throttling, not priority and fairness, request: GET:https://10.201.0.1:443/apis/acme.cert-manager.io/v1beta1?timeout=32s
{"_ts":"2022-08-19T11:48:43.221394033Z","_level":"0","_component":"cluster-logging-operator","_message":"Registering Components."}
{"_ts":"2022-08-19T11:48:43.223196743Z","_level":"0","_component":"cluster-logging-operator","_message":"Starting the Cmd."}
{"_ts":"2022-08-19T11:48:55.758739899Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:48:55.758826923Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterRequest.generateCollectorConfig","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:48:55.758873883Z","_level":"0","_component":"cluster-logging-operator","_message":"Error reconciling clusterlogging instance","_error":{"msg":"unable to create or update collection for \"instance\": Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:48:55.767222602Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:48:56.028545585Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogforwarder-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterlogforwarders.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:08.159704775Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:08.182416585Z","_level":"0","_component":"cluster-logging-operator","_message":"Could not find Secret","Name":"logcollector-token","_error":{"msg":"Secret \"logcollector-token\" not found"}}
{"_ts":"2022-08-19T11:49:20.627172516Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:20.722899424Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:20.757861103Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogforwarder-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterlogforwarders.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:33.018600771Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:33.041510705Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:57.398605318Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:57.70251097Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:09.63328512Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogforwarder-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterlogforwarders.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:10.101315086Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:10.141157176Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:22.491985682Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:22.550220608Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:34.817811937Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:34.848234923Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}

and the collector pods are constantly recreated.
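
For context on the log messages: "Operation cannot be fulfilled ... the object has been modified" is the API server's standard optimistic-concurrency conflict on resourceVersion. The usual way a controller absorbs it when writing status is to re-read the object and retry the write; a minimal client-go sketch of that pattern follows (the names and the dynamic-client usage are illustrative, not the operator's actual code):

// A minimal sketch (not the operator's code) of the standard client-go
// conflict-retry pattern: re-read the object so the write carries the
// latest resourceVersion, and retry on conflict instead of surfacing it.
package example

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/util/retry"
)

var clusterLoggingGVR = schema.GroupVersionResource{
    Group: "logging.openshift.io", Version: "v1", Resource: "clusterloggings",
}

// updateStatus applies mutate() to a freshly read ClusterLogging "instance"
// and writes the status subresource, retrying when the API server reports
// a resourceVersion conflict.
func updateStatus(ctx context.Context, dc dynamic.Interface, mutate func(*unstructured.Unstructured)) error {
    cl := dc.Resource(clusterLoggingGVR).Namespace("openshift-logging")
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        current, err := cl.Get(ctx, "instance", metav1.GetOptions{})
        if err != nil {
            return err
        }
        mutate(current)
        _, err = cl.UpdateStatus(ctx, current, metav1.UpdateOptions{})
        return err
    })
}

An occasional conflict of this kind is expected and harmless on its own; it only matters here because the reconcile loop apparently never settles.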

Environment

  • OCP 4.10, cluster-logging operator upgraded from 5.4 to 5.5
  • ClusterLogging instance [*]

[*]

cat cluster-logging.yaml 
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  logStore:
    type: elasticsearch
    retentionPolicy:
      application:
        maxAge: 14d
      infra:
        maxAge: 14d
      audit:
        maxAge: 14d
    elasticsearch:
      nodeCount: 3
      tolerations:
      - key: logging
        operator: Exists
        effect: NoExecute
      nodeSelector:
        node-role.kubernetes.io/cluster-logging: ''
      storage:
        storageClassName: openebs-lvmpv-ext4
        size: 450Gi
      resources:
        limits:
          memory: 24Gi
        requests:
          cpu: 2
          memory: 16Gi
      proxy: 
        resources:
          limits:
            memory: 256Mi
          requests:
            memory: 256Mi
      redundancyPolicy: SingleRedundancy
  visualization:
    type: kibana
    kibana:
      tolerations:
      - key: "logging"
        operator: "Exists"
        effect: "NoExecute"
      nodeSelector:
        node-role.kubernetes.io/cluster-logging: ''
      replicas: 1
  collection:
    logs:
      type: fluentd
      fluentd:
        tolerations:
        - operator: Exists
        resources:
          limits:
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 1Gi

Logs
See the log excerpt in the description above.

Expected behavior
The operator upgrades cleanly from 5.4 to 5.5, creates the collector DaemonSet once, and keeps it running.

Actual behavior
The operator continuously logs the update-conflict errors shown above, and the collector DaemonSet is deleted and recreated over and over.

To Reproduce
Steps to reproduce the behavior:

  1. Upgrade the cluster-logging operator from 5.4 to 5.5 on OCP 4.10.


collector pods are constantly recreated

This is a known issue but the collector pods eventually settle into a steady-state condition. Is this the behavior you see as well?

What exactly is the problem you are reporting? Is it that you believe the CLO logs are too verbose? Do the log messages subside once the deployment reaches a steady condition?

collector pods are constantly recreated

This is a known issue but the collector pods eventually settle into a steady-state condition. Is this the behavior you see as well?

No - the daemonset gets deleted and created over and over.

What exactly is the problem you are reporting? Is it that you believe the CLO logs are too verbose? Do the log messages subside once the deployment reaches a steady condition?

I am reporting what I reported: "the collector pods are constantly recreated".
I would expect the operator to create a daemonset and keep it running.
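
One way to confirm that the DaemonSet object itself is being deleted and recreated (rather than its pods merely restarting) is to watch it and print each event together with the object's UID, which changes on every recreation. A minimal client-go sketch, assuming the collector DaemonSet is named "collector" in openshift-logging (adjust the name to whatever the cluster actually shows):

// Sketch: watch the collector DaemonSet and print each watch event with the
// object's UID; a UID that changes across ADDED events confirms the object
// is being deleted and recreated rather than updated in place. The name
// "collector" is an assumption; adjust it to what the cluster shows.
package main

import (
    "context"
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    cs := kubernetes.NewForConfigOrDie(cfg)

    w, err := cs.AppsV1().DaemonSets("openshift-logging").Watch(context.Background(),
        metav1.ListOptions{FieldSelector: "metadata.name=collector"})
    if err != nil {
        panic(err)
    }
    for ev := range w.ResultChan() {
        if ds, ok := ev.Object.(*appsv1.DaemonSet); ok {
            fmt.Printf("%-8s uid=%s generation=%d\n", ev.Type, ds.UID, ds.Generation)
        }
    }
}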

Hello,

we have the same issue on two clusters. If one is using the default resource settings for the collector pods, this can bring a cluster down due to the 100m CPU requests and possibly thousands of pods getting re-created all the time.

We had the same problem on two OKD 4.11 clusters. I had to mitigate it by changing the managementState of our ClusterLogging/instance from Managed to Unmanaged. On the first cluster the problem did not resolve itself even after hours, so I made the switch; on the second I switched to Unmanaged after about 15 minutes. Today I tried changing back from Unmanaged to Managed and it resolved after a few "cycles", as https://issues.redhat.com/browse/LOG-2789 suggests.
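
For anyone who needs the same stop-gap: the mitigation amounts to setting spec.managementState to Unmanaged on the ClusterLogging CR (equivalent to a merge patch via oc). A minimal sketch with the dynamic client, assuming the usual instance name and namespace; note that while Unmanaged the operator stops reconciling the logging stack entirely:

// Sketch: flip spec.managementState to Unmanaged on ClusterLogging/instance,
// the stop-gap described above. While Unmanaged, the operator stops
// reconciling the logging stack, so this is a temporary workaround only.
package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    dc := dynamic.NewForConfigOrDie(cfg)

    gvr := schema.GroupVersionResource{
        Group: "logging.openshift.io", Version: "v1", Resource: "clusterloggings",
    }
    patch := []byte(`{"spec":{"managementState":"Unmanaged"}}`)
    if _, err := dc.Resource(gvr).Namespace("openshift-logging").Patch(context.Background(),
        "instance", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
        panic(err)
    }
}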

This seems to be an ongoing problem for us. Last night, during a period in which the etcd and API endpoints were reported as slow, the problem returned. I'm not sure whether the slow responses or the constant churn of objects from collector recreation was the original cause, but both were happening by the time I looked at the problem.

The only known issue we have is https://issues.redhat.com/browse/LOG-2789, where the collectors get created exactly twice upon initial deployment of the operators. I have not seen any other behavior that would lead to this. The operator will try to reconcile state if (see the sketch after this list):

  • collector config changes (e.g. logforwarder is modified)
  • secrets associated with CLF are modified
  • CL resources/tolerations/nodeSelectors change
  • the DS is manually edited in a way the operator does not expect
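
Schematically, that list corresponds to the usual controller-runtime wiring: the controller reconciles the primary CR and owns the collector DaemonSet, so spec changes and unexpected edits to the DaemonSet both enqueue a reconcile, and referenced Secrets would be added with extra watches. A sketch under those assumptions, not the operator's actual builder code (the import path of the CR types is assumed):

// Schematic controller-runtime wiring for the triggers listed above; it is
// an illustration, not the cluster-logging-operator's actual setup, and the
// import path of the CR types is assumed.
package example

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    ctrl "sigs.k8s.io/controller-runtime"

    loggingv1 "github.com/openshift/cluster-logging-operator/apis/logging/v1" // path assumed
)

type reconciler struct{}

// Reconcile is where desired collector state would be compared with the
// cluster and resources created, updated, or deleted to converge.
func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    return ctrl.Result{}, nil
}

// setup registers the triggers: spec changes on the primary CR and any
// unexpected edit to the owned collector DaemonSet both enqueue a reconcile;
// Secrets referenced by the ClusterLogForwarder would be added with an
// extra Watches(...) call.
func setup(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&loggingv1.ClusterLogging{}).
        Owns(&appsv1.DaemonSet{}).
        Complete(&reconciler{})
}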

Our clusters are logging this, which looks to compare the node part of the spec to the current nodes. I don't see any recent changes in the code related to that code path, so perhaps it has been logging for a while and is not related.

Our clusters are logging this, which looks to compare the node part of the spec to the current nodes. I don't see any recent changes in the code related to that code path, so perhaps it has been logging for a while and is not related.

I can only imagine this would be relevant if something is constantly updating the expected node count for Elasticsearch, the elasticsearch-operator is continually bouncing the ES deployment so it is not ready, and the collectors are terminating because they can't establish a connection.

It may be worth investigating who or what is trying to modify the Elasticsearch resource; that resource is owned and managed by the CLO.
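
On "who or what is trying to modify the Elasticsearch resource": one low-effort check is to read metadata.managedFields on the CR, which records the manager name, operation, and timestamp of every client that has written to it. A sketch using the dynamic client; the elasticsearches.logging.openshift.io/v1 GVR and the object name "elasticsearch" are assumptions based on a default openshift-logging deployment:

// Sketch: print the field managers recorded on the Elasticsearch CR to see
// which clients have written to it and when. The GVR and the object name
// "elasticsearch" are assumptions for a default openshift-logging deployment.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    dc := dynamic.NewForConfigOrDie(cfg)

    gvr := schema.GroupVersionResource{
        Group: "logging.openshift.io", Version: "v1", Resource: "elasticsearches",
    }
    es, err := dc.Resource(gvr).Namespace("openshift-logging").Get(context.Background(),
        "elasticsearch", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    for _, mf := range es.GetManagedFields() {
        fmt.Printf("%-35s %-10s %v\n", mf.Manager, mf.Operation, mf.Time)
    }
}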

Could it be either that something in the deep comparison between the desired and current nodes differs but is never actually reconciled by the reconciliation, or that when the API call to read the current status or desired state fails, the operator treats that as a difference and tries to reconcile?
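
To make the first hypothesis concrete: if the desired object is compared against the live object with a plain deep-equality check, any field the API server defaults or normalizes will show up as a difference on every pass, so the reconciler keeps "fixing" something it can never converge on. A toy illustration, which assumes nothing about the operator's real comparison code:

// Toy illustration of a reconcile loop that never converges: the desired
// spec omits a field the API server defaults (terminationGracePeriodSeconds),
// so a naive deep-equality check against the live object reports a
// difference on every pass. It assumes nothing about the operator's real
// comparison code.
package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/equality"
)

func main() {
    grace := int64(30)

    desired := corev1.PodSpec{
        Containers: []corev1.Container{{Name: "collector", Image: "fluentd"}},
        // terminationGracePeriodSeconds deliberately left nil ("don't care")
    }
    live := desired
    live.TerminationGracePeriodSeconds = &grace // defaulted by the API server

    if !equality.Semantic.DeepEqual(desired, live) {
        fmt.Println("spec differs -> update/recreate; it will differ again on the next pass")
    }
}

The usual remedies for that failure mode are to compare only the fields the controller actually owns, or to apply the server-side defaults to the desired object before comparing.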