Bad upgrade to 5.5
davidkarlsen opened this issue · 10 comments
Describe the bug
After the upgrade the operator logs:
{"_ts":"2022-08-19T11:48:36.79799272Z","_level":"0","_component":"cluster-logging-operator","_message":"starting up...","go_arch":"amd64","go_os":"linux","go_version":"go1.17.12","operator_version":"5.5"}
I0819 11:48:38.911587 1 request.go:665] Waited for 1.042147194s due to client-side throttling, not priority and fairness, request: GET:https://10.201.0.1:443/apis/acme.cert-manager.io/v1beta1?timeout=32s
{"_ts":"2022-08-19T11:48:43.221394033Z","_level":"0","_component":"cluster-logging-operator","_message":"Registering Components."}
{"_ts":"2022-08-19T11:48:43.223196743Z","_level":"0","_component":"cluster-logging-operator","_message":"Starting the Cmd."}
{"_ts":"2022-08-19T11:48:55.758739899Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:48:55.758826923Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterRequest.generateCollectorConfig","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:48:55.758873883Z","_level":"0","_component":"cluster-logging-operator","_message":"Error reconciling clusterlogging instance","_error":{"msg":"unable to create or update collection for \"instance\": Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:48:55.767222602Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:48:56.028545585Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogforwarder-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterlogforwarders.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:08.159704775Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:08.182416585Z","_level":"0","_component":"cluster-logging-operator","_message":"Could not find Secret","Name":"logcollector-token","_error":{"msg":"Secret \"logcollector-token\" not found"}}
{"_ts":"2022-08-19T11:49:20.627172516Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:20.722899424Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:20.757861103Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogforwarder-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterlogforwarders.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:33.018600771Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:33.041510705Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:57.398605318Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:49:57.70251097Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:09.63328512Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogforwarder-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterlogforwarders.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:10.101315086Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:10.141157176Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:22.491985682Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:22.550220608Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:34.817811937Z","_level":"0","_component":"cluster-logging-operator","_message":"Error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
{"_ts":"2022-08-19T11:50:34.848234923Z","_level":"0","_component":"cluster-logging-operator","_message":"clusterlogging-controller error updating status","_error":{"msg":"Operation cannot be fulfilled on clusterloggings.logging.openshift.io \"instance\": the object has been modified; please apply your changes to the latest version and try again"}}
and the collector pods are constantly recreated
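The repeated "object has been modified" entries are the API server's optimistic-concurrency check rejecting writes made against a stale resourceVersion. A minimal, hypothetical Python model of that check and of the usual fetch-latest-and-retry pattern (all names here are illustrative, not the operator's actual code):

```python
class Conflict(Exception):
    """Stands in for the API server's 409 Conflict response."""

class FakeStore:
    """Toy model of one object in the API server, with a resourceVersion."""
    def __init__(self):
        self.version = 1
        self.data = {}

    def get(self):
        return self.version, dict(self.data)

    def update(self, based_on_version, data):
        # An update based on an old version is rejected, exactly like the
        # "please apply your changes to the latest version" errors above.
        if based_on_version != self.version:
            raise Conflict("the object has been modified; please apply your "
                           "changes to the latest version and try again")
        self.version += 1
        self.data = data

def retry_on_conflict(store, mutate, attempts=5):
    """Re-read and retry on conflict, in the spirit of client-go's
    retry.RetryOnConflict helper."""
    for _ in range(attempts):
        version, data = store.get()
        try:
            store.update(version, mutate(data))
            return True
        except Conflict:
            continue  # another writer got there first; fetch latest and retry
    return False

store = FakeStore()
stale_version, _ = store.get()
store.update(stale_version, {"owner": "controller-a"})   # racing writer wins
try:
    store.update(stale_version, {"owner": "controller-b"})  # stale -> rejected
    conflicted = False
except Conflict:
    conflicted = True

# Retrying against the latest version succeeds:
ok = retry_on_conflict(store, lambda d: {**d, "status": "ready"})
```

Conflicts like this are normal when two controllers race over status; they only indicate a bug when, as here, they never stop.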
Environment
- OCP 4.10, upgrade cluster logging operator to 5.5 from 5.4
- ClusterLogging instance (see below)
cat cluster-logging.yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  logStore:
    type: elasticsearch
    retentionPolicy:
      application:
        maxAge: 14d
      infra:
        maxAge: 14d
      audit:
        maxAge: 14d
    elasticsearch:
      nodeCount: 3
      tolerations:
        - key: logging
          operator: Exists
          effect: NoExecute
      nodeSelector:
        node-role.kubernetes.io/cluster-logging: ''
      storage:
        storageClassName: openebs-lvmpv-ext4
        size: 450Gi
      resources:
        limits:
          memory: 24Gi
        requests:
          cpu: 2
          memory: 16Gi
      proxy:
        resources:
          limits:
            memory: 256Mi
          requests:
            memory: 256Mi
      redundancyPolicy: SingleRedundancy
  visualization:
    type: kibana
    kibana:
      tolerations:
        - key: "logging"
          operator: "Exists"
          effect: "NoExecute"
      nodeSelector:
        node-role.kubernetes.io/cluster-logging: ''
      replicas: 1
  collection:
    logs:
      type: fluentd
      fluentd:
        tolerations:
          - operator: Exists
        resources:
          limits:
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 1Gi
Logs
See the description above.
Expected behavior
The operator creates the collector daemonset once and keeps it running.
Actual behavior
The collector daemonset is constantly deleted and recreated, and the operator keeps logging the status-update conflicts shown above.
To Reproduce
Steps to reproduce the behavior:
- upgrade from 5.4
Additional context
> collector pods are constantly recreated
This is a known issue but the collector pods eventually settle into a steady-state condition. Is this the behavior you see as well?
What exactly is the problem you are reporting? Is it you believe the CLO logs are too verbose? Do the log messages subside once the deployment reaches a steady condition?
> collector pods are constantly recreated
> This is a known issue but the collector pods eventually settle into a steady-state condition. Is this the behavior you see as well?
No - the daemonset gets deleted and created over and over.
> What exactly is the problem you are reporting? Is it you believe the CLO logs are too verbose? Do the log messages subside once the deployment reaches a steady condition?
I am reporting what I reported "and the collector pods are constantly recreated".
I would expect the operator to create a daemonset and keep it running.
Hello,
we have the same issue on two clusters. If one is using the default resource settings for the collector pods, this can bring a cluster down due to the 100m CPU requests and possibly thousands of pods getting recreated all the time.
We had the same problem on two OKD 4.11 clusters. I had to mitigate the issue by changing our ClusterLogging/instance managementState from Managed to Unmanaged. For the first cluster it did not resolve after hours of attempting to do so. For the second I switched to Unmanaged after about 15 minutes. Today I tried again by changing from Unmanaged to Managed and it resolved after a few "cycles" like https://issues.redhat.com/browse/LOG-2789 suggests.
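For anyone applying the same mitigation, a small sketch (a hypothetical helper, not an official tool) that builds the merge-patch body for flipping managementState; against a live cluster this would be applied with something along the lines of `oc patch`:

```python
import json

# Hypothetical helper: build the merge-patch JSON that flips the
# ClusterLogging managementState, as described in the comment above.
# Against a cluster, the patch would be applied with something like:
#   oc -n openshift-logging patch clusterlogging instance \
#      --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
def management_state_patch(state: str) -> str:
    if state not in ("Managed", "Unmanaged"):
        raise ValueError(f"unexpected managementState: {state}")
    return json.dumps({"spec": {"managementState": state}})

print(management_state_patch("Unmanaged"))
```

Switching back to Managed later is the same patch with the other value.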
It seems that this is an ongoing problem for us. Last night, during a period when the etcd and API endpoints were reported slow, the problem returned. I'm not sure whether the slow responses or the constant churn of objects from collector recreation was the original problem, but both were happening by the time I looked.
This is the only known issue we have https://issues.redhat.com/browse/LOG-2789 where the collectors get created exactly twice upon initial deployment of the operators. I have not seen any other behaviors that would lead to this. The operator will try to reconcile state if:
- collector config changes (e.g. logforwarder is modified)
- secrets associated with CLF are modified
- CL resources/tolerations/nodeSelectors change
- Manual edit of the DS that is unexpected by the operator
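All of the triggers above reduce to comparing a desired object against the current one. A hypothetical sketch (not the operator's actual code) of how such a deep comparison can fail to converge: if the API server defaults a field the controller never sets, the two specs never match and the object is recreated on every pass:

```python
# Illustrative model of a reconcile loop that thrashes because its
# desired-vs-current comparison never converges. Field names are made up.

def apply_server_defaults(spec):
    """The API server fills in defaults the client omitted."""
    out = dict(spec)
    out.setdefault("terminationGracePeriodSeconds", 30)
    return out

def reconcile_once(desired, current, recreate_log):
    if current != desired:                    # naive deep comparison
        recreate_log.append("recreate")
        # Delete and recreate; the server immediately re-applies defaults,
        # so the stored object again differs from the bare desired spec.
        return apply_server_defaults(desired)
    return current

desired = {"image": "fluentd:5.5"}            # controller never sets the default
current = apply_server_defaults(desired)      # what actually lives in the cluster

recreations = []
for _ in range(5):
    current = reconcile_once(desired, current, recreations)

print(len(recreations))  # a recreation on every single pass
```

Real controllers avoid this by comparing only the fields they own (or by hashing the desired spec into an annotation) rather than deep-comparing whole objects.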
Our clusters are logging this, which looks to compare the node part of the spec to the nodes. I don't see any recent changes in that code path, so perhaps it has been logging for a while and is not related.
> Our clusters are logging this. Which looks to compare the node part of the spec to the nodes. I don't see any recent changes in code related to that code path so perhaps it has been logging for a while and is not related.
I can only imagine this would be relevant if something is constantly updating the expected node count for Elasticsearch, the elasticsearch-operator is continually bouncing the ES deployment so it is not ready, and the collectors are terminating because they can't establish a connection.
It may be worth investigating who or what is trying to modify the Elasticsearch resource, as that resource is owned and managed by the CLO.
Could it be that something in the deep comparison between desired and current nodes differs and is not reconciled by the reconciliation? Or that, when the API call to read the current status or desired state fails, the operator considers this a difference and tries to reconcile?
Closing in favor of https://issues.redhat.com/browse/LOG-3049