openshift/cluster-logging-operator

cluster logging operator collector ceases to gather logs in okd 4.11/4.12

nate-duke opened this issue · 11 comments

Describe the bug
The collector doesn't seem to work on OKD 4.x clusters using Fedora CoreOS.

Environment

  • cluster-logging.v5.7.0
  • OKD: 4.11.0-0.okd-2023-01-14-152430
  • OKD: 4.12.0-0.okd-2023-04-16-041331
ClusterLogging / ClusterLogForwarder Objects
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"logging.openshift.io/v1","kind":"ClusterLogging","metadata":{"annotations":{},"name":"instance","namespace":"openshift-logging"},"spec":{"collection":{"logs":{"fluentd":{},"type":"fluentd"}},"managementState":"Managed"}}
  creationTimestamp: "2022-09-08T14:31:37Z"
  generation: 3
  name: instance
  namespace: openshift-logging
  resourceVersion: "419441096"
  uid: 7f714032-c5d0-48b6-bdb4-1032baf57977
spec:
  collection:
    logs:
      fluentd:
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/gitlab-runner
          operator: Exists
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker
          operator: Exists
      type: fluentd
    type: vector
  managementState: Managed
status:
  collection:
    logs:
      fluentdStatus:
        daemonSet: collector
        nodes:
          collector-2k8zm: os-worker-fitz-prod-01-rmbtk
          collector-6mrg9: prod-8d8h6-master-0
          collector-8vzjh: os-worker-fitz-prod-01-df6mq
          collector-582q8: os-infra-prod-01-klkc2
          collector-9727x: os-worker-fitz-prod-01-c7vdg
          collector-b9pvk: os-worker-fitz-prod-01-nngtw
          collector-bd467: os-worker-fitz-prod-01-78jz7
          collector-bkddl: os-worker-fitz-prod-01-gp86j
          collector-bxtvb: os-worker-fitz-prod-01-mq4xx
          collector-g7vf9: os-gitlab-runner-fitz-prod-01-k5wq6
          collector-g456p: os-infra-prod-01-g57g5
          collector-h5rsj: os-worker-fitz-prod-01-pfgnp
          collector-ht62n: os-worker-fitz-prod-01-p9n9m
          collector-jl6ps: os-worker-fitz-prod-01-dzsnk
          collector-jxn4w: prod-8d8h6-master-2
          collector-jzjd6: os-worker-fitz-prod-01-fj8zk
          collector-lspnc: os-worker-fitz-prod-01-ttkkq
          collector-mn7rs: os-worker-fitz-prod-01-8gbgx
          collector-nsp76: os-worker-fitz-prod-01-bqz25
          collector-p7xmh: os-gitlab-runner-fitz-prod-01-cmgmh
          collector-rddll: os-infra-prod-01-ch2mr
          collector-sgbmt: os-worker-fitz-prod-01-wjxn8
          collector-tmht6: prod-8d8h6-master-1
          collector-tqmgg: os-worker-fitz-prod-01-6qfrj
          collector-v7x6g: os-worker-fitz-prod-01-65v8n
          collector-wz44b: os-worker-fitz-prod-01-vt2tk
          collector-xgsgn: os-worker-fitz-prod-01-kfbrh
        pods:
          failed: []
          notReady: []
          ready:
          - collector-2k8zm
          - collector-582q8
          - collector-6mrg9
          - collector-8vzjh
          - collector-9727x
          - collector-b9pvk
          - collector-bd467
          - collector-bkddl
          - collector-bxtvb
          - collector-g456p
          - collector-g7vf9
          - collector-h5rsj
          - collector-ht62n
          - collector-jl6ps
          - collector-jxn4w
          - collector-jzjd6
          - collector-lspnc
          - collector-mn7rs
          - collector-nsp76
          - collector-p7xmh
          - collector-rddll
          - collector-sgbmt
          - collector-tmht6
          - collector-tqmgg
          - collector-v7x6g
          - collector-wz44b
          - collector-xgsgn
  conditions:
  - lastTransitionTime: "2022-09-08T14:31:56Z"
    status: "False"
    type: CollectorDeadEnd
  - lastTransitionTime: "2022-09-08T14:31:44Z"
    message: curator is deprecated in favor of defining retention policy
    reason: ResourceDeprecated
    status: "True"
    type: CuratorRemoved
  curation: {}
  logStore: {}
  visualization: {}
---
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"logging.openshift.io/v1","kind":"ClusterLogForwarder","metadata":{"annotations":{},"name":"instance","namespace":"openshift-logging"},"spec":{"outputs":[{"name":"oit-ssi-fluentd","type":"fluentdForward","url":"tcp://input.oit-ssi-fluentd.svc.cluster.local:24224"}],"pipelines":[{"inputRefs":["application","infrastructure","audit"],"name":"forward-to-remote","outputRefs":["oit-ssi-fluentd"]}]},"status":{}}
  creationTimestamp: "2022-09-08T14:23:23Z"
  generation: 1
  name: instance
  namespace: openshift-logging
  resourceVersion: "419709054"
  uid: 3bdce621-25fa-4fc6-8397-e1e98bb52d0f
spec:
  outputs:
  - name: oit-ssi-fluentd
    type: fluentdForward
    url: tcp://input.oit-ssi-fluentd.svc.cluster.local:24224
  pipelines:
  - inputRefs:
    - application
    - infrastructure
    - audit
    name: forward-to-remote
    outputRefs:
    - oit-ssi-fluentd
status:
  conditions:
  - lastTransitionTime: "2023-05-10T11:55:13Z"
    status: "True"
    type: Ready
  outputs:
    oit-ssi-fluentd:
    - lastTransitionTime: "2023-05-10T11:55:13Z"
      status: "True"
      type: Ready
  pipelines:
    forward-to-remote:
    - lastTransitionTime: "2023-05-10T11:55:13Z"
      status: "True"
      type: Ready

Gist of logs from one collector pod
https://gist.github.com/nate-duke/a0db0224737d56e276731edb7fda6f84

Expected behavior
Logs get forwarded to wherever the ClusterLogForwarder tells them to go.

Actual behavior
No logs are forwarded.

To Reproduce
Steps to reproduce the behavior:

  1. have working collectors forwarding logs
  2. update the cluster-logging operator to 5.7.0 (see the sketch below for checking the resulting state)
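
A quick way to inspect what the upgrade leaves behind (a sketch; object names match the YAML above):

❯ oc -n openshift-logging get csv
❯ oc -n openshift-logging get clusterlogging instance -o yaml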

Please let me know if you need any further information. I have two clusters that are in the exact same state following this update. Their versions are reflected above. Both are using v5.7.0 of this operator and have ceased forwarding logs.

Similar issues/symptoms here: the Elasticsearch proxy instance is crashing due to OOMKill, even after doubling its memory allocation.

I was able to remove the 5.7 install, reinstall from the 5.6-stable channel, and set the update policy to manual. I did have to reinstall my ClusterLogging and ClusterLogForwarder objects, at least to get the Vector configuration to regenerate. I suspect our issue is somewhere in that generated Vector config, but I'm not sure.
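
For reference, pinning the operator back to the 5.6 channel with manual update approval looks roughly like the Subscription below (a sketch; the subscription name, channel, and catalog source are assumptions and may differ on OKD):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  name: cluster-logging
  channel: stable-5.6            # assumed channel name
  installPlanApproval: Manual    # do not auto-apply future updates
  source: redhat-operators       # assumed catalog source; may be a community catalog on OKD
  sourceNamespace: openshift-marketplace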

We don't have any Elasticsearch. Learned that lesson the hard way in OKD 3.11! ;)

  • What is the OS version of the nodes?
  • What is the pull spec of the node image?
  • What is the pull spec of the collector image being used?
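
For reference, one way to gather those details (a sketch; resource and container names match what is shown elsewhere in this issue, and <node-name> is a placeholder):

❯ oc get nodes -o wide
❯ oc debug node/<node-name> -- chroot /host rpm-ostree status
❯ oc -n openshift-logging get ds collector \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="collector")].image}{"\n"}'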

We resolved an issue where the RHEL8-based collectors were unable to read the journal on RHEL9 nodes, but that required a change in the node images. cc @sdodson

The nodes are:
for 4.12:

sh-5.2# cat /etc/redhat-release
Fedora release 37 (Thirty Seven)
sh-5.2# rpm-ostree status
State: idle
Deployments:
* ostree-unverified-registry:quay.io/openshift/okd-content@sha256:53abd41811f53c940ac41dc8b9d5475813d9c96e20ce893c8664ba951044d6a3
                   Digest: sha256:53abd41811f53c940ac41dc8b9d5475813d9c96e20ce893c8664ba951044d6a3
                  Version: 37.20230322.3.0 (2023-04-25T13:19:42Z)

for 4.11:

sh-5.2# cat /etc/redhat-release
Fedora release 36 (Thirty Six)
sh-5.2# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift/okd-content@sha256:bc4fe370cd76415d045b6cc2cf08e5f696ece912661cfe4370910020be9fe0b6
             CustomOrigin: Managed by machine-config-operator
                  Version: 411.36.202301141513-0 (2023-01-14T15:17:08Z)

I no longer have the collector image from 5.7 deployed, as I had to roll back to get logs flowing again. I can probably re-upgrade our dev cluster to see what it says, but I'd imagine it's whatever is in the catalog for the collector pods. They were all fresh as of the operator update overnight.

ETA: I let our dev cluster (OKD 4.12.0-0.okd-2023-04-16-041331 / FCOS 37) go back up to 5.7 and am willing to leave it there for a couple of hours if more diagnostics are needed.

collector image pull spec:

image: registry.redhat.io/openshift-logging/vector-rhel8@sha256:4a2f9ca8c3379a8c68a832b9d6d83329352aa98352f1736f0cb8163b7636fad5

@jcantrill The problem we ran into in OCP 4.13 is that we moved the host OS from RHEL 8.6 to 9.2, and as a result we brought in all of the systemd changes between systemd-239 and systemd-252, which included several changes to the journal's on-disk format and meant that journalctl running inside a UBI8 image could no longer read it. What we did was patch systemd in OCP 4.13 (and push the change upstream) to make lz4 compression possible; together with some other compatibility environment variables, that made RHEL 9.2 journald compatible with UBI8 journalctl.

This seems different, given they're saying the problem occurs when moving from Logging 5.6 to 5.7, unless of course 5.6 somehow used a Fedora-aligned base image while 5.7 uses a UBI8 base image.
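
A quick way to check whether the journalctl shipped in the collector image can read the host journal (a sketch; it assumes the collector pod mounts the host's /var/log, which the operator-managed daemonset normally does):

❯ kubectl -n openshift-logging exec daemonset/collector -c collector -- \
    journalctl -D /var/log/journal -n 5 --no-pager
❯ kubectl -n openshift-logging exec daemonset/collector -c collector -- \
    journalctl --header -D /var/log/journal | grep -i 'compatible flags'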

AFAICT on Logging 5.6:

image: registry.redhat.io/openshift-logging/fluentd-rhel8@sha256:7d60ecaac129cb76277901ff87a6005bf07d160b0d2d73988894a5530bad2dfe

and

❯ kubectl --context=4prod exec -it daemonset/collector -- cat /etc/redhat-release
Defaulted container "collector" out of: collector, logfilesmetricexporter
Red Hat Enterprise Linux release 8.7 (Ootpa)

with Logging 5.7:

image: registry.redhat.io/openshift-logging/vector-rhel8@sha256:4a2f9ca8c3379a8c68a832b9d6d83329352aa98352f1736f0cb8163b7636fad5

and

❯ kubectl --context=4dev exec -it daemonset/collector -- cat /etc/redhat-release
Defaulted container "collector" out of: collector, logfilesmetricexporter
Red Hat Enterprise Linux release 8.7 (Ootpa)

So it seems to me the containers are based on that UBI8 base.

The underlying nodes writing the journals have the following systemd:

okd-4.11 (FCOS 36):

sh-5.2# rpm -qa systemd
systemd-250.9-1.fc36.x86_64

okd-4.12 (FCOS 37):

sh-5.2# rpm -q systemd
systemd-251.13-6.fc37.x86_64

Not sure what OKD could be doing to the nodes' systemd configuration between the two that could interfere with Vector being able to read the files, but I'm happy to post this issue wherever is appropriate.
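
If it helps, the on-disk journal format the nodes are actually producing can be inspected directly on a node; the header lists the compatible/incompatible flags, including the compression in use (a sketch; <node-name> is a placeholder):

❯ oc debug node/<node-name> -- chroot /host journalctl --header | grep -iE 'file path|compatible flags'
❯ oc debug node/<node-name> -- chroot /host cat /etc/systemd/journald.conf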

crikke commented

We had the same issue after the automatic update to 5.7. Checking the ClusterLogging instance, I saw that the collector had switched from collection type Fluentd to Vector.

Reverting to Fluentd resolved the issue.

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
  labels:
    app.kubernetes.io/instance: logging
spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
      ....

@crikke ... you have nailed it.

The update from 5.6.5 to 5.7 modified the ClusterLogging object to CHANGE(!!) spec.collection.type from "fluentd" to "vector". Flipping that back to the original setting makes it work fine on 5.7.
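
For anyone else hitting this, flipping the field back without editing the whole object can be done with a merge patch along these lines (a sketch; same object names as above):

❯ oc -n openshift-logging patch clusterlogging instance --type=merge \
    -p '{"spec":{"collection":{"type":"fluentd"}}}'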

Am I incorrect in thinking that this is undesirable behaviour? I certainly wouldn't expect it.

It seems what the update actually does is ADD spec.collection.type and set it to "vector" if it was previously unspecified. The documentation doesn't include that field, nor do the Release Notes for 5.7. It is gestured at in the Getting Started with Logging 5.7 doc.

It's probably worth noting this somewhere in the Release Notes, or at least failing the update if the collector-preference prerequisite is unset. Or defaulting to what was previously the default?

Safe to close this issue if the maintainers feel this is not a bug/undesirable behaviour.

crikke commented

@nate-duke Yeah, changing default behaviour in a minor release feels a bit sketchy. The maintainers should probably revert to the old default behaviour and mark it as deprecated, then switch to Vector as the default in the next major release.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

fixed by #2002