Cluster Logging Operator collector ceases to gather logs in OKD 4.11/4.12
nate-duke opened this issue · 11 comments
Describe the bug
The collector doesn't seem to work on OKD 4.x clusters using Fedora CoreOS.
Environment
- cluster-logging.v5.7.0
- OKD: 4.11.0-0.okd-2023-01-14-152430
- OKD: 4.12.0-0.okd-2023-04-16-041331
ClusterLogging / ClusterLogForwarder Objects
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"logging.openshift.io/v1","kind":"ClusterLogging","metadata":{"annotations":{},"name":"instance","namespace":"openshift-logging"},"spec":{"collection":{"logs":{"fluentd":{},"type":"fluentd"}},"managementState":"Managed"}}
  creationTimestamp: "2022-09-08T14:31:37Z"
  generation: 3
  name: instance
  namespace: openshift-logging
  resourceVersion: "419441096"
  uid: 7f714032-c5d0-48b6-bdb4-1032baf57977
spec:
  collection:
    logs:
      fluentd:
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/gitlab-runner
          operator: Exists
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists
        - effect: NoSchedule
          key: node-role.kubernetes.io/worker
          operator: Exists
      type: fluentd
    type: vector
  managementState: Managed
status:
  collection:
    logs:
      fluentdStatus:
        daemonSet: collector
        nodes:
          collector-2k8zm: os-worker-fitz-prod-01-rmbtk
          collector-6mrg9: prod-8d8h6-master-0
          collector-8vzjh: os-worker-fitz-prod-01-df6mq
          collector-582q8: os-infra-prod-01-klkc2
          collector-9727x: os-worker-fitz-prod-01-c7vdg
          collector-b9pvk: os-worker-fitz-prod-01-nngtw
          collector-bd467: os-worker-fitz-prod-01-78jz7
          collector-bkddl: os-worker-fitz-prod-01-gp86j
          collector-bxtvb: os-worker-fitz-prod-01-mq4xx
          collector-g7vf9: os-gitlab-runner-fitz-prod-01-k5wq6
          collector-g456p: os-infra-prod-01-g57g5
          collector-h5rsj: os-worker-fitz-prod-01-pfgnp
          collector-ht62n: os-worker-fitz-prod-01-p9n9m
          collector-jl6ps: os-worker-fitz-prod-01-dzsnk
          collector-jxn4w: prod-8d8h6-master-2
          collector-jzjd6: os-worker-fitz-prod-01-fj8zk
          collector-lspnc: os-worker-fitz-prod-01-ttkkq
          collector-mn7rs: os-worker-fitz-prod-01-8gbgx
          collector-nsp76: os-worker-fitz-prod-01-bqz25
          collector-p7xmh: os-gitlab-runner-fitz-prod-01-cmgmh
          collector-rddll: os-infra-prod-01-ch2mr
          collector-sgbmt: os-worker-fitz-prod-01-wjxn8
          collector-tmht6: prod-8d8h6-master-1
          collector-tqmgg: os-worker-fitz-prod-01-6qfrj
          collector-v7x6g: os-worker-fitz-prod-01-65v8n
          collector-wz44b: os-worker-fitz-prod-01-vt2tk
          collector-xgsgn: os-worker-fitz-prod-01-kfbrh
        pods:
          failed: []
          notReady: []
          ready:
          - collector-2k8zm
          - collector-582q8
          - collector-6mrg9
          - collector-8vzjh
          - collector-9727x
          - collector-b9pvk
          - collector-bd467
          - collector-bkddl
          - collector-bxtvb
          - collector-g456p
          - collector-g7vf9
          - collector-h5rsj
          - collector-ht62n
          - collector-jl6ps
          - collector-jxn4w
          - collector-jzjd6
          - collector-lspnc
          - collector-mn7rs
          - collector-nsp76
          - collector-p7xmh
          - collector-rddll
          - collector-sgbmt
          - collector-tmht6
          - collector-tqmgg
          - collector-v7x6g
          - collector-wz44b
          - collector-xgsgn
  conditions:
  - lastTransitionTime: "2022-09-08T14:31:56Z"
    status: "False"
    type: CollectorDeadEnd
  - lastTransitionTime: "2022-09-08T14:31:44Z"
    message: curator is deprecated in favor of defining retention policy
    reason: ResourceDeprecated
    status: "True"
    type: CuratorRemoved
  curation: {}
  logStore: {}
  visualization: {}
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"logging.openshift.io/v1","kind":"ClusterLogForwarder","metadata":{"annotations":{},"name":"instance","namespace":"openshift-logging"},"spec":{"outputs":[{"name":"oit-ssi-fluentd","type":"fluentdForward","url":"tcp://input.oit-ssi-fluentd.svc.cluster.local:24224"}],"pipelines":[{"inputRefs":["application","infrastructure","audit"],"name":"forward-to-remote","outputRefs":["oit-ssi-fluentd"]}]},"status":{}}
  creationTimestamp: "2022-09-08T14:23:23Z"
  generation: 1
  name: instance
  namespace: openshift-logging
  resourceVersion: "419709054"
  uid: 3bdce621-25fa-4fc6-8397-e1e98bb52d0f
spec:
  outputs:
  - name: oit-ssi-fluentd
    type: fluentdForward
    url: tcp://input.oit-ssi-fluentd.svc.cluster.local:24224
  pipelines:
  - inputRefs:
    - application
    - infrastructure
    - audit
    name: forward-to-remote
    outputRefs:
    - oit-ssi-fluentd
status:
  conditions:
  - lastTransitionTime: "2023-05-10T11:55:13Z"
    status: "True"
    type: Ready
  outputs:
    oit-ssi-fluentd:
    - lastTransitionTime: "2023-05-10T11:55:13Z"
      status: "True"
      type: Ready
  pipelines:
    forward-to-remote:
    - lastTransitionTime: "2023-05-10T11:55:13Z"
      status: "True"
      type: Ready
Gist of logs from one collector pod
https://gist.github.com/nate-duke/a0db0224737d56e276731edb7fda6f84
Expected behavior
Logs are forwarded to wherever the ClusterLogForwarder directs them.
Actual behavior
No logs are forwarded.
To Reproduce
Steps to reproduce the behavior:
- have working collectors forwarding logs
- update the Cluster Logging Operator to 5.7.0
Please let me know if you need any further information. I have two clusters that are in the exact same state following this update. Their versions are reflected above. Both are using v5.7.0 of this operator and have ceased forwarding logs.
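Since the forwarder status above reports Ready, one thing worth ruling out separately is plain TCP reachability of the fluentdForward output. A minimal sketch (hostname and port are taken from the output URL above; it only resolves when run from inside the cluster, e.g. from a debug pod):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Endpoint from the ClusterLogForwarder output URL; only resolvable in-cluster.
print(can_connect("input.oit-ssi-fluentd.svc.cluster.local", 24224))
```

If this returns False from inside the cluster, the problem is network-level rather than in the collector's generated config.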
Similar issue, similar symptoms: the Elasticsearch proxy instance is crashing due to OOMKill, even after doubling the memory requirement.
I was able to remove the 5.7 install, reinstall from the stable-5.6 channel, and set the update policy to manual. I did have to reinstall my ClusterLogging and ClusterLogForwarder objects, at least to get the vector configuration to regenerate. I suspect that our issue is somewhere in that generated vector config, but I'm not sure.
We don't have any Elasticsearch. Learned that lesson the hard way in okd 3.11! ;)
- What is the OS version of the nodes?
- What is the pull spec of the node image?
- What is the pull spec of the collector image being used?
We resolved an issue where the RHEL8 based collectors were unable to read journal on RHEL9 nodes but this required a change in the node images. cc @sdodson
The nodes are:
for 4.12:
sh-5.2# cat /etc/redhat-release
Fedora release 37 (Thirty Seven)
sh-5.2# rpm-ostree status
State: idle
Deployments:
* ostree-unverified-registry:quay.io/openshift/okd-content@sha256:53abd41811f53c940ac41dc8b9d5475813d9c96e20ce893c8664ba951044d6a3
Digest: sha256:53abd41811f53c940ac41dc8b9d5475813d9c96e20ce893c8664ba951044d6a3
Version: 37.20230322.3.0 (2023-04-25T13:19:42Z)
for 4.11:
sh-5.2# cat /etc/redhat-release
Fedora release 36 (Thirty Six)
sh-5.2# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift/okd-content@sha256:bc4fe370cd76415d045b6cc2cf08e5f696ece912661cfe4370910020be9fe0b6
CustomOrigin: Managed by machine-config-operator
Version: 411.36.202301141513-0 (2023-01-14T15:17:08Z)
I don't have the collector image from 5.7 deployed any longer, as I had to roll back to get logs flowing again. I can probably re-upgrade our dev cluster to see what it says, but I'd imagine it's whatever is in the catalog for the collector pods. They were all fresh as of the operator update overnight.
ETA: I let our dev cluster (okd 4.12.0-0.okd-2023-04-16-041331/FCOS 37) go back up to 5.7 and am willing to leave it there a couple of hours if more diagnostics are needed.
collector image pull spec:
image: registry.redhat.io/openshift-logging/vector-rhel8@sha256:4a2f9ca8c3379a8c68a832b9d6d83329352aa98352f1736f0cb8163b7636fad5
@jcantrill The problem we ran into in OCP 4.13 is that we moved the host OS from RHEL 8.6 to 9.2, and as a result we brought in all of the systemd changes between systemd-239 and systemd-252, which included several changes to the journal on-disk format. That meant that journalctl running inside a UBI8 image could no longer read it. What we did was patch systemd in OCP 4.13 and push upstream to make lz4 compression possible. Together with some other compatibility environment variables, we were able to make RHEL 9.2 journald compatible with UBI8 journalctl.
This seems different, given they're saying the problem occurs when moving from Logging 5.6 to 5.7, unless of course 5.6 somehow used a Fedora-aligned base image but 5.7 uses a UBI8 base image.
AFAICT on Logging 5.6:
image: registry.redhat.io/openshift-logging/fluentd-rhel8@sha256:7d60ecaac129cb76277901ff87a6005bf07d160b0d2d73988894a5530bad2dfe
and
❯ kubectl --context=4prod exec -it daemonset/collector -- cat /etc/redhat-release
Defaulted container "collector" out of: collector, logfilesmetricexporter
Red Hat Enterprise Linux release 8.7 (Ootpa)
with Logging 5.7:
image: registry.redhat.io/openshift-logging/vector-rhel8@sha256:4a2f9ca8c3379a8c68a832b9d6d83329352aa98352f1736f0cb8163b7636fad5
and
❯ kubectl --context=4dev exec -it daemonset/collector -- cat /etc/redhat-release
Defaulted container "collector" out of: collector, logfilesmetricexporter
Red Hat Enterprise Linux release 8.7 (Ootpa)
So it seems to me the containers are based on that UBI8 base.
The underlying nodes writing the journals have the following systemd versions:
okd-4.11 (FCOS 36):
sh-5.2# rpm -qa systemd
systemd-250.9-1.fc36.x86_64
okd-4.12 (FCOS 37):
sh-5.2# rpm -q systemd
systemd-251.13-6.fc37.x86_64
I'm not sure what OKD could be doing to the nodes' systemd configuration between the two releases that could interfere with Vector being able to read the files, but I'm happy to post this issue wherever is appropriate.
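If the journal-format theory applied here, the skew would look like this: the FCOS nodes write journals with systemd 250/251, while the collector reads them with the UBI8 image's journalctl (systemd 239 in RHEL 8 / UBI8 is an assumption about the image contents, not confirmed from the image itself). A toy sketch of that check:

```python
# Writer (node) vs reader (collector image) systemd major versions.
# Node versions come from the `rpm -q systemd` output above; 239 is the
# systemd shipped in RHEL 8 / UBI8 (an assumption about the image).
writers = {"okd-4.11 (FCOS 36)": 250, "okd-4.12 (FCOS 37)": 251}
reader = 239

for node, version in writers.items():
    if version > reader:
        print(f"{node}: journal written by systemd {version}; "
              f"systemd-{reader} journalctl may not read it")
```

Both nodes are well ahead of the assumed reader version, so the same class of incompatibility the maintainers describe for RHEL 9 is at least plausible on FCOS, even though the eventual root cause in this thread turned out elsewhere.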
We had the same issue after the automatic update to 5.7. Checking the ClusterLogging instance, I saw that the collector had switched from collection type fluentd to vector.
Reverting to fluentd resolved the issue.
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
  labels:
    app.kubernetes.io/instance: logging
spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
....
@crikke ... you have nailed it.
The update to 5.7 from 5.6.5 modified the clusterlogging object to CHANGE(!!) spec.collection.type from "fluentd" to "vector". Flipping that back to the original setting makes it work fine on 5.7.
Am I incorrect in thinking that this is undesirable behaviour? I certainly wouldn't expect it.
It seems what the update does is ADD spec.collection.type and set it to "vector" if it was previously unspecified. The documentation doesn't include that field, nor do the Release Notes for 5.7; it is gestured at in the Getting Started with Logging 5.7 doc.
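In other words, the upgrade appears to behave like the following sketch (hypothetical pseudologic, not the operator's actual code): when spec.collection.type was never set, it gets filled in with "vector", silently switching clusters that relied on the old implicit fluentd default.

```python
def apply_5_7_default(spec: dict) -> dict:
    """Hypothetical sketch of the observed 5.7 upgrade behaviour:
    fill in spec.collection.type with "vector" when it was never set,
    even if the nested spec.collection.logs.type says "fluentd"."""
    collection = spec.setdefault("collection", {})
    collection.setdefault("type", "vector")
    return spec

# The reporter's pre-upgrade spec had only the nested logs.type field:
spec = {"collection": {"logs": {"fluentd": {}, "type": "fluentd"}}}
print(apply_5_7_default(spec)["collection"]["type"])  # vector
```

That matches the ClusterLogging status dump above, where spec.collection.logs.type is fluentd but spec.collection.type is vector, and explains why explicitly setting the field back to fluentd fixes it.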
It's probably worth noting this somewhere in the Release Notes, or at least failing the update if the Collector Preference prerequisite is unset. Or defaulting to what was previously the default?
Safe to close this issue if the maintainers feel this is not a bug/undesirable behaviour.
@nate-duke Yeah, changing default behaviour in a minor release feels a bit sketchy. Maintainers should probably revert to the default behaviour and mark that it will be deprecated, then switch to Vector as the default in the next major release.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale