huge API load when using ose-logging-fluentd:4.5.0 in ocp 3.11.z
phhutter opened this issue · 4 comments
The openshift4/ose-logging-fluentd:4.5.0 container creates a huge load on the OCP 3.11.z API. I tried to reproduce the same behavior with versions 4.2.0, 4.3.0, and 4.4.0; the only affected version seems to be 4.5.0.
60-70 minutes after the first deployment, the API audit log starts logging the following "watch" requests on namespaces from fluentd. The source was the SA located in the fluentd project:
/api/v1/watch/namespaces?resourceVersion=240482770
/api/v1/watch/namespaces?resourceVersion=240482747
/api/v1/watch/namespaces?resourceVersion=240482829
/api/v1/watch/namespaces?resourceVersion=240482816
...
...
We see around 60k of these watch requests per minute, and it's getting even worse: after one day we had collected over 200 GB of audit logs triggered by the fluentd container.
How to reproduce:
- enable API audit messages, at least for the "watch" verb (a minimal audit config sketch follows this list)
- deploy openshift4/ose-logging-fluentd:4.5.0 on an OCP 3.11.z cluster
- wait 1-2 hours
- check the API audit logs
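For reference, a minimal sketch of what such an audit configuration could look like, assuming the advanced audit feature in the OCP 3.11 master-config.yaml; the file path and the single rule below are illustrative, not our exact production settings:

auditConfig:
  enabled: true
  # illustrative path, adjust to your environment
  auditFilePath: /var/log/origin/audit.log
  logFormat: json
  policyConfiguration:
    apiVersion: audit.k8s.io/v1beta1
    kind: Policy
    rules:
    # log at least request metadata for every watch,
    # so the fluentd namespace watches show up in the audit log
    - level: Metadata
      verbs: ["watch"]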
As a sidenote:

> 60-70 minutes after the first deployment, the API audit log starts logging the following request from fluentd
From my understanding, when a client watches a resource, the API only logs that access upon closing the connection. It's not unlikely that the first logs you see with a ~1h delay are just the first round of watches closing and being re-opened.
If so, something like netstat -plant | grep ESTABLISHED should show lots of connections opened by the fluentd process to (at least) one of your API servers.
Regardless, if that's new with the 4.5.0 image (or one of the plugins it ships with), that's a bit concerning.
Why would fluentd obsess over the Kubernetes API, watching namespaces?
Might be nice if you could share your fluentd configuration.
Looks like this issue could be related to fabric8io/fluent-plugin-kubernetes_metadata_filter#224.
The 4.5.0 image ships with version 2.4.2 of the fluent-plugin-kubernetes_metadata_filter plugin, whereas 4.4.0 comes with 2.4.1, if I'm correct.
When downgrading the fluent-plugin-kubernetes_metadata_filter plugin to version 2.4.1, everything seems to work fine again.
I've used a different fluentd.conf that I created myself, which doesn't set the "watch" parameter inside the following section:
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
According to the official documentation of the fluent-plugin-kubernetes_metadata_filter plugin, the parameter is set to true by default:
https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/master/README.md#L45
> watch - set up a watch on pods on the API server for updates to metadata (default: true)
whereas the official Red Hat fluentd.conf sets it to false by default:
https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/openshift/filter-k8s-meta.conf#L5
> watch "#{ENV['K8S_METADATA_WATCH'] || 'false'}"
Setting the watch parameter to false will fix the issue.
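For completeness, a minimal sketch of the filter section with the watch explicitly disabled; this just mirrors the default from the filter-k8s-meta.conf linked above, the explicit value is only for illustration:

<filter kubernetes.**>
  @type kubernetes_metadata
  # disable the watch against the API server;
  # metadata is then fetched on demand and cached instead
  watch false
</filter>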
No downgrade needed; I'll close the PR.