huge API load when using ose-logging-fluentd:4.5.0 in ocp 3.11.z
phhutter opened this issue · 4 comments
The openshift4/ose-logging-fluentd:4.5.0 container creates a huge load on the OCP 3.11.z API. I tried to reproduce the same behavior with versions 4.2.0, 4.3.0, and 4.4.0; the only affected version seems to be 4.5.0.
60-70 minutes after the first deployment, the API audit log starts logging the following "watch" requests on namespaces from fluentd. The source was the SA located in the fluentd project:
/api/v1/watch/namespaces?resourceVersion=240482770
/api/v1/watch/namespaces?resourceVersion=240482747
/api/v1/watch/namespaces?resourceVersion=240482829
/api/v1/watch/namespaces?resourceVersion=240482816
...
...
We see around 60k of these watch requests per minute, and it's getting even worse: after one day we had collected over 200 GB of audit logs triggered by the fluentd container.
How to reproduce:
- enable API audit messages, at least for the "watch" verb (a minimal audit config sketch follows this list)
- deploy openshift4/ose-logging-fluentd:4.5.0 on an OCP 3.11.z cluster
- wait 1-2 hours
- check the API audit logs
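For reference, a minimal sketch of what such an audit configuration could look like, assuming the advanced audit feature in the OCP 3.11 master-config.yaml; the file path and the single rule below are illustrative, not our exact production settings:

auditConfig:
  enabled: true
  # illustrative path, adjust to your environment
  auditFilePath: /var/log/origin/audit.log
  logFormat: json
  policyConfiguration:
    apiVersion: audit.k8s.io/v1beta1
    kind: Policy
    rules:
    # log at least request metadata for every watch,
    # so the fluentd namespace watches show up in the audit log
    - level: Metadata
      verbs: ["watch"]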
As a sidenote:

> 60-70 minutes after the first deployment, the API audit log starts logging the following request from fluentd
From my understanding, when a client watches a resource, the API only logs that access upon closing the connection. It's not unlikely that the first logs you see with a ~1h delay are just the first round of watches closing and being re-opened.
If so, something like netstat -plant | grep ESTABLISHED should show lots of connections opened by the fluentd process to (at least) one of your API servers.
Regardless, if that's new with the 4.5.0 image (or one of the plugins it ships with), that's a bit concerning.
Why would fluentd obsess over the Kubernetes API, watching namespaces?
Might be nice if you could share your fluentd configuration.
Looks like this issue could be related to fabric8io/fluent-plugin-kubernetes_metadata_filter#224.
The 4.5.0 image ships with version 2.4.2 of the fluent-plugin-kubernetes_metadata_filter plugin, whereas 4.4.0 comes with 2.4.1, if I'm correct.
When downgrading the fluent-plugin-kubernetes_metadata_filter plugin to version 2.4.1, everything seems to work fine again.
I've used a different fluentd.conf that I created myself, which doesn't set the "watch" parameter inside the following section:
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
According to the official documentation of the fluent-plugin-kubernetes_metadata_filter plugin, the parameter is set to true by default:
https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/master/README.md#L45
> watch - set up a watch on pods on the API server for updates to metadata (default: true)
whereas the official Red Hat fluentd.conf sets it to false by default:
https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/openshift/filter-k8s-meta.conf#L5
> watch "#{ENV['K8S_METADATA_WATCH'] || 'false'}"
Setting the watch parameter to false will fix the issue.
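For completeness, a minimal sketch of the filter section with the watch explicitly disabled; this just mirrors the default from the filter-k8s-meta.conf linked above, the explicit value is only for illustration:

<filter kubernetes.**>
  @type kubernetes_metadata
  # disable the watch against the API server;
  # metadata is then fetched on demand and cached instead
  watch false
</filter>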
No downgrade needed; I'll close the PR.