fabric8io/fluent-plugin-kubernetes_metadata_filter

Error while watching pods: too old resource version with v2.4.5

raja-gola opened this issue · 13 comments

Error while watching pods: too old resource version: 23630883 (23632006) (RuntimeError)

complete stacktrace

#<Thread:0x00007f53f192fb70@/usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/filter_kubernetes_metadata.rb:274 run> terminated with exception (report_on_exception is true):
/usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:43:in `rescue in set_up_pod_thread': undefined method `<' for nil:NilClass (NoMethodError)
        from /usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:38:in `set_up_pod_thread'
        from /usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/filter_kubernetes_metadata.rb:274:in `block in configure'
/usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:133:in `block in process_pod_watcher_notices': Error while watching pods: too old resource version: 23630883 (23632006) (RuntimeError)
        from /usr/local/bundle/gems/kubeclient-4.6.0/lib/kubeclient/watch_stream.rb:28:in `block in each'
        from /usr/local/bundle/gems/http-4.4.1/lib/http/response/body.rb:37:in `each'
        from /usr/local/bundle/gems/kubeclient-4.6.0/lib/kubeclient/watch_stream.rb:25:in `each'
        from /usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:110:in `process_pod_watcher_notices'
        from /usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:40:in `set_up_pod_thread'
        from /usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/filter_kubernetes_metadata.rb:274:in `block in configure'
Unexpected error undefined method `<' for nil:NilClass
  /usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:43:in `rescue in set_up_pod_thread'
  /usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb:38:in `set_up_pod_thread'
  /usr/local/bundle/gems/fluent-plugin-kubernetes_metadata_filter-2.4.5/lib/fluent/plugin/filter_kubernetes_metadata.rb:274:in `block in configure'

Similar error with 2.4.5 on 1.15.

I downgraded the plugin to v2.4.1 in the fluent-operator Gemfile and haven't seen any of these errors for the past 2 days. BTW, this is on k8s version 1.16.3. So it's definitely an issue with 2.4.5.

It looks like this is addressed in 2.4.6.

v2.4.5...v2.4.6#diff-1ef0b670f3d0a49f0c40eff0977bd52dR32

We have the exact same issue with GKE 1.16.8-gke.15.

I'm seeing the same error with v2.4.6

Digging into the "too old resource version" error a bit, I believe the problem lies in the handling of type ERROR watch responses here:

The Kubernetes API concepts documentation explains this case and how clients should handle it: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

Here's the relevant excerpt:

A given Kubernetes server will only preserve a historical list of changes for a limited time. Clusters using etcd3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation. Most client libraries offer some form of standard tool for this logic. (In Go this is called a Reflector and is located in the k8s.io/client-go/cache package.)
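To make the recovery sequence in that excerpt concrete, here is a minimal Ruby sketch of "on 410 Gone, clear the cache, list again, and restart the watch from the new resourceVersion". Everything in it is an assumption for illustration: `FakeClient`, `list_pods`, `watch_pods`, and `watch_with_resync` are hypothetical names, and the stub does not reproduce kubeclient's real interface or the plugin's actual code.

```ruby
# Hypothetical stand-in for an API client (NOT kubeclient's real interface).
class FakeClient
  attr_reader :list_calls

  def initialize
    @list_calls = 0
    @latest_version = 100 # oldest version the fake server still remembers
  end

  # A list call returns current items plus the resourceVersion to watch from.
  def list_pods
    @list_calls += 1
    { items: ['pod-a'], resource_version: @latest_version }
  end

  # Yields watch notices. A version older than the retained history yields
  # a single ERROR notice carrying code 410.
  def watch_pods(resource_version)
    if resource_version < @latest_version
      yield({ 'type' => 'ERROR',
              'object' => { 'code' => 410, 'message' => 'too old resource version' } })
    else
      yield({ 'type' => 'MODIFIED', 'object' => { 'name' => 'pod-a' } })
    end
  end
end

# Watches from start_version; on 410 Gone it clears the local cache,
# re-lists, and restarts the watch from the freshly listed version.
def watch_with_resync(client, start_version, max_attempts: 3)
  cache = {}
  version = start_version
  events = []

  max_attempts.times do
    gone = false
    client.watch_pods(version) do |notice|
      if notice['type'] == 'ERROR' && notice.dig('object', 'code') == 410
        gone = true
        break # stop consuming this watch; it cannot be resumed
      end
      events << notice['type']
    end
    return events unless gone

    cache.clear
    listing = client.list_pods
    listing[:items].each { |name| cache[name] = true }
    version = listing[:resource_version]
  end

  events
end
```

Calling `watch_with_resync(FakeClient.new, 50)` against this stub triggers one re-list and then receives the `MODIFIED` notice. The bounded `max_attempts` is a design choice in this sketch so a persistently failing server cannot cause an endless loop.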

@jcantrill I suspect that the fix is to add special handling for status code 410 Gone in the ERROR block, and to handle this case similarly to the way the plugin handles DELETE.

The kubeclient library that this plugin uses to perform the watch also explains this: "Whenever you ask for a specific version, you must be prepared for an 410 "Gone" error if the server no longer recognizes it."

See: https://github.com/abonas/kubeclient#starting-watch-version

According to the good folks on the kubeclient project, the way to check for the 410 status in the notice is:

notice['object']['code'] == 410
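As a rough illustration of how that check could be wrapped, here is a small Ruby predicate. It is only a sketch: `gone_notice?` is a hypothetical helper name, and the notice is modeled as a plain Hash with string keys, which is an assumption about the shape rather than a statement about what kubeclient returns in every mode.

```ruby
# Hypothetical helper: true only for an ERROR notice whose status code is 410.
# Assumes the notice is a plain Hash with string keys (illustrative shape).
def gone_notice?(notice)
  notice['type'] == 'ERROR' && notice.dig('object', 'code') == 410
end

# Illustrative notices (made-up values, not captured from a real cluster).
gone  = { 'type' => 'ERROR', 'object' => { 'code' => 410 } }
other = { 'type' => 'ERROR', 'object' => { 'code' => 500 } }
added = { 'type' => 'ADDED', 'object' => { 'kind' => 'Pod' } }

puts gone_notice?(gone)   # true
puts gone_notice?(other)  # false
puts gone_notice?(added)  # false
```

Using `dig` rather than chained `[]` means a notice with no `object` key returns false instead of raising `NoMethodError` on `nil`.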

@jcantrill I've submitted a PR which I believe should solve this problem based on my learnings explained in the comments above.

One question for you: Should I include a minor version bump in my PR if I'd like this to go into a new release? Or do the maintainers handle that process yourselves?

We'll bump the version when we publish

Awesome. Thanks, @jcantrill. Active maintainers like you are a treasure.