kubernetes-csi/external-provisioner

--retry-interval-max didn't change the max retry interval

jakeslee opened this issue · 10 comments

I use the command line option --retry-interval-max=1h to start a provisioner container as a sidecar with my custom CSI plugin. According to the documentation (https://github.com/kubernetes-csi/external-provisioner#command-line-options), I expect the retry interval for failed provisioning to back off up to 1 hour, but CreateVolume is actually called every 15 minutes.

I found that the code in controllers.go sets up a periodic resync which produces a sync event for every PVC every 15 minutes:

if controller.claimInformer != nil {
	controller.claimInformer.AddEventHandlerWithResyncPeriod(claimHandler, controller.resyncPeriod)
} else {
	controller.claimInformer = informer.Core().V1().PersistentVolumeClaims().Informer()
	controller.claimInformer.AddEventHandler(claimHandler)
}
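
To make the resync behavior easy to see outside the sidecar, here is a minimal, self-contained sketch using a fake clientset (the 2-second resync period and the PVC name are made up purely for illustration): UpdateFunc fires for the PVC on every resync even though nothing about the object changed, which is what re-enqueues claims every resyncPeriod.

package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// One PVC in a fake clientset; a short resync period so the effect shows up quickly.
	client := fake.NewSimpleClientset(&v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "my-pvc", Namespace: "default"},
	})
	factory := informers.NewSharedInformerFactory(client, 2*time.Second)

	factory.Core().V1().PersistentVolumeClaims().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Fires on every resync even though the PVC is unchanged.
			fmt.Println("resync update at", time.Now().Format(time.RFC3339))
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	time.Sleep(7 * time.Second) // expect roughly one update every 2 seconds
}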

So, is this by design?

What happened:

CreateVolume is called every 15 minutes.

What you expected to happen:

CreateVolume is called at most once per hour.

How to reproduce it:

Set --retry-interval-max=1h and have CreateVolume return an error continuously.

Anything else we need to know?:

Environment:

  • Driver version: csi-provisioner:v2.2.2
  • Kubernetes version (use kubectl version): OCP 4.8
  • OS (e.g. from /etc/os-release): CoreOS
msau42 commented

v2.2 is out of support now. Can you upgrade to a newer version and see if it's reproducible?
/triage needs-more-information

@msau42: The label(s) triage/needs-more-information cannot be applied, because the repository doesn't have them.

In response to this:

v2.2 is out of support now. Can you upgrade to a newer version and see if it's reproducible?
/triage needs-more-information

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I checked this issue in v3.4.0 and on the master branch, and found that the problem is caused by the informer resync mechanism.

The ProvisionController registers a ResourceEventHandlerFuncs event handler to handle the sync event. When a sync event is distributed, the handler's UpdateFunc is called, which enqueues the PVC to claimQueue:

informer := informers.NewSharedInformerFactory(client, controller.resyncPeriod)

// ----------------------
// PersistentVolumeClaims

claimHandler := cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { controller.enqueueClaim(obj) },
	UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueClaim(newObj) },
	DeleteFunc: func(obj interface{}) {
		// NOOP. The claim is either in claimsInProgress and in the queue, so it will be processed as usual
		// or it's not in claimsInProgress and then we don't care
	},
}

if controller.claimInformer != nil {
	controller.claimInformer.AddEventHandlerWithResyncPeriod(claimHandler, controller.resyncPeriod)
} else {
	controller.claimInformer = informer.Core().V1().PersistentVolumeClaims().Informer()
	controller.claimInformer.AddEventHandler(claimHandler)
}

The controller.enqueueClaim() function adds the PVC to the claimQueue directly with Add(), not AddRateLimited(), so the item does not go through the exponential rate limiter.
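
To illustrate the difference, here is a small standalone sketch (not the sidecar's code; the key and durations are invented) using client-go's workqueue package: AddRateLimited() waits for the limiter's growing per-item delay, while a plain Add() makes the key available for processing immediately.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The same kind of limiter that --retry-interval-start/--retry-interval-max parameterize:
	// the delay doubles per failure of a key, capped at the second argument.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(time.Second, time.Hour)

	key := "default/my-pvc"
	for i := 0; i < 5; i++ {
		// Every When() call counts as a failure and returns the next delay: 1s, 2s, 4s, 8s, 16s.
		fmt.Println("next retry delay:", limiter.When(key))
	}

	q := workqueue.NewNamedRateLimitingQueue(limiter, "claims")
	q.AddRateLimited(key) // becomes available only after the limiter's current delay for this key
	q.Add(key)            // becomes available immediately, bypassing the limiter entirely
}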

In the end, the PVC dequeued in ProvisionController.processNextClaimWorkItem() is used for the provision operation, which calls CreateVolume.
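
For completeness, a typical workqueue worker looks roughly like the sketch below (processNext and provision are illustrative names, not the actual processNextClaimWorkItem code): the exponential backoff only takes effect on the failure path via AddRateLimited(), so a key that the resync re-injects with Add() is processed right away.

package sketch

import "k8s.io/client-go/util/workqueue"

// processNext drains one item from the queue; provision stands in for the
// provisioning attempt that ultimately calls the driver's CreateVolume.
func processNext(q workqueue.RateLimitingInterface, provision func(key string) error) bool {
	obj, shutdown := q.Get()
	if shutdown {
		return false
	}
	defer q.Done(obj)

	key := obj.(string)
	if err := provision(key); err != nil {
		// Failure: re-queue with exponential backoff, bounded by --retry-interval-max.
		q.AddRateLimited(key)
		return true
	}
	// Success: reset the key's failure count so a later error backs off from the start again.
	q.Forget(key)
	return true
}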

As a workaround, I wrote a nested exponential rate limiter in my CSI plugin. Is there a better way to handle a case like this?
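
For what it's worth, here is a minimal sketch of the kind of per-volume backoff gate I mean (backoffGate, allow, fail, and succeed are hypothetical names, not the actual plugin code): CreateVolume would check allow() first and fail fast when the resync re-calls it too early, record errors with fail(), and clear the state with succeed().

package driver

import (
	"sync"
	"time"
)

// backoffGate is a hypothetical example of a "nested" rate limiter inside the
// CSI plugin: the plugin keeps its own per-volume exponential backoff so that a
// CreateVolume call re-triggered by the 15-minute resync can fail fast without
// touching the storage backend until the plugin's own window has elapsed.
type backoffGate struct {
	mu    sync.Mutex
	next  map[string]time.Time     // earliest time a real attempt is allowed again
	delay map[string]time.Duration // current backoff per volume name
	base  time.Duration            // e.g. 1s
	max   time.Duration            // e.g. 1h, mirroring --retry-interval-max
}

func newBackoffGate(base, max time.Duration) *backoffGate {
	return &backoffGate{
		next:  map[string]time.Time{},
		delay: map[string]time.Duration{},
		base:  base,
		max:   max,
	}
}

// allow reports whether a real provisioning attempt may run now; on false,
// CreateVolume would return an error immediately instead of calling the backend.
func (g *backoffGate) allow(name string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return !time.Now().Before(g.next[name])
}

// fail records a failed attempt and doubles the backoff, capped at max.
func (g *backoffGate) fail(name string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	d := g.delay[name]
	if d == 0 {
		d = g.base
	} else {
		d *= 2
		if d > g.max {
			d = g.max
		}
	}
	g.delay[name] = d
	g.next[name] = time.Now().Add(d)
}

// succeed clears the state once the volume has been provisioned.
func (g *backoffGate) succeed(name string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.delay, name)
	delete(g.next, name)
}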

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

/remove-lifecycle rotten

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.