kubernetes-csi/external-provisioner

--retry-interval-max didn't change the max retry interval

jakeslee opened this issue · 10 comments

I use the command line option --retry-interval-max=1h to start a provisioner container as a sidecar with my custom CSI plugin. According to the documentation (https://github.com/kubernetes-csi/external-provisioner#command-line-options), I expect the retry interval for failed provisioning to back off up to 1 hour, but CreateVolume is actually called every 15 minutes.

I found that the code in controllers.go sets up a periodic resync which produces a sync event for every PVC every 15 minutes:

if controller.claimInformer != nil {
	controller.claimInformer.AddEventHandlerWithResyncPeriod(claimHandler, controller.resyncPeriod)
} else {
	controller.claimInformer = informer.Core().V1().PersistentVolumeClaims().Informer()
	controller.claimInformer.AddEventHandler(claimHandler)
}
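
To make the resync behavior easy to see outside the sidecar, here is a minimal, self-contained sketch using a fake clientset (the 2-second resync period and the PVC name are made up purely for illustration): UpdateFunc fires for the PVC on every resync even though nothing about the object changed, which is what re-enqueues claims every resyncPeriod.

package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// One PVC in a fake clientset; a short resync period so the effect shows up quickly.
	client := fake.NewSimpleClientset(&v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "my-pvc", Namespace: "default"},
	})
	factory := informers.NewSharedInformerFactory(client, 2*time.Second)

	factory.Core().V1().PersistentVolumeClaims().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Fires on every resync even though the PVC is unchanged.
			fmt.Println("resync update at", time.Now().Format(time.RFC3339))
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	time.Sleep(7 * time.Second) // expect roughly one update every 2 seconds
}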

So, is this by design?

What happened:

CreateVolume is called every 15 minutes.

What you expected to happen:

CreateVolume is called at most once per hour.

How to reproduce it:

Set --retry-interval-max=1h and have CreateVolume return an error continuously.

Anything else we need to know?:

Environment:

  • Driver version: csi-provisioner:v2.2.2
  • Kubernetes version (use kubectl version): OCP 4.8
  • OS (e.g. from /etc/os-release): CoreOS
msau42 commented

v2.2 is out of support now. Can you upgrade to a newer version and see if it's reproducible?
/triage needs-more-information

@msau42: The label(s) triage/needs-more-information cannot be applied, because the repository doesn't have them.

In response to this:

v2.2 is out of support now. Can you upgrade to a newer version and see if it's reproducible?
/triage needs-more-information

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I checked this issue in v3.4.0 and on the master branch, and found that the problem is caused by the informer resync mechanism.

The ProvisionController registers a ResourceEventHandlerFuncs event handler to handle the sync event. When a sync event is distributed, the handler's UpdateFunc is called, which enqueues the PVC to claimQueue:

informer := informers.NewSharedInformerFactory(client, controller.resyncPeriod)

// ----------------------
// PersistentVolumeClaims

claimHandler := cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { controller.enqueueClaim(obj) },
	UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueClaim(newObj) },
	DeleteFunc: func(obj interface{}) {
		// NOOP. The claim is either in claimsInProgress and in the queue, so it will be processed as usual
		// or it's not in claimsInProgress and then we don't care
	},
}

if controller.claimInformer != nil {
	controller.claimInformer.AddEventHandlerWithResyncPeriod(claimHandler, controller.resyncPeriod)
} else {
	controller.claimInformer = informer.Core().V1().PersistentVolumeClaims().Informer()
	controller.claimInformer.AddEventHandler(claimHandler)
}

The controller.enqueueClaim() function adds the PVC to the claimQueue directly with Add(), not AddRateLimited(), so the item does not go through the exponential rate limiter.
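
To illustrate the difference, here is a small standalone sketch (not the sidecar's code; the key and durations are invented) using client-go's workqueue package: AddRateLimited() waits for the limiter's growing per-item delay, while a plain Add() makes the key available for processing immediately.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The same kind of limiter that --retry-interval-start/--retry-interval-max parameterize:
	// the delay doubles per failure of a key, capped at the second argument.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(time.Second, time.Hour)

	key := "default/my-pvc"
	for i := 0; i < 5; i++ {
		// Every When() call counts as a failure and returns the next delay: 1s, 2s, 4s, 8s, 16s.
		fmt.Println("next retry delay:", limiter.When(key))
	}

	q := workqueue.NewNamedRateLimitingQueue(limiter, "claims")
	q.AddRateLimited(key) // becomes available only after the limiter's current delay for this key
	q.Add(key)            // becomes available immediately, bypassing the limiter entirely
}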

In the end, the PVC dequeued in ProvisionController.processNextClaimWorkItem() is used for the provision operation, which calls CreateVolume.
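
For completeness, a typical workqueue worker looks roughly like the sketch below (processNext and provision are illustrative names, not the actual processNextClaimWorkItem code): the exponential backoff only takes effect on the failure path via AddRateLimited(), so a key that the resync re-injects with Add() is processed right away.

package sketch

import "k8s.io/client-go/util/workqueue"

// processNext drains one item from the queue; provision stands in for the
// provisioning attempt that ultimately calls the driver's CreateVolume.
func processNext(q workqueue.RateLimitingInterface, provision func(key string) error) bool {
	obj, shutdown := q.Get()
	if shutdown {
		return false
	}
	defer q.Done(obj)

	key := obj.(string)
	if err := provision(key); err != nil {
		// Failure: re-queue with exponential backoff, bounded by --retry-interval-max.
		q.AddRateLimited(key)
		return true
	}
	// Success: reset the key's failure count so a later error backs off from the start again.
	q.Forget(key)
	return true
}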

As a workaround, I wrote a nested exponential rate limiter in my CSI plugin. Is there a better way to handle a case like this?
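
For what it's worth, here is a minimal sketch of the kind of per-volume backoff gate I mean (backoffGate, allow, fail, and succeed are hypothetical names, not the actual plugin code): CreateVolume would check allow() first and fail fast when the resync re-calls it too early, record errors with fail(), and clear the state with succeed().

package driver

import (
	"sync"
	"time"
)

// backoffGate is a hypothetical example of a "nested" rate limiter inside the
// CSI plugin: the plugin keeps its own per-volume exponential backoff so that a
// CreateVolume call re-triggered by the 15-minute resync can fail fast without
// touching the storage backend until the plugin's own window has elapsed.
type backoffGate struct {
	mu    sync.Mutex
	next  map[string]time.Time     // earliest time a real attempt is allowed again
	delay map[string]time.Duration // current backoff per volume name
	base  time.Duration            // e.g. 1s
	max   time.Duration            // e.g. 1h, mirroring --retry-interval-max
}

func newBackoffGate(base, max time.Duration) *backoffGate {
	return &backoffGate{
		next:  map[string]time.Time{},
		delay: map[string]time.Duration{},
		base:  base,
		max:   max,
	}
}

// allow reports whether a real provisioning attempt may run now; on false,
// CreateVolume would return an error immediately instead of calling the backend.
func (g *backoffGate) allow(name string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return !time.Now().Before(g.next[name])
}

// fail records a failed attempt and doubles the backoff, capped at max.
func (g *backoffGate) fail(name string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	d := g.delay[name]
	if d == 0 {
		d = g.base
	} else {
		d *= 2
		if d > g.max {
			d = g.max
		}
	}
	g.delay[name] = d
	g.next[name] = time.Now().Add(d)
}

// succeed clears the state once the volume has been provisioned.
func (g *backoffGate) succeed(name string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.delay, name)
	delete(g.next, name)
}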

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

/remove-lifecycle rotten

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.