kubernetes-sigs/gcp-compute-persistent-disk-csi-driver

Sanity tests do not work with the PDCSI driver attach/detach behavior change

saikat-royc opened this issue · 33 comments

#988 introduces a behavior change in the PDCSI driver's ControllerPublish/ControllerUnpublish handling. If any error is encountered in the controller publish or unpublish path, the driver marks the target node with a backoff condition. The implication is that, until the backoff period has expired, all ControllerPublish/ControllerUnpublish calls for the target node are denied outright by the driver.
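For context, here is a minimal sketch of that kind of gating, assuming a per-node backoff keyed on the node ID (illustrative only, using k8s.io/client-go/util/flowcontrol; the driver's actual types and durations differ):

```go
// Illustrative sketch of gating publish calls on a per-node backoff
// condition. Not the driver's actual implementation; durations are
// placeholders.
package backoffsketch

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

type controllerServer struct {
	errorBackoff *flowcontrol.Backoff
}

func newControllerServer() *controllerServer {
	// Hypothetical durations: 200ms initial backoff, capped at 5 minutes.
	return &controllerServer{
		errorBackoff: flowcontrol.NewBackOff(200*time.Millisecond, 5*time.Minute),
	}
}

func (cs *controllerServer) controllerPublish(nodeID string) error {
	// Deny the call outright while the node is still in backoff.
	if cs.errorBackoff.IsInBackOffSinceUpdate(nodeID, time.Now()) {
		return fmt.Errorf("ControllerPublish not permitted on node %q due to backoff condition", nodeID)
	}
	if err := attachDisk(nodeID); err != nil {
		// Any failure starts (or extends) the backoff window for this node.
		cs.errorBackoff.Next(nodeID, time.Now())
		return err
	}
	// A successful call clears the node's backoff state.
	cs.errorBackoff.Reset(nodeID)
	return nil
}

// attachDisk stands in for the real GCE attach call.
func attachDisk(nodeID string) error { return nil }
```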

The sanity tests run a series of negative and positive test cases against the same underlying driver instance. This means that if a negative controller publish or unpublish test case runs first (e.g. "ControllerPublishVolume should fail when the volume is already published but is incompatible"), that expected failure causes the driver to mark the target node with a backoff condition, which in turn makes the next positive controller publish test ("Controller Service [Controller Server] should work") fail with:

rpc error: code = Unavailable desc = ControllerPublish not permitted on node "projects/test-project/zones/country-region-zone/instances/test-name" due to backoff condition

As a result, the test cases can fail randomly depending on the order in which they run.

See a failed run here

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/lifecycle frozen

I noticed there are a few more failures now so I am going to pick this up and try to fix them.

[Fail] Controller Service [Controller Server] [AfterEach] ControllerPublishVolume should fail when the volume is already published but is incompatible 
/usr/local/google/home/judemars/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113

[Fail] Node Service NodeUnpublishVolume [It] should remove target path 
/usr/local/google/home/judemars/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/vendor/github.com/kubernetes-csi/csi-test/v4/pkg/sanity/node.go:250

[Fail] Node Service NodeGetVolumeStats [It] should fail when volume does not exist on the specified path 
/usr/local/google/home/judemars/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/vendor/github.com/kubernetes-csi/csi-test/v4/pkg/sanity/node.go:250

[Fail] Node Service [It] should work 
/usr/local/google/home/judemars/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/vendor/github.com/kubernetes-csi/csi-test/v4/pkg/sanity/node.go:140

Okay, so #1083 fixes the latter three of those failures, but the first one (which this bug was originally about) still exists.

Note that (as @saikat-royc pointed out to me) the backoff behavior changed from per node to per disk in #1028, so the order of tests should no longer be a problem.

So presumably the problem is coming from within the test case itself 👻. This is my understanding of the failing test's scenario:

  1. We create a volume -> succeeds
  2. We publish that volume -> succeeds
  3. We re-publish that volume, expecting an error because it was already published -> fails as expected (and triggers backoff)
  4. The cleanup function then tries to unpublish the volume, and because of step 3 this fails the test case with the error ControllerUnpublish not permitted on node "projects/test-project/zones/country-region-zone/instances/test-name" due to backoff condition. (A rough sketch of this flow is below.)
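Sketched out with the CSI gRPC client types (the request contents here are placeholders, not the actual values the sanity suite uses):

```go
// Hand-written sketch of the failing scenario; not the csi-test code itself.
package sanitysketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

func repro(conn *grpc.ClientConn, nodeID string) error {
	ctx := context.Background()
	ctrl := csi.NewControllerClient(conn)

	mountCap := func(mode csi.VolumeCapability_AccessMode_Mode) *csi.VolumeCapability {
		return &csi.VolumeCapability{
			AccessType: &csi.VolumeCapability_Mount{Mount: &csi.VolumeCapability_MountVolume{}},
			AccessMode: &csi.VolumeCapability_AccessMode{Mode: mode},
		}
	}

	// 1. Create a volume -> succeeds.
	vol, err := ctrl.CreateVolume(ctx, &csi.CreateVolumeRequest{Name: "sanity-vol"})
	if err != nil {
		return err
	}
	volID := vol.GetVolume().GetVolumeId()

	// 2. Publish it to the node -> succeeds.
	if _, err := ctrl.ControllerPublishVolume(ctx, &csi.ControllerPublishVolumeRequest{
		VolumeId:         volID,
		NodeId:           nodeID,
		VolumeCapability: mountCap(csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER),
	}); err != nil {
		return err
	}

	// 3. Re-publish with an incompatible capability -> fails as the test
	//    expects, but the failure also puts the node into backoff.
	_, _ = ctrl.ControllerPublishVolume(ctx, &csi.ControllerPublishVolumeRequest{
		VolumeId:         volID,
		NodeId:           nodeID,
		VolumeCapability: mountCap(csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY),
	})

	// 4. Cleanup unpublishes the volume -> now rejected with the
	//    "due to backoff condition" error because of step 3.
	_, err = ctrl.ControllerUnpublishVolume(ctx, &csi.ControllerUnpublishVolumeRequest{
		VolumeId: volID,
		NodeId:   nodeID,
	})
	return err
}
```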

@saikat-royc / @mattcary: Let me know if it sounds correct that step 3 would trigger step 4 to fail and if so, what y'all think we should do. Maybe the answer is to add an exception to backoff for unit tests?

We have another issue open on sanity tests from a little while ago---I think the basic problem is that the sanity tests are starting to test our mocks more than our actual code. At the time I was thinking we would want to change the sanity tests to use an actual GCP client rather than a mock. Then I think all these problems go away, even this third one? But I'm not sure.

Oh, alrighty, I see #991. Do we want to abandon #1083 and work on that instead? Or go ahead with it, fix the other issue (the one this bug is about), and then add the sanity tests as a PR gate so they don't get further broken while we work on #991?

Separately, I'm still curious about the best way forward to resolve the backoff problem in this test. I'm not convinced that the mock version of the device utils is to blame for that problem, since (if my theory above is right) the earlier failure in the same test case is what causes the backoff error later in the test.

Ah, right you are about the backoff problem. Hmm.

Maybe the best way forward is this: could you summarize what the sanity tests do? Then maybe it will be clear what the assumptions are around mocks and around stateful behavior of the driver.

I summarized the failing test in question above (see steps 1-4), but an explanation of all the tests is at https://github.com/kubernetes-csi/csi-test/blob/master/pkg/sanity/README.md. They are supplied by CSI itself (not in our repo) and verify our driver's basic adherence to the CSI spec.

That page doesn't actually describe what's going on very well. What exactly is configured? Is a driver process spun up? Is there any organization to the sanity tests?

Ah sorry - ok here is where the tests are configured.

Looks like we mock out:

But these are real (using those faked dependencies):

  • identityService
  • nodeService
  • controllerService
  • gceDriver itself

Then here we set up a real gceDriver to run at the endpoint we pass into the test functions.
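For reference, the general shape of that wiring (a simplified sketch assuming the csi-test v4 sanity API, with placeholder endpoint/harness names, not our exact sanity_test.go) is roughly:

```go
// Simplified sketch of handing a running driver instance to the csi-test
// sanity suite. startDriver is a placeholder for the harness code that
// builds the gceDriver with its fake dependencies.
package sanity

import (
	"testing"

	csisanity "github.com/kubernetes-csi/csi-test/v4/pkg/sanity"
)

func TestSanity(t *testing.T) {
	endpoint := "unix:/tmp/csi.sock" // placeholder socket path

	// Serve the real identity/node/controller services (backed by the fakes)
	// on the endpoint the sanity suite will connect to.
	go startDriver(endpoint)

	config := csisanity.NewTestConfig()
	config.Address = endpoint
	csisanity.Test(t, config)
}

// startDriver is a placeholder; the real harness constructs the gceDriver
// with mocked cloud/mount/device dependencies and serves it here.
func startDriver(endpoint string) { /* driver construction elided */ }
```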

For this test in particular, with regard to how much is mocked out: we are exercising this part of the real controller code that checks for incompatibility with existing disks, even though the specifics of the disks themselves are supplied by the mock.

The history isn't clear on why these sanity tests were introduced, but from the name at least they seem intended as a lightweight check of the driver's basic adherence to its various contracts with CSI. When you say replacing the mocks with "an actual GCP client", what exactly are you referring to - talking to the real GCE instance? If that's what we want to do, it feels to me like it may be better to just fold these into the e2e tests rather than having a separate "sanity" test.

Lastly, I'm not sure what you were thinking re: organization, but there are different batches of tests defined by the files in https://github.com/kubernetes-csi/csi-test/tree/master/pkg/sanity.

When you say replacing the mocks with "an actual GCP client", what exactly are you referring to - talking to the real GCE instance? If that's what we want to do, it feels to me like it may be better to just fold these into the e2e tests rather than having a separate "sanity" test.

Yeah, that's what I was thinking. It may be that our e2e tests could be made more focused if they overlap with sanity.

I wonder if we could use BeforeEach/AfterEach stanzas to create a new driver before each test (or recreate it only if it's in an error state, if creating one before every test is too expensive) to solve the backoff problem.
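Something like this, perhaps (hypothetical; it reuses the placeholder startDriver/endpoint names from the setup sketch above, with stopDriver standing in for whatever teardown the harness would need):

```go
// Hypothetical: recreate the driver before each spec so that error/backoff
// state from one spec can't leak into the next.
var _ = ginkgo.BeforeEach(func() {
	stopDriver()          // placeholder: shut down the previous instance
	startDriver(endpoint) // placeholder: serve a fresh instance on the same endpoint
})
```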

I'm still not sure that would help this specific test failure, since the backoff occurs (as expected) within the test itself, and that affects the cleanup in a later part of the same test.

Ah, darn.

Hmm, they are run in different It()s, so I'd think there'd be some flavor of BeforeEach we could use? Is there a BeforeDescribe? (There doesn't seem to be, but maybe BeforeEach is sufficient?)

They all run in the same It() for this case - this one. The problem of separate test cases affecting each other's backoff state no longer exists after your change in #1028. We only have one test (in one It()) that runs the four steps above and fails. And if I'm understanding backoff correctly, I think it's working as expected, since step 3 triggers backoff and causes step 4 to error.

@saikat-royc Can we change the controller to not put the driver into the backoff condition if the error isn't a "rate limit exceeded" error from the GCE API? From what I recall, we added the node backoff logic in #847 due to the VM queue. I think that would fix the sanity tests, since we're not encountering that specific error here (it's a volume-already-published error).
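A sketch of what such a check could look like (a hypothetical helper; real GCE rate-limit errors may surface in more than one form, so this is only the general idea):

```go
// Hypothetical: only enter backoff for errors that look like GCE-side
// throttling, rather than for every publish/unpublish failure.
package errclass

import (
	"errors"
	"net/http"

	"google.golang.org/api/googleapi"
)

func shouldBackoff(err error) bool {
	var apiErr *googleapi.Error
	if errors.As(err, &apiErr) {
		return apiErr.Code == http.StatusTooManyRequests
	}
	return false
}
```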

We also want to backoff if the detach fails due to the node not existing or something like that, right?

I think the assumption the sanity tests make that the tests are all independent is just wrong, and we should fix that---if need be by recreating the driver for each It() (which is what ginkgo assumes, IIUC).

Re: my comment above, the problem is happening within one test, one It(). For the one failing test, within one It() we do:

  1. We create a volume -> succeeds
  2. We publish that volume -> succeeds
  3. We re-publish that volume, expecting an error because it was already published -> fails as expected (and triggers backoff)
  4. The cleanup function then tries to unpublish the volume, and because of step 3 this fails the test case with the error ControllerUnpublish not permitted on node "projects/test-project/zones/country-region-zone/instances/test-name" due to backoff condition.

Another possible solution might be to make the backoff conditions RPC-specific, so that a failed publish request would not cause a later unpublish request to be rejected, which seems to be what's happening here.
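For example, the backoff entries could be keyed on something like this (hypothetical key shape, just to illustrate the idea):

```go
package backoffkey

// Hypothetical: include the RPC name in the backoff key so a failed
// ControllerPublishVolume doesn't block a later ControllerUnpublishVolume
// for the same node/volume pair.
type backoffKey struct {
	nodeID   string
	volumeID string
	rpc      string // e.g. "ControllerPublishVolume"
}

// id flattens the key into the string form a flowcontrol.Backoff expects.
func (k backoffKey) id() string {
	return k.nodeID + "|" + k.volumeID + "|" + k.rpc
}
```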

We also want to backoff if the detach fails due to the node not existing or something like that, right?

@mattcary Is this just to preemptively get ahead of quota issues before they happen? (e.g. if a node doesn't exist, presumably we wouldn't want to keep retrying and hit the 60s user quota limits for the GCE API)

@judemars in that test you reference, it looks like it only fails at the end of the It()?

@pwschuurman There's only one quota for all api calls, so if we wait until the detach calls hit quota, we'll have other operations unnecessarily fail.

Also, given the problems we have with arcus queuing generally, I think that making an effort not to make the problem worse will help stability.

I don't think we should change the driver behavior due to limitations in our testing infrastructure.

Correct, it fails with a backoff error in the cleanup function at the end of the It(), since a previous call within the same test case (still within the same It()) to publish a volume failed. And that previous failure was expected, since that's exactly what the test case is testing: "should fail when the volume is already published but is incompatible".

@judemars Making them RPC-specific might be a solution, but since quotas are counted across all RPCs, I think it would get complicated quickly trying to track that.

We've already had churn going from per-node backoff to per-node/per-volume backoff. As far as we know, the current system works well in practice, and as mentioned above I'm hesitant to complicate the driver behavior in order to make the tests run better. It seems like a much better tradeoff to complicate the tests instead, since if we get that wrong it's much easier to fix.

@judemars Oh, interesting, I didn't realize it was the cleanup function that was failing.

In many of the other tests---the GKE guitar ones in particular---we make errors in the cleanup functions informational only, and they don't fail the test. Maybe that would be a better way to do it?

Wouldn't that lead to leaking GCE resources, though, if we moved to a real GCE instance backing these tests? Making cleanup errors informational might be fine if these tests were specific to our driver, but they are actually shared by all CSI drivers, so I would guess there's a higher bar for cleanliness.

Fair point.

If we retried the cleanup, would that fix it?

Ah, yes, it looks like if we retried after 200ms it would work. Maybe another option, which would let us avoid inserting a retry-with-wait into the shared tests, is to set the initial backoff period much lower for the sanity tests, so that by the time the cleanup runs the backoff window has already expired?

That's slightly non-ideal, though, since it would potentially leave the backoff functionality untested in these tests. I can also see if there is a way to set it low for just this one test.
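One lightweight way to do that (hypothetical names; #1087 may well do it differently) is to make the durations constructor arguments so the sanity harness can pass something tiny:

```go
// Hypothetical: production keeps the normal durations, while the sanity test
// harness passes something tiny so the backoff window has already expired by
// the time the cleanup's ControllerUnpublish runs.
package backoffconfig

import (
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

const (
	defaultBackoffInitial = 200 * time.Millisecond // placeholder production values
	defaultBackoffMax     = 5 * time.Minute
)

func newErrorBackoff(initial, max time.Duration) *flowcontrol.Backoff {
	return flowcontrol.NewBackOff(initial, max)
}

// In the sanity test setup, something like:
//   backoff := newErrorBackoff(time.Millisecond, 10*time.Millisecond)
```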

That could be a good idea.

Since the backoff appears to break the sanity tests' assumption of how the driver works, maybe it's okay to set the threshold very low. It seems like the only thing backoff can do here is produce a false-positive failure.

(assuming we can set the backoff low enough that the cleanup isn't flaky)

Alrighty, I added #1087 to lower the backoff duration for the sanity tests.

Note that I also mocked up what per-RPC backoff conditions would look like in #1089, to see how big of a change it would be, and it doesn't look like too much. I personally prefer that approach over #1087, as it doesn't diverge the test and production behaviors of the driver at all, but it's definitely true that I don't have the historical context for backoff, so I'll defer to your decision.

+1 to #1087 (I put my comments in the PR). I feel #1089 is a large hammer that isn't needed at this point in time.

Alrighty, abandoned #1089 in favor of #1087.