cloudfoundry/cf-acceptance-tests

CATS failures for Zero downtime deploy using rolling strategy


Hello, we are facing an unusual scenario while running CATS in our CF environments. We have been seeing consistent failures of the zero-downtime app test defined in apps/rolling_deploy.go. Initially we assumed this was a flaky test, but it fails every time, in all our CF environments, with the following error:

[2021-02-16 07:24:41.54 (UTC)]> cf delete CATS-6-APP-66bed8a2fef76530 -f -r 
Deleting app CATS-6-APP-66bed8a2fef76530 in org CATS-6-ORG-bb58815fbab84c51 / space CATS-6-SPACE-a777d781cc84b43f as CATS-6-USER-0040cd1ab774609a...
OK

------------------------------
• Failure [100.027 seconds]
[apps]
/go/src/github.com/cloudfoundry/cf-acceptance-tests/cats_suite_helpers/cats_suite_helpers.go:41
  Rolling deploys
  /go/src/github.com/cloudfoundry/cf-acceptance-tests/cats_suite_helpers/cats_suite_helpers.go:47
    deploys the app with zero downtime [It]
    /go/src/github.com/cloudfoundry/cf-acceptance-tests/apps/rolling_deploy.go:40

    Expected
        <int>: 52
    to match exit code:
        <int>: 0

    /usr/local/go/src/runtime/asm_amd64.s:1374
------------------------------

While executing the tests, we specified the flag -flakeAttempts=3 so that failing tests are retried up to 3 times; however, the test still fails all 3 times with the same error.

CF CLI version used: 7.1.0+4c3168f9a.2020-09-09
Number of nodes used: 12
CATS branch used for tests: cf16.2

We run CATS in a Concourse pipeline that deploys our CF environments. To investigate this particular failure, we hijacked the container and ran the rolling_deploy.go tests in isolation, using FDescribe in the code to run only the zero-downtime rolling deploy test and skip all others. The same behaviour was reproduced there as well.

After spending a few hours on the code, we found that the test defines a ticker, ticker := time.NewTicker(1 * time.Second), which fires every second. The test spawns a goroutine that reads from the ticker's channel, and on every tick a case statement invokes a curl against the app root URI:

```go
case <-tickerChannel:
	appResponse := helpers.CurlAppRoot(Config, appName)
```
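
To make the shape of the test clearer, here is a stripped-down, standalone illustration of that pattern (this is not the CATS code itself; pollAppRoot and the timings are placeholders standing in for helpers.CurlAppRoot and the push under test):

```go
// poller.go: standalone illustration of the ticker/goroutine polling pattern
// described above. pollAppRoot is a placeholder for helpers.CurlAppRoot.
package main

import (
	"fmt"
	"time"
)

func pollAppRoot(appName string) string {
	// Placeholder: the real helper shells out to curl against the app root URI.
	return fmt.Sprintf("response from %s", appName)
}

func main() {
	ticker := time.NewTicker(1 * time.Second) // the interval under discussion
	tickerChannel := ticker.C
	doneChannel := make(chan bool, 1)

	go func() {
		for {
			select {
			case <-doneChannel:
				ticker.Stop()
				return
			case <-tickerChannel:
				// On every tick, curl the app root while the rolling deploy runs.
				fmt.Println(pollAppRoot("CATS-6-APP-66bed8a2fef76530"))
			}
		}
	}()

	time.Sleep(5 * time.Second) // stand-in for the cf push / restart under test
	doneChannel <- true
}
```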

We believe the problem is somehow related to the ticker interval defined within the test. Increasing the interval to 10 seconds seems to make the test pass reliably: it now passes 10 times out of 10 with no flakiness observed.
Another thing we noticed in the test logs is that this error is always printed after the app delete, as shown in the error log above. We assume the delete ran as part of the AfterEach() block. But we hope that is not relevant, since things run concurrently and the logs will not be in any particular order.

Based on the error information, our current assumption is that the rolling deploy test fails to complete a curl against the app root URI. This could happen if the goroutine is still issuing curls while the older app is being deleted: the request is still routed to the app's route, but the app itself is already gone. This is still just a theory, though, as we are not certain what the exact reason is, only that it is apparently fixed by increasing the ticker interval.

I would really appreciate any input on this, and any help understanding this odd behaviour. I would also be keen to hear whether this issue has been observed before, and if so, whether the ticker interval could be made configurable via the CATS config; the sketch below shows what I have in mind.
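
Something along these lines is what I mean, with the interval coming from the CATS config instead of being hard-coded (the field and getter below are purely illustrative and do not exist in CATS today):

```go
// Illustrative only: a hypothetical CATS config option for the polling interval.
package config

import "time"

type catsConfig struct {
	// Hypothetical option; not a real field in the current integration config.
	RollingDeployPollIntervalSeconds int `json:"rolling_deploy_poll_interval_seconds"`
}

func (c catsConfig) GetRollingDeployPollInterval() time.Duration {
	if c.RollingDeployPollIntervalSeconds <= 0 {
		return 1 * time.Second // the value currently hard-coded in the test
	}
	return time.Duration(c.RollingDeployPollIntervalSeconds) * time.Second
}

// The test would then create its ticker as:
//   ticker := time.NewTicker(Config.GetRollingDeployPollInterval())
```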

Thanks
Sushant

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/177037340

The labels on this GitHub issue will be updated when the story is started.

Hello Sushant,

Unfortunately we have not seen this test fail consistently like you describe, and we run it regularly in our CI environment. Based on the details you provided, I agree that the error is caused by the curl command being run in the goroutine inside the test, but I don't believe that decreasing the ticker frequency is the right solution. The intention of that test is to ensure that zero-downtime deployments are truly zero-downtime, so it's important to check frequently that the app is still up during the deploy. Decreasing the frequency defeats that purpose.

I believe the failed expectation is in AppCurler.CurlAndWait(), where it asserts that the curl command should exit with a code of 0. An exit code of 52 points to "Empty reply from server" (I found a related Stack Overflow answer). One possible cause is that the test is using http to connect to an https-only server. Does that sound possible? We've seen a number of these types of failures on cf-for-k8s in the past, but I believe we've fixed all of them. In particular, your test config may be providing an http URI to CATS when it should instead be using https.

You may have more luck debugging this issue by manually reproducing the steps that the test takes on your infrastructure and seeing whether you run into the same problem during a deploy. Also, it looks like the curl output should be included in the test output, so there may be more information there about the specific issue curl is encountering. That would be helpful in understanding exactly what's going wrong.
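
If it helps, a quick way to compare the two schemes by hand is to shell out to curl against one of your app routes and look at the exit code, something along these lines (the routes below are placeholders, and this is not part of the test suite):

```go
// curlcheck.go: curl an app route over http and https and print curl's exit code.
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

func curlExitCode(url string) int {
	cmd := exec.Command("curl", "-sS", "-o", "/dev/null", url)
	err := cmd.Run()
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1 // curl itself could not be started
}

func main() {
	for _, url := range []string{
		"http://my-app.apps.example.com/",  // placeholder route
		"https://my-app.apps.example.com/", // placeholder route
	} {
		// Exit code 0 means the request completed; 52 means "Empty reply from server".
		fmt.Printf("%s -> curl exit code %d\n", url, curlExitCode(url))
	}
}
```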

Hello Matt,

Thank you for your detailed response. I really appreciate it.

I see your perspective about ensuring app availability to support zero-downtime deployments, and I fully agree with your point that decreasing the polling frequency kind of defeats the purpose of the test.

I also checked our app URI based on your analysis of the curl exit code 52. We do seem to have an HTTPS URI set in our environment for the curl, so I guess that probably rules out this scenario.

I will dig into this again, focusing on this particular test in our infrastructure, and also see if I can get more information out of the curl output in the test logs, as you suggested.

Hopefully, I will be back in a few days with more information related to this issue.

Nevertheless, your pointers and inputs were really helpful and will hopefully help me further in finding the root cause of this issue within our infrastructure.

Cheers,
Sushant

Closing due to inactivity. Please feel free to re-open if required.