opensearch-project/dashboards-observability

[ACTION NEEDED] Fix flaky integration tests at distribution level

gaiksaya opened this issue · 11 comments

What is the bug?
It was observed in 2.13.0 and other previous releases that this component was manually signed off for the release despite failing integration tests. See opensearch-project/opensearch-build#4433 (comment)
The flakiness of these test runs takes a lot of the release team's time when collecting the go/no-go decision and significantly lowers confidence in the release bundles.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Run the integration tests for the component and observe the failures.
  2. The issues can be reproduced using the steps described in the AUTOCUT issues for the failed integration tests.

What is the expected behavior?
Tests should be consistently passing.

Do you have any additional context?
Please note that this is a hard blocker for the 2.14.0 release, as per the discussion here

@RyanL1997 @ps48 Can you please provide your inputs?

We're working on it. A while back I asked about the failures in opensearch-project/opensearch-build#4635; as far as I can tell, the distribution failures aren't coming from our tests but from somewhere else in the pipeline. I've marked our distribution issues with "help wanted" where the label is applicable.

It also looks like many of the manifests are still showing a Not Available status (related to the discussion here), but they show as unavailable even for fresh logs, so it doesn't seem to be a case of the manifests being stale.

Tagging @zelinh here to provide his inputs.

Here are some reasons it may show Not Available: https://github.com/opensearch-project/opensearch-build/tree/main/src/report_workflow#why-are-some-component-testing-results-missing-or-unavailable
@Swiddis Could you share one situation that is showing Not Available so I can look into it in more detail?

Could you share one situation that is showing Not Available so I can look into it in more detail?

E.g. the 2.14 integration tests autocut: of the three most recent manifests at the time of writing, two are unavailable (most recent, second most recent (available), third most recent).

I saw the following in both of the unavailable runs. It looks like the process was terminated by the timeout limit while running the integ tests for observabilityDashboards, so it never got through the test-recording steps.

Cancelling nested steps due to timeout
Sending interrupt signal to process

Session terminated, killing shell...Terminated
 ...killed.
script returned exit code 143
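
For context on that last line: exit code 143 is 128 + 15, i.e. the shell was killed with SIGTERM, which is what the Jenkins timeout step sends when the block it wraps exceeds its limit. A minimal Jenkinsfile sketch of the shape this implies (the 2-hour figure and the script name are assumptions, not the pipeline's actual values):

    // Sketch only: the "Cancelling nested steps due to timeout" message above is
    // what the timeout step prints before interrupting the body it wraps.
    timeout(time: 2, unit: 'HOURS') {   // actual limit in the pipeline is unknown
        sh './integtest.sh'             // placeholder for the real integ-test invocation
    }
    // On timeout, Jenkins sends SIGTERM to the running shell, which then exits
    // with 128 + 15 = 143, i.e. the "script returned exit code 143" seen above.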

https://build.ci.opensearch.org/job/integ-test-opensearch-dashboards/5856/consoleFull
https://build.ci.opensearch.org/job/integ-test-opensearch-dashboards/5844/consoleFull
Both of these jobs ran for more than 4 hours, while the available one ran for only 1.5 hours.
Do you have any idea why these jobs run longer than usual? @rishabh6788 @gaiksaya

Hypothesis: the failing tests are flaky, and the timeouts only happen in runs where those tests pass (i.e. something later in the test suite is taking all the time). We only see the failure message when the earlier test fails and cuts the run short.

Based on this hypothesis I made opensearch-project/opensearch-dashboards-functional-test#1250 to fix the flakiness, but I'm still not sure what's causing the timeouts.
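
One thing that might help pin down where the time goes (a sketch only; runComponentIntegTest and component are placeholder names, not existing pipeline helpers) is timing each component's integ-test body in the Jenkinsfile, so an unusually slow suite shows up in the console log before the 4-hour mark:

    // Sketch: log per-component durations to compare against the ~1.5-hour baseline.
    def start = System.currentTimeMillis()
    runComponentIntegTest(component)                       // placeholder helper
    def minutes = (System.currentTimeMillis() - start) / 60000
    echo "Integ tests for ${component} finished in ${minutes} minutes"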

For completeness I've checked the recent pipeline logs after the flakiness fix was merged, and am not seeing any integ-test failures for observability. https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/integ-test-opensearch-dashboards/runs/5899/log/?start=0

I can find the interruption exception, but no indication of what specifically is being interrupted (is some test hanging?):

org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 5a075705-b450-4433-85c4-0b5d9991ba84
org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
		at org.jenkinsci.plugins.workflow.steps.BodyExecution.cancel(BodyExecution.java:59)
		at org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.cancel(TimeoutStepExecution.java:197)
		at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
		at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
		at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
		at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	Caused: java.lang.Exception: Error running integtest for component observabilityDashboards
		at WorkflowScript.run(WorkflowScript:317)
		at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(Docker.groovy:141)
		at ___cps.transform___(Native Method)
		at java.base/jdk.internal.reflect.GeneratedConstructorAccessor790.newInstance(Unknown Source)
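
Reading the trace, my guess (a hypothetical reconstruction, not the actual WorkflowScript) is that the timeout's FlowInterruptedException gets caught around the Docker step and rethrown with just the component name, which would explain why the log names the component but not whichever test was hanging inside it:

    // Hypothetical shape of WorkflowScript around line 317; testImage and
    // componentName are placeholders, not the pipeline's real variable names.
    try {
        docker.image(testImage).inside {
            sh './integtest.sh'   // placeholder for the real integ-test call
        }
    } catch (Exception e) {
        // The FlowInterruptedException from the timeout becomes the cause, matching
        // the "Caused: java.lang.Exception: Error running integtest for component
        // observabilityDashboards" line above.
        throw new Exception("Error running integtest for component ${componentName}", e)
    }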

Tagging @rishabh6788 to look into the above failure ^

Currently just held up by #1822