opensearch-project/opensearch-build

[RELEASE]: Stabilize the integration test runs for distribution builds, no manual sign-offs anymore

reta opened this issue · 22 comments

reta commented

Is your feature request related to a problem? Please describe

At the moment, a recurrent practice during the release process is to collect manual sign-offs for numerous integration test failures across different components. Some of them are constant offenders; others pop up from release to release. The flakiness of the test runs costs the release team a lot of time in reaching a go/no-go decision and significantly lowers confidence in the release bundles.

Describe the solution you'd like

For 2.14.0, we should stabilize the integration test runs for distribution builds and no longer accept manual sign-offs. Integration test failures mean no-go by definition.

Describe alternatives you've considered

Continue collecting manual sign-offs and burning time on that.

Additional context

All previous releases (2.13.0 / 2.12.0 / 2.11.0 / ...)


I support this proposal. As far as I know, manual sign-off caused delays in releasing 5+ minor versions (as additional time was required for manual sign-off) and resulted in 7 patch releases (as bugs slipped through manual sign-off) in 2023.

Here's a link to what's being manually signed off in 2.13: #4433 (comment)

I am supportive - thanks @reta for pursuing this. However, given the number of manual signoffs/flaky tests, can we agree on a % reduction goal for 2.14, followed by targets for 2.15 and beyond, so that we can get down to 0 manual signoffs in 3-4 months? All teams are impacted and are burning down their flaky tests; however, the pace is slower than we expect.

The components below have the most flaky tests. If we are taking the % reduction goal, we should set it at the component level.

OpenSearch:

  • Reporting
  • SQL
  • Neural search
  • ml commons
  • Index management
  • alerting
  • CCR
  • Notifications
  • security-analytics

OpenSearch Dashboards:

  • Dashboards core
  • security analytics dashboards
  • IM Dashboards
  • AD dashboards
  • reports dashboards
  • Dashboard reporting
  • ganttchart dashboards
  • observability dashboards
  • alerting dashboards

Thanks @bbarani - can we take a goal of 50% reduction for each component by 2.14? It will be difficult to manage different reduction targets across different components.


Why 50% and not 100%? In 2.14 there were only 4 plugins that didn't manually sign off. Any team that doesn't have stable tests should be focused on that, not on adding new features.

Caught up with @bbarani, @dblock and @andrross. Let's prioritize eliminating the manual signoffs in 2.14 - we do not ship 2.14 unless we get green builds across core and repos.

I have created child issues for the components that manually signed off in 2.13.0. See above links.
Will tag the maintainers after a week if there is no response on any of them.

Thanks!

@bbarani You mentioned ISM as one of the components with a high number of flaky tests. This isn't true, and it may not be true for some of the other plugins either. As I called out in the release meeting, the primary reason for the ISM failures was an update of incorrect security certificates in the repo. This certificate update was suggested to plugins by the security team earlier to extend the expiry, and it was later fixed by the security team itself in another PR, opensearch-project/index-management#1147.
I do agree that there are still some flaky tests that fail intermittently on certain platforms, but I haven't seen them block the release.
@gaiksaya Can I request that you change the status of the ISM release from manual sign-off to fixed tests so that it gives the right picture, since we did not provide any manual sign-off? I am assuming that you referred to the passed PR builds instead of re-executing Jenkins for the sign-off.

Hi @vikasvb90 ,

Even before the security changes were merged, the index management plugin had seen a lot of flaky tests; see the issues opened and closed multiple times in past releases here. Re-executing Jenkins would not have fixed it either: a new RC would have to be generated, because the integration test framework checks out the exact commit that was used to build the RC.
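To illustrate that last point, here is a minimal sketch (not the actual opensearch-build code) of how an RC-pinned integration test run behaves: each component is cloned and checked out at the commit recorded when the RC was built, so a fix merged afterwards only takes effect once a new RC and manifest exist. The manifest field names below are illustrative assumptions, not the real schema.

```python
# Illustrative sketch only, assuming a bundle manifest that records a
# repository URL and a pinned commit for every component (field names are
# hypothetical, not the actual opensearch-build schema).
import subprocess
import yaml  # pip install pyyaml


def checkout_components(manifest_path: str, workdir: str) -> None:
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)

    for component in manifest["components"]:
        name = component["name"]
        repo = component["repository"]
        commit = component["commit_id"]  # pinned at RC build time, not the branch HEAD
        dest = f"{workdir}/{name}"

        subprocess.run(["git", "clone", repo, dest], check=True)
        # Tests always run against the commit that produced the RC, so a test
        # fix merged later is only picked up once a new RC (and manifest) is built.
        subprocess.run(["git", "checkout", commit], cwd=dest, check=True)
```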

We also recently matched the GitHub Actions specifications on Jenkins. See opensearch-project/opensearch-ci#412
We can use the 2.14.0 release as a testing environment for this. Would that be okay?

@gaiksaya I didn't mean that there are no flaky tests in ISM; I mentioned this in my comment as well. We do have some flaky tests, and we have been making improvements. I am pointing out the reason for the failures in the last release, based on which ISM was said to have the most flaky tests.
So, to reiterate, 2.13 got blocked specifically because of security, and ISM had nothing to do with it.

@vikasvb90, thanks for your comment. This issue mainly focuses on remediating all flaky tests and doesn't focus on a specific released OpenSearch version. The goal is to identify flaky tests and fix them before the 2.14.0 release.

OpenSearch Dashboards did not pass the build CI for 2.14 even though it passed these tests locally and in its own component CI. I will work with the build team to address it.

Thanks @ashwin-pc for getting the Dashboards testcases to run locally - we need all testcases running as part of CI in 2.15 without exceptions.

Update on passing the tests in the Build CI to avoid a manual signoff for 2.15:

To avoid having manual signoffs for OSD in 2.15, we need to split the OSD tests into groups. In 2.14 we had tests that passed when run individually but failed when all the tests were run together. The issue doesn't appear to lie with individual OpenSearch Dashboards tests, but rather with running the whole suite in one go. The OpenSearch Dashboards GitHub CI does not face this issue because we use CI groups to run the tests. The Functional Test Repo team has also observed that splitting the tests into separate CI groups helped resolve the failures.

We need to implement a similar approach for the Build CI as well. Instead of running all the tests together, we should partition them into smaller groups or batches. This way, if any test leaves behind residual data or fails to clean up properly, it won't affect the tests in the other batches.
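As a rough illustration of the batching idea (purely a sketch; the runner script and its --reset flag are placeholders, not the real Build CI entry point), the partitioning could look something like this:

```python
# Sketch only: split a flat list of test suites into batches and run each
# batch as its own invocation, so leaked state from one batch cannot cascade
# into the rest of the run.
import subprocess
from typing import List


def partition(suites: List[str], batch_size: int) -> List[List[str]]:
    return [suites[i:i + batch_size] for i in range(0, len(suites), batch_size)]


def run_in_batches(suites: List[str], batch_size: int = 10) -> List[int]:
    failed_batches = []
    for index, batch in enumerate(partition(suites, batch_size)):
        # Each batch runs as a separate process against a freshly reset cluster,
        # so residue from an earlier batch cannot affect this one.
        result = subprocess.run(["./run-functional-tests.sh", "--reset", *batch])
        if result.returncode != 0:
            failed_batches.append(index)
    return failed_batches
```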

While this may not address the root cause of the flakiness, it aligns with the previously successful solution of partitioning tests. Historically we have learned that if OSD tests run sequentially, we should expect failures. Investigating and fixing the underlying issues would require significant research efforts, which may not be feasible given the large number of inherited tests.

By partitioning the tests, we can mitigate the impact of flakiness and ensure more reliable test runs.

One alternative to this is to use @AMoo-Miki 's pipeline, which has already implemented this grouping and can take a built artifact and run the tests in groups.

reta commented

By partitioning the tests, we can mitigate the impact of flakiness and ensure more reliable test runs.

I think that as a workaround, grouping looks like a step in the right direction, but one may still face flakiness within a single group, since the cause (why sequential test runs are flaky) is still not understood.

By partitioning the tests, we can mitigate the impact of flakiness and ensure more reliable test runs.

I think that as a workaround, grouping looks like a step in the right direction, but one may still face flakiness within a single group, since the cause (why sequential test runs are flaky) is still not understood.

Another concern I have is that the chance of detecting a valid regression is reduced when running in groups. For example, if a feature that causes a performance regression is put in one group and happens to be the last test of that group, then subsequent groups will not detect the issue. As of now, if that happens, the subsequent tests fail with timeouts or other related errors.

I believe @SuZhou-Joe is looking into the flaky test failures and has made some discoveries about their probable cause.
It would be helpful if you could add your findings to this issue.
cc: @peterzhuamazon


Thanks @rishabh6788 .

Also waiting for @SuZhou-Joe to provide a possible fix to the root problem.
This should give us the confidence we need to proceed.

The initial changes on GitHub Actions seem to avoid the issue rather than fix it permanently:

Thanks.

I think that as a workaround, grouping looks like a step in the right direction, but one may still face flakiness within a single group, since the cause (why sequential test runs are flaky) is still not understood.

@reta Flakiness within a single group is manageable, since tests won't interfere with each other as much and all the author of a test needs to ensure is that the group runs successfully. And because the groups are smaller, validating fixes to a group should be a lot quicker and more reproducible. Every fix made to the flaky tests so far has been addressing tests interfering with each other, not the core functionality/regression itself. And when each test run takes hours to complete, it makes debugging and addressing these tests so much harder.

Also, we do know why the grouping fixes the tests. The CI uses a single low-powered OpenSearch instance to run all the tests. For OpenSearch Dashboards, we have a lot more tests compared to the plugins, so when we run all the tests sequentially, even though we clean up the data after the tests, OpenSearch does not clear its cache in time and starts to throttle the requests. @SuZhou-Joe came up with a solution to clear the cache for now, but relying on deep dives like that isn't scalable.
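For reference, the kind of cache-clearing workaround described above could look roughly like the sketch below; this is not the actual change, and the endpoint used and the point at which it is called are assumptions.

```python
# Rough sketch of clearing OpenSearch's caches between test groups so earlier
# groups don't push the single shared cluster into throttling. Assumes an
# unsecured local cluster; the group runner callback is a placeholder.
import requests


def clear_caches(base_url: str = "http://localhost:9200") -> None:
    # POST /_cache/clear clears the fielddata, query, and request caches
    # across all indices on the cluster.
    response = requests.post(f"{base_url}/_cache/clear")
    response.raise_for_status()


def run_groups(groups, run_group) -> None:
    for group in groups:
        run_group(group)  # e.g. invoke the functional test runner for one CI group
        clear_caches()    # give the next group a cold, unthrottled cluster
```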

By partitioning the tests, we can mitigate the impact of flakiness and ensure more reliable test runs.

I think that as a workaround, grouping looks like a step in the right direction, but one may still face flakiness within a single group, since the cause (why sequential test runs are flaky) is still not understood.

Another concern I have is that the chance of detecting a valid regression is reduced when running in groups. For example, if a feature that causes a performance regression is put in one group and happens to be the last test of that group, then subsequent groups will not detect the issue. As of now, if that happens, the subsequent tests fail with timeouts or other related errors.

@rishabh6788 I think this is unnecessary. Each of the fixes so far, including @SuZhou-Joe's latest fix for running them sequentially, has been about addressing tests interfering with each other. His fix just asks OpenSearch to clear its cache so that the next set of tests can run, which is really a workaround. UI tests expect the application to be in a specific state in order not to be flaky, and expecting them to work sequentially only makes this more fragile.

Also, if these kinds of regressions are really a concern, why aren't we running all functional tests (including those of other plugins) sequentially?

If we really want to catch accidental regressions, we should introduce fuzzy tests, which by definition will be flaky but can catch errors like the ones you are concerned about. As for the other integ tests, they are meant to validate a specific function and aren't written to validate across tests.

Tests interfering with one another (either through concurrent execution or leaving modified state after test runs) is definitely a common problem. We should continue working towards a real solution to this problem so that tests are independent and reliable regardless of the context in which they run. Changing the test run grouping may be a good mitigation for now, but if we don't actually solve the root problem of test interference then we're likely to continue struggling with this going forward.
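One common way to get there is to have each test own its state end to end. The snippet below is a purely illustrative pytest-style sketch (not taken from any OpenSearch plugin, and assuming an unsecured local cluster reachable via opensearch-py): each test works in a uniquely named index and always deletes it, so it neither depends on nor leaves behind shared state.

```python
# Hypothetical example of an integration test that is independent of run order:
# it creates a uniquely named index and cleans it up even when the test fails.
import uuid

import pytest
from opensearchpy import OpenSearch  # pip install opensearch-py


@pytest.fixture
def isolated_index():
    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    index = f"test-{uuid.uuid4().hex}"  # unique name: no collisions across tests
    client.indices.create(index=index)
    try:
        yield client, index
    finally:
        # Runs even if the test fails, so no residue leaks into later tests.
        client.indices.delete(index=index, ignore_unavailable=True)


def test_document_roundtrip(isolated_index):
    client, index = isolated_index
    client.index(index=index, id="1", body={"field": "value"}, refresh=True)
    assert client.get(index=index, id="1")["_source"] == {"field": "value"}
```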

we should introduce fuzzy tests, which by definition will be flaky

Fuzz tests should not be flaky. They should be deterministically reproducible with a given random seed, and any failures they uncover would be real bugs that need to be fixed.
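To illustrate that point with a hypothetical example (the round-trip functions and the FUZZ_SEED variable below are made up for the sketch): a randomized test stays reproducible as long as every run is driven by a seed that is reported on failure.

```python
# Small illustration: randomized ("fuzz"-style) tests need not be flaky if
# every run is driven by a seed that is reported on failure, so the exact
# failing input can be replayed deterministically.
import os
import random


def fuzz_roundtrip(serialize, deserialize, iterations: int = 1000) -> None:
    # Honour an explicit seed for replaying a failure; otherwise pick one.
    seed = int(os.environ.get("FUZZ_SEED", random.randrange(2**32)))
    rng = random.Random(seed)

    for _ in range(iterations):
        payload = {f"k{rng.randrange(100)}": rng.random() for _ in range(rng.randrange(1, 10))}
        assert deserialize(serialize(payload)) == payload, (
            f"round-trip mismatch; rerun with FUZZ_SEED={seed} to reproduce"
        )


if __name__ == "__main__":
    import json
    fuzz_roundtrip(json.dumps, json.loads)
```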

The CI uses a single low-powered OpenSearch

CI is using m5.8xlarge instances as far as I know. Is this not correct? That is a 32-core, 128 GB memory instance. If using a more powerful machine would help, I think we can explore doing that.