[rush] Flaky Test Quarantine

elliot-nelson opened this issue · 1 comment

Summary

As a Rush monorepo maintainer, I wish I could offer developers a comprehensive strategy for tracking and addressing flaky tests across all projects.

This could be a feature, coded in Rush, or just a strategy, documented as part of Rushstack, that could be implemented on a company-by-company basis... either is better than nothing!

Details

I'm not married to any particular implementation here, but ideally, the strategy would work across unit tests (written in Jest) and integration tests (the kind you'd write with Cypress or Playwright). This suggests to me it's not a "Jest feature", but something that operates at a higher level.

Determining when a test is flaky

A comprehensive strategy must be able to determine flakiness without human intervention. Here's a possible approach:

  • Run the Jest test suite for a given project.
  • If the suite fails, take all failed unit tests in the run and request a re-run of Jest for just those tests.
  • If the suite fails again, repeat this process (up to a designated limit).

This is one way you could put flaky tests into a list of "known flaky tests" -- if, during the test phase of a project, you manage to get a specific unit test to both pass and fail on the same compiled code, you've proven it is flaky. If a test is proven flaky enough times (e.g. on N builds, where N might be 1, 10, or somewhere in between), it is put in quarantine.
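
To make that loop concrete, here is a minimal sketch in TypeScript. Jest's --json output format and --testNamePattern flag are real; everything else (the runJest() helper, the MAX_RERUNS limit, the orchestration) is hypothetical, not an existing Rush or Heft feature:

```ts
// flaky-detect.ts -- hypothetical sketch of the detection loop described above.
// Jest's --json output format is real; the orchestration around it is invented.
import { execSync } from 'child_process';

interface JestAssertion { fullName: string; status: string; }
interface JestFileResult { assertionResults: JestAssertion[]; }
interface JestJsonOutput { testResults: JestFileResult[]; }

const MAX_RERUNS: number = 3; // the "designated limit" -- an arbitrary choice

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Run Jest (optionally restricted to specific test names) and return the
// set of fully-qualified names of tests that failed.
function runJest(testNames?: string[]): Set<string> {
  const pattern = testNames
    ? ` --testNamePattern="${testNames.map(escapeRegExp).join('|')}"`
    : '';
  let raw: string;
  try {
    raw = execSync(`jest --json${pattern}`, { encoding: 'utf8' });
  } catch (e: any) {
    raw = e.stdout; // Jest exits nonzero on failures but still emits the JSON
  }
  const output: JestJsonOutput = JSON.parse(raw);
  const failed = new Set<string>();
  for (const file of output.testResults) {
    for (const test of file.assertionResults) {
      if (test.status === 'failed') failed.add(test.fullName);
    }
  }
  return failed;
}

// A test that both failed and passed on the same compiled code is flaky.
export function detectFlakyTests(): Set<string> {
  let failing = runJest();
  const flaky = new Set<string>();
  for (let attempt = 0; attempt < MAX_RERUNS && failing.size > 0; attempt++) {
    const stillFailing = runJest([...failing]);
    for (const name of failing) {
      if (!stillFailing.has(name)) flaky.add(name); // passed this time: flaky
    }
    failing = stillFailing;
  }
  return flaky;
}
```

The key invariant is the one stated above: a test name that appears in one run's failure set but drops out of a later run's, on unchanged code, is proven flaky.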

Putting tests in quarantine

Once a test has been determined flaky, one approach for dealing with the situation is to put it in quarantine. "Quarantined" tests are tests that CI still runs, but, if they fail, they do not fail the build.

A quarantined test is a big deal for a development team, as (even though it's not deleted) it no longer provides a quality gate. The list of currently quarantined tests could be kept highly visible, for example by using Danger to present it in a comment on each PR.
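
As one illustration of "runs but doesn't fail the build", a CI wrapper could subtract quarantined names from Jest's failure list before deciding the exit code. This is a hypothetical sketch: the quarantined-tests.json file, its format, and the gate() helper are all invented for this example.

```ts
// quarantine-gate.ts -- hypothetical CI step: quarantined failures are
// reported loudly but do not break the build.
import * as fs from 'fs';

// Assumed format: a JSON array of fully-qualified test names.
const quarantined: Set<string> = new Set<string>(
  JSON.parse(fs.readFileSync('quarantined-tests.json', 'utf8'))
);

export function gate(failedTestNames: string[]): void {
  const ignored = failedTestNames.filter((name) => quarantined.has(name));
  const real = failedTestNames.filter((name) => !quarantined.has(name));

  if (ignored.length > 0) {
    console.warn(`Ignoring ${ignored.length} quarantined failure(s):`);
    for (const name of ignored) console.warn(`  - ${name}`);
  }
  if (real.length > 0) {
    console.error(`${real.length} non-quarantined test failure(s).`);
    process.exit(1); // only unquarantined failures fail the build
  }
}
```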

Escaping from quarantine

For a unit test in quarantine to "escape", it must succeed multiple times in a row. You could run a periodic job on the main branch to achieve this, although it offers no benefit to a developer attempting to fix a quarantined test.

From the perspective of a developer tasked with fixing a unit test, an ideal approach would be similar to the process described in "Determining when a test is flaky", but in reverse:

  • Run the Jest test suite for a given project.
  • Collect the list of known quarantined tests, and for all of them that have succeeded, re-run them.
  • Continue to re-run quarantined tests that have been successful until some threshold has been reached.

The "fixed" state here could vary depending on context -- if a "fixed" test arrives in main, we can remove it from quarantine. If it's a PR build, perhaps a helpful message in a PR comment is more appropriate.
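
A sketch of that reverse loop, again hypothetical: it reuses a runJest() helper like the one in the detection sketch (returning the names of failed tests), and SUCCESS_THRESHOLD stands in for whatever "multiple times in a row" policy a team adopts.

```ts
// quarantine-escape.ts -- hypothetical sketch: keep re-running quarantined
// tests that passed until each accumulates enough consecutive successes.
const SUCCESS_THRESHOLD = 5; // consecutive passes required to escape (arbitrary)

export function checkForEscapes(
  quarantined: Set<string>,
  runJest: (testNames: string[]) => Set<string> // returns FAILED test names
): Set<string> {
  // Consecutive-success streak per test; a single failure disqualifies it.
  const streaks = new Map<string, number>([...quarantined].map((t) => [t, 0]));
  let candidates = new Set(quarantined);

  while (candidates.size > 0) {
    const failed = runJest([...candidates]);
    for (const name of [...candidates]) {
      if (failed.has(name)) {
        candidates.delete(name); // still flaky: it stays in quarantine
      } else {
        const streak = (streaks.get(name) ?? 0) + 1;
        streaks.set(name, streak);
        if (streak >= SUCCESS_THRESHOLD) candidates.delete(name); // escaped
      }
    }
  }
  // Tests that reached the threshold are candidates for leaving quarantine.
  return new Set(
    [...streaks].filter(([, n]) => n >= SUCCESS_THRESHOLD).map(([name]) => name)
  );
}
```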

Quarantine implementation

The implementation of a test quarantine is its own topic. I believe such a thing must be tracked outside the main git history of the monorepo to be successful, but that implies deciding exactly how they are stored (does the database of quarantined tests track what branch they are detected in, does it do commit analysis to determine where a fix was introduced, etc.).
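
Purely for illustration, a record in such an external database might track enough metadata to answer those questions; every field name below is an assumption, not part of any existing Rush feature.

```ts
// Hypothetical schema for one entry in an external quarantine database.
interface QuarantinedTest {
  project: string;            // Rush project the test belongs to
  fullName: string;           // fully-qualified test name (describe + it)
  detectedInBranch: string;   // branch where flakiness was first proven
  detectedAtCommit: string;   // commit SHA of the build that proved it
  flakyBuildCount: number;    // how many builds have proven it flaky
  quarantinedSince: string;   // ISO 8601 timestamp
  consecutivePasses: number;  // progress toward escaping quarantine
  fixedAtCommit?: string;     // set by commit analysis once a fix lands in main
}
```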

Standard questions

Please answer these questions to help us investigate your issue more quickly:

Question | Answer
@microsoft/rush globally installed version? |
rushVersion from rush.json? |
useWorkspaces from rush.json? |
Operating system? |
Would you consider contributing a PR? | Maybe!
Node.js version (node -v)? |

In the old days, Microsoft Office had an end-to-end test automation system called Big Button (BB) that performed this sort of analysis. It distinguished between tests that were only invoked manually and tests that blocked merging of a branch (so-called Branch Validation Tests, or BVTs). As I remember it, in order to enable a test as a BVT, that test had to prove its stability by completing 500 consecutive runs without any failures. The system would also automatically remove a BVT if it was detected to be "flaky", based on criteria such as failing a certain number of times in the main branch.

if, during the test phase of a project, you manage to get a specific unit test to both pass and fail on the same compiled code, you've proven it is flaky

IIRC the BVT flakiness detection did not consider failures in a feature branch, only in the main branch. The rationale is that half-baked source code can cause nondeterministic behavior that is not the fault of the test.

These same flakiness principles probably apply to all kinds of tests; however, it's unclear whether a single implementation can handle both unit tests and non-unit tests. (For this topic, the typical non-unit tests would be integration tests, end-to-end tests, and screen diff tests.) While Jest tests are always invoked by Rush and/or Heft, the launching of non-unit tests seems to vary widely across monorepos and even across projects within a single monorepo. I've seen approaches such as:

  • Non-unit tests are launched directly by rush test, which waits for them to complete
  • rush build writes JSON files that describe test selections, and then a separate CI task reads these JSON files, launches the tests, and reports the results separately from Rush (a hypothetical shape for such a file is sketched after this list)
  • An entirely separate CI pipeline queries Rush (rush list --impacted-by git:main) only to determine the affected projects
  • An entirely separate CI pipeline is triggered by Git globs and Rush is not involved at all
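
For that second approach, the JSON handoff file might look something like the following; this shape is invented here to illustrate the pattern, not taken from any real monorepo.

```ts
// Hypothetical shape of a test-selection file written during `rush build`
// and consumed by a separate CI task.
interface TestSelectionFile {
  project: string;          // e.g. "@my-company/shopping-cart"
  comparedAgainst: string;  // git ref the selection was computed from
  suites: Array<{
    kind: 'integration' | 'e2e' | 'screen-diff';
    specGlobs: string[];    // e.g. ["cypress/e2e/cart/**/*.cy.ts"]
  }>;
}
```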

Rather than trying to design a universal framework, would it make sense to start by solving the problem for Jest only?