AleoNet/snarkVM

Flaky Tests on CI

Closed this issue · 3 comments

๐Ÿ› Bug Report

Issue: Flaky Tests on CI
Severity: Medium

Description

We have observed that some tests in our CI pipeline exhibit flaky behavior, requiring multiple runs to pass. This inconsistency is affecting the reliability and efficiency of our development process.

Affected Tests

algorithms - This test often fails unpredictably, and the root cause is currently unknown.

Other tests not listed here may show the same behavior and need multiple attempts to pass.

Steps to Reproduce

  • Run the CI pipeline.
  • Observe the failure of the algorithms test (and potentially others) intermittently.
  • Re-run the failed tests.
  • Notice that the tests may pass on subsequent attempts.

Expected Behavior

All tests should pass consistently on the first run, provided that the code is correct.

Actual Behavior

The algorithms test (and potentially others) fails intermittently without any changes to the code.
These tests often require multiple attempts to pass, leading to wasted time and resources.

Impact

Decreases confidence in the CI results.
Slows down the development process due to the need for re-running tests.
Makes it difficult to identify genuine issues in the codebase.

Possible Causes

Race conditions or timing issues within the tests or the code being tested.
Environmental issues related to the CI infrastructure.
Dependencies on external services or resources that may not be consistently available.

Suggested Actions

Investigation and Diagnosis
- Conduct a thorough investigation to identify the root cause of the flakiness in the algorithms test.
- Review the test code and the associated application code for potential issues.

Test Stabilization
- Implement fixes to address any identified issues causing the flakiness.
- Ensure that tests do not have hidden dependencies on external resources or timing conditions.

Enhancement of CI Infrastructure
- Ensure that the CI environment is consistent and reliable.
- Consider introducing additional logging or diagnostics to capture more information about the failures.
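As one possible form of such diagnostics, the sketch below reruns a check several times and records which runs failed, with what message, and how long the slowest run took. It is a minimal illustration rather than anything taken from the snarkVM repository: `FlakinessReport` and `measure_flakiness` are hypothetical names, and the check is modeled as a closure returning `Result<(), String>`.

```rust
use std::time::{Duration, Instant};

/// Summary of repeated runs of one check (hypothetical helper type).
#[derive(Debug)]
struct FlakinessReport {
    runs: usize,
    /// (run index, error message) for every failed run.
    failures: Vec<(usize, String)>,
    /// Wall-clock time of the slowest run.
    slowest: Duration,
}

impl FlakinessReport {
    /// Fraction of runs that failed, in the range 0.0..=1.0.
    fn failure_rate(&self) -> f64 {
        if self.runs == 0 {
            0.0
        } else {
            self.failures.len() as f64 / self.runs as f64
        }
    }
}

/// Runs `check` `runs` times and records every failure instead of
/// stopping at the first one.
fn measure_flakiness(
    runs: usize,
    mut check: impl FnMut(usize) -> Result<(), String>,
) -> FlakinessReport {
    let mut failures = Vec::new();
    let mut slowest = Duration::ZERO;
    for run in 0..runs {
        let started = Instant::now();
        let outcome = check(run);
        slowest = slowest.max(started.elapsed());
        if let Err(message) = outcome {
            failures.push((run, message));
        }
    }
    FlakinessReport { runs, failures, slowest }
}
```

A failure rate measured this way on a fixed commit gives a baseline to compare against after a suspected fix, and the recorded messages help separate download errors from resource exhaustion.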

Documentation and Communication
- Document the findings and the steps taken to address the flaky tests.
- Communicate any changes to the team to ensure that everyone is aware of the improvements and any new best practices.

Additional Information

Please provide any logs or additional context that might help in diagnosing the issue.
If you have observed flaky behavior in other tests, please list them here as well.

Not sure if it helps but can we upgrade to Rust 1.79.0?

Some comments:

  • Don't think a Rust upgrade will help, flakiness has been an issue for a while
  • One frequent cause of flakiness across all crates is that parameter downloading fails - perhaps this is AWS rate-limiting
  • Separately from the failing downloads, resource usage does seem to be too high for the algorithms crate. Since this depends heavily on the particular environment, Provable will triage it independently on our own CI.
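If the parameter downloads are indeed being rate-limited, one common mitigation is to retry with exponential backoff. The sketch below is an assumption about how that could look, not the project's actual download code: `retry_with_backoff` is a hypothetical helper, the operation is any closure returning a `Result`, and the sleep function is injected so the logic can be tested without real waiting.

```rust
use std::time::Duration;

/// Calls `op` until it succeeds or `max_attempts` is reached, doubling the
/// delay after every failure. `op` receives the 1-based attempt number.
/// `sleep` is injected; production code could pass `std::thread::sleep`.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

With `max_attempts = 4` and an initial delay of 100 ms, a download that fails twice would wait 100 ms and then 200 ms before succeeding on the third attempt. Adding random jitter to each delay is a common refinement when many CI jobs hit the same endpoint at once.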

Funnily enough, after lowering the resource class (or perhaps thanks to a fix in one of the PRs), CI is now passing for algorithms:
https://app.circleci.com/pipelines/github/AleoNet/snarkVM/13211/workflows/44a17171-197b-4df2-95c3-58e4180b57f8/jobs/576904