interuss/monitoring

Intermittently failing CI indicating test suite documentation needs to be regenerated

Closed this issue · 3 comments

Describe the bug
The main branch CI following the merge of #618 had a failing uss_qualifier CI test indicating:

+ docker run --name test_suite_docs_formatter --rm -v /home/runner/work/monitoring/monitoring:/app -e MONITORING_GITHUB_ROOT= interuss/monitoring uss_qualifier/scripts/in_container/format_test_suite_docs.sh --lint
Test suite documentation must be regenerated with `make format`: /app/monitoring/uss_qualifier/suites/uspace/flight_auth.md
Test suite documentation must be regenerated with `make format`: /app/monitoring/uss_qualifier/suites/uspace/required_services.md
Test suite documentation must be regenerated with `make format`: /app/monitoring/uss_qualifier/suites/faa/uft/message_signing.md
Test suite documentation must be regenerated with `make format`: /app/monitoring/uss_qualifier/suites/astm/utm/f3548_21.md
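
For context, this kind of lint typically follows a standard pattern: regenerate the derived documentation and fail if the result differs from the committed files. A minimal sketch of that pattern in Python (with hypothetical helper names; this is not the actual logic of format_test_suite_docs.sh):

from pathlib import Path

def lint_docs(md_paths, regenerate):
    # regenerate(path) is a hypothetical callable that renders the markdown
    # for the given test suite definition in memory.
    ok = True
    for path in map(Path, md_paths):
        if path.read_text() != regenerate(path):
            print(f"Test suite documentation must be regenerated with `make format`: {path}")
            ok = False
    return ok

# In CI, the wrapper script would exit non-zero when any file is stale, e.g.:
# sys.exit(0 if lint_docs(paths, regenerate) else 1)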

This is particularly strange because the repository hygiene CI test passed on the same CI run.

Then, the CI for a new PR (#619) indicated failure in both the Repository hygiene and uss_qualifier tests. This is at least self-consistent (if the uss_qualifier test fails for a hygiene reason, that hygiene failure should show up in the Repository hygiene test as well), but make lint does not indicate any failure locally for me, nor does make format change any files locally.

I reran all jobs in the main branch post-merge CI, and in this CI run, Repository hygiene passed and uss_qualifier test passed as well.

So, this is apparently an intermittent bug, which is actually more concerning than a reproducible one.

After the main branch CI completed successfully upon rerunning, I reran just the failing jobs for #619, but encountered the same failures. Then I tried rerunning all jobs for #619 and that worked. So, apparently all jobs should be rerun (not just the failed ones) if something like this is encountered in the future.

To reproduce
Presumably, trigger any CI run with code based on the current head of the main branch. However, the bug appears to be intermittent; see the description above.

Difference from expected behavior
Most importantly, the outcome of CI checks should be the same whether they are run in a GitHub action or on a developer's local machine.

Secondarily, the test suite documentation should not indicate that it needs to be regenerated.

Screenshots
[Screenshot of the failing CI check, captured 2024-04-03 at 4:08 PM]

System on which behavior was encountered
Primarily on GitHub. No failures indicated on a local Debian-variant Linux machine.

Codebase information
Output of git log -n 1:

$ git log -n 1
commit 110c5c4e4096093cbc941d37810772d1fae7a75c (HEAD -> main, interuss/main)
Author: Benjamin Pelletier <BenjaminPelletier@users.noreply.github.com>
Date:   Wed Apr 3 09:10:08 2024 -0700

    [uss_qualifier] Remove output_path from configuration (#618)
    
    * Remove output_path from configuration
    
    * Remove output_path from CI configs

Output of git status:

$ git status
On branch main
Your branch is up to date with 'interuss/main'.

nothing to commit, working tree clean

Additional context

This happened again on #623, and re-running all jobs did not fix the problem. Same with #624. Given this level of impact, I'm upgrading this issue to P1.

With #625, the issue appears to be that the test suite documentation generation is now interpreting the child-class, F3548-specific flight planner preparation scenario as its generic parent-class scenario. I'm not sure why this started happening. I suspect one of our unpinned dependencies (perhaps Python 3.11 itself) changed behavior, and that if I clear my Docker cache and rebuild, I will be able to reproduce. However, I have not yet had a chance to attempt this diagnostic step.
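
To illustrate the suspected failure mode (purely hypothetical names and logic, not the actual uss_qualifier implementation): if the documentation generator matches a scenario by an attribute shared between a parent and a child class, and collects the candidate classes into a set, then which class wins depends on set iteration order, which is not stable across runs:

class FlightPlannerPreparation:
    # Generic parent scenario
    doc_name = "Flight planner preparation"

class FlightPlannerPreparationF3548(FlightPlannerPreparation):
    # F3548-21-specific child scenario; inherits doc_name from its parent
    pass

def resolve_scenario(candidates, doc_name):
    # First match wins; class objects hash by id, so set iteration order
    # can differ from one interpreter run to the next.
    for cls in candidates:
        if cls.doc_name == doc_name:
            return cls

candidates = {FlightPlannerPreparation, FlightPlannerPreparationF3548}
print(resolve_scenario(candidates, "Flight planner preparation").__name__)
# May print either class name depending on the run, which would make the
# generated documentation differ between otherwise-identical CI runs.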

I'm having a look at this. As a side note, I noticed that make format fails on my M2 laptop (the pip install step in the Dockerfile fails):

Dockerfile:25
--------------------
  23 |     RUN mkdir -p /app/monitoring
  24 |     COPY ./requirements.txt /app/monitoring/requirements.txt
  25 | >>> RUN pip install -r /app/monitoring/requirements.txt
  26 |     
  27 |     RUN rm -rf __pycache__
--------------------

It seems specific to building on ARM, and I'll need to fix this first.

  • After some poking around: #628 shows that the behavior persists with Python 3.12
  • I can't reproduce locally even from a fresh docker build

After some more poking: I re-ran #627 three times without any failure. Possibly it solves the issue? (That PR bundles #626 together with the fix for ARM, to allow me to run make format locally.)