Test single-step fails randomly
Here are the steps I follow to reproduce the issue:
- Check out the master branch and compile it
- Run from the shell, from within the tests directory:
  while ../scripts/rip-environment runtests -p -s single-step; do true; done
- Wait until it fails (perhaps do some other work on the machine while it runs)
This is what I expected to happen:
No failures at all.
This is what happened instead:
Random failures. Sometimes it runs 1, 2, 3, 4, 5, 6, ... times before failing, but every run so far has failed eventually. The point of failure in the ngc file also seems to vary (see attached logs).
It worked properly before this:
Unknown.
A CI failure after commit 697fe4d was suspicious. It turned out to be a failure in the single-step test, which I'd seen several times on my local machine.
Looking into the failure: it is always a timeout in tests/single-step/test-ui.py, in the function wait_complete_step(). Adding some debug messages seems to indicate that it always times out while the state is linuxcnc.EXEC_WAITING_FOR_MOTION_AND_IO (integer 7).
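For readers who don't have the test open: a wait helper of this kind typically just polls the status channel until the executor reports done, and gives up after a timeout. The sketch below illustrates such a loop using the linuxcnc Python module; the function name and timeout value are assumptions, and this is not the actual code from test-ui.py.

```python
import time
import linuxcnc

s = linuxcnc.stat()

def wait_complete_step(timeout=10.0):
    """Poll task status until the executor reports EXEC_DONE (illustrative)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        s.poll()
        if s.exec_state == linuxcnc.EXEC_DONE:
            return
        time.sleep(0.01)
    # In the failing runs, exec_state is still EXEC_WAITING_FOR_MOTION_AND_IO (7) here.
    raise RuntimeError("timed out, exec_state=%d" % s.exec_state)
```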
The fact that it fails randomly suggests a race condition somewhere.
Bisecting the problem led to the following: the change that makes the test fail randomly is commit c39c18b.
The previous commit (1a13a15) is stable, but c39c18b makes the test fail randomly. The good news is that the 2.9 branch is unaffected; it only happens on the master branch.
I have no idea how the change can have the impact it has, so I guess it exposes an existing race condition that gets triggered more reliably after the change.
Maybe someone with more knowledge of that code can shed some light on the issue? I think it is a rather pressing issue, with CI also failing randomly because of it.
Strange. I will debug this, but at the moment I'm clueless. Maybe it's just a race condition in the test?
To keep CI from failing, that test could be deactivated temporarily in the CI runs.
I can reproduce it. Really strange, but I think I found the issue; I just haven't checked yet where the "counterpart" of the race is.
That line seems to be the problem: c39c18b#diff-6b63c316bf18ff0c5cf5022c93028c8b01a86675d4c297a0db2c2221575040baR722
stat->currentLine = emcStatus->motion.traj.tag.fields[0];
If I remove that line, I can't trigger the error any more.
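If it helps anyone reproduce the observation, one way to watch the suspected race from the status side is a small polling script that logs exec_state together with current_line while the test loops. This is purely a hypothetical debugging sketch using the linuxcnc Python module, not code from the commit or the test suite:

```python
import time
import linuxcnc

# Hypothetical watcher: print every change of (exec_state, current_line) so a
# current_line jump while task sits in EXEC_WAITING_FOR_MOTION_AND_IO stands out.
s = linuxcnc.stat()
last = None
while True:
    s.poll()
    cur = (s.exec_state, s.current_line)
    if cur != last:
        print("exec_state=%d current_line=%d" % cur)
        last = cur
    time.sleep(0.005)
```

Run it in a second shell while the reproduction loop from the top of the issue is running.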