os-autoinst test failures with v0.31
Opened this issue · 6 comments
Since version 0.31 we are seeing the following test failures"
[ 212s] 3:
[ 212s] 3: # Failed test 'exit logged'
[ 212s] 3: # at ./29-backend-driver.t line 31.
[ 212s] 3: # STDOUT & STDERR:
[ 212s] 3: # [37m[2021-12-06T17:11:16.953486Z] [debug] received magic close
[ 212s] 3: # [0m
[ 212s] 3: # don't match:
[ 212s] 3: # (?^u:backend.*exited)
[ 212s] 3: # as expected
[ 212s] 3: [37m[2021-12-06T17:11:17.057808Z] [debug] backend process exited: 72057594037927935
[ 212s] 3: [0m# Looks like you failed 1 test of 12.
[ 212s] 3: [17:11:17] ./29-backend-driver.t ....................
[ 212s] 3: Dubious, test returned 1 (wstat 256, 0x100)
[ 212s] 3: Failed 1/12 subtests
Also we see this:
Can't open out pipe for reading Interrupted system call at ./10-virtio_terminal.t line 44.
https://github.com/os-autoinst/os-autoinst/blob/master/t/29-backend-driver.t#L31
More details: https://progress.opensuse.org/issues/103422#note-20
So it appears that the backend process exited: ...
output appears when calling $driver->stop
, one line below.
@cfconrad do you have any ideas of what could be happening?
This also leads to the worker not noticing when isotovideo exists, see https://progress.opensuse.org/issues/103605. The problematic commit is most likely 1e0addb. It would make most sense to temporarily block signals to avoid the race condition.
I have bisected the problem with t/29-backend-driver.t
from os-autoinst, and the responsible commit is indeed 1e0addb.
The change postponed the collection of died children out of the signal handler. And $driver->stop_backend()
just send some message to a process. What we check is the output of our current process and this is generated, once we hit the consume_collected_info()
which is done e.g. in is_running()
This could be a patch for this specific test. But maybe there are other things where we relay on that unsynchronous calls
diff --git a/t/29-backend-driver.t b/t/29-backend-driver.t
index 6738ced3..bf7cf485 100755
--- a/t/29-backend-driver.t
+++ b/t/29-backend-driver.t
@@ -28,8 +28,10 @@ isnt $driver->{backend_process}, {}, 'backend process was started' or explain $d
is $driver->extract_assets, undef, 'extract_assets';
ok $driver->start_vm, 'start_vm';
is $driver->mouse_hide, 0, 'mouse_hide';
-combined_like { is $driver->stop_backend, undef, 'stop_backend' } qr/backend.*exited/, 'exit logged';
-is $driver->stop, undef, 'stop';
+combined_like {
+ is $driver->stop_backend, undef, 'stop_backend';
+ is $driver->stop, undef, 'stop';
+} qr/backend.*exited/, 'exit logged';
done_testing;
1;
I'm sorry for being so late in that discussion. I still think doing the work out of the signal handlers makes sense. I see that perl signalhandlers also try to make things more reliable by not interrupting the program at any point. But still, having clean points where we reap/collect the child's and doing things synchronous makes the code much more easy to debug/maintain/understand.
Thx @Martchus for your approaches. Lets make the discussion in here, as we discuss about multiple PR's and it may be messy if we comment in each of them.
- From my point of view, the orphan collection is one highlight, as it might show you unhandled "unknown" processes in your
program. - I think with some more code on top, you could actually collect orphans as well, you just need to avoid the events while you
are in that fork/open3 phase. AFAICS it is the same as #33
- We are mixing "perl signalhandlers" with real UNIX blocking :), but I think this should not be a problem
- As you already mentioned we have the problem with SIGCHLD blocked in the child. We might try to avoid this, by replacing
_open()
branch with a sanefork()+exec()
, as we already do the redirection ofstderr|stdout|stdin
.
- I don't like the default value of
emit_from_sigchld()
it should be 0. As doing such things from an deferred code context is in my opinion wrong. - We emit the events sometimes from sighandler and sometimes not, I think it is a bad idea!
- I see here also a problem with the
_open()
branch, asopen3()
dies if the program doesn't exists. And we would not reset theemit_from_sigchld()
in such case. I think we can avoid this by runningopen3()
in aneval{}
block and forward the exception after resetting theemit_from_sigchld()
value. But I'm not sure if such patterns in perl exists.