os-autoinst test failures with v0.31

Question

os-autoinst test failures with v0.31

Opened this issue 3 years ago · 6 comments

Since version 0.31 we are seeing the following test failures"

[  212s] 3: 
[  212s] 3: #   Failed test 'exit logged'
[  212s] 3: #   at ./29-backend-driver.t line 31.
[  212s] 3: # STDOUT & STDERR:
[  212s] 3: # [37m[2021-12-06T17:11:16.953486Z] [debug] received magic close
[  212s] 3: # [0m
[  212s] 3: # don't match:
[  212s] 3: # (?^u:backend.*exited)
[  212s] 3: # as expected
[  212s] 3: [37m[2021-12-06T17:11:17.057808Z] [debug] backend process exited: 72057594037927935
[  212s] 3: [0m# Looks like you failed 1 test of 12.
[  212s] 3: [17:11:17] ./29-backend-driver.t .................... 
[  212s] 3: Dubious, test returned 1 (wstat 256, 0x100)
[  212s] 3: Failed 1/12 subtests

Also we see this:

Can't open out pipe for reading Interrupted system call at ./10-virtio_terminal.t line 44.

https://github.com/os-autoinst/os-autoinst/blob/master/t/29-backend-driver.t#L31

More details: https://progress.opensuse.org/issues/103422#note-20

Answer 1 · 2021-12-06T19:58:23.000Z

So it appears that the backend process exited: ... output appears when calling $driver->stop, one line below.

Answer 2 · 2021-12-07T11:03:21.000Z

@cfconrad do you have any ideas of what could be happening?

Answer 3 · 2021-12-07T13:29:41.000Z

This also leads to the worker not noticing when isotovideo exists, see https://progress.opensuse.org/issues/103605. The problematic commit is most likely 1e0addb. It would make most sense to temporarily block signals to avoid the race condition.

Answer 4 · 2021-12-07T13:32:05.000Z

I have bisected the problem with t/29-backend-driver.t from os-autoinst, and the responsible commit is indeed 1e0addb.

Answer 5 · 2021-12-07T14:53:06.000Z

The change postponed the collection of died children out of the signal handler. And $driver->stop_backend() just send some message to a process. What we check is the output of our current process and this is generated, once we hit the consume_collected_info() which is done e.g. in is_running()

This could be a patch for this specific test. But maybe there are other things where we relay on that unsynchronous calls

diff --git a/t/29-backend-driver.t b/t/29-backend-driver.t
index 6738ced3..bf7cf485 100755
--- a/t/29-backend-driver.t
+++ b/t/29-backend-driver.t
@@ -28,8 +28,10 @@ isnt $driver->{backend_process}, {}, 'backend process was started' or explain $d
 is $driver->extract_assets, undef, 'extract_assets';
 ok $driver->start_vm, 'start_vm';
 is $driver->mouse_hide, 0, 'mouse_hide';
-combined_like { is $driver->stop_backend, undef, 'stop_backend' } qr/backend.*exited/, 'exit logged';
-is $driver->stop, undef, 'stop';
+combined_like { 
+    is $driver->stop_backend, undef, 'stop_backend'; 
+    is $driver->stop, undef, 'stop';
+} qr/backend.*exited/, 'exit logged';
 done_testing;
 
 1;

I'm sorry for being so late in that discussion. I still think doing the work out of the signal handlers makes sense. I see that perl signalhandlers also try to make things more reliable by not interrupting the program at any point. But still, having clean points where we reap/collect the child's and doing things synchronous makes the code much more easy to debug/maintain/understand.

Answer 6 · 2021-12-08T08:22:44.000Z

Thx @Martchus for your approaches. Lets make the discussion in here, as we discuss about multiple PR's and it may be messy if we comment in each of them.

#31

From my point of view, the orphan collection is one highlight, as it might show you unhandled "unknown" processes in your
program.
I think with some more code on top, you could actually collect orphans as well, you just need to avoid the events while you
are in that fork/open3 phase. AFAICS it is the same as #33

#32

We are mixing "perl signalhandlers" with real UNIX blocking :), but I think this should not be a problem
As you already mentioned we have the problem with SIGCHLD blocked in the child. We might try to avoid this, by replacing _open() branch with a sane fork()+exec(), as we already do the redirection of stderr|stdout|stdin.

#33

I don't like the default value of emit_from_sigchld() it should be 0. As doing such things from an deferred code context is in my opinion wrong.
We emit the events sometimes from sighandler and sometimes not, I think it is a bad idea!
I see here also a problem with the _open() branch, as open3() dies if the program doesn't exists. And we would not reset the emit_from_sigchld() in such case. I think we can avoid this by running open3() in an eval{} block and forward the exception after resetting the emit_from_sigchld() value. But I'm not sure if such patterns in perl exists.