bu-rcs/PkgAutoTest

processes running GUI Tests not terminating in Nextflow


@bu-bgregor

Hey Brian,

I have some information regarding our GUI tests not terminating:

  1. Adding "killall Xvfb" to the test.qsub does resolve the issue.

  2. I found the following two Github issues related to this behavior:
    nextflow-io/nextflow#3512
    nextflow-io/nextflow#1753 (comment)

From what I gather, the "trap" commands in the .command.run file are preventing the processes from terminating properly. The traps appear to have been added to fix a bug: when an SGE job was suspended, Nextflow would assume the job had terminated and end the process. The traps were meant to prevent that accidental termination, but they may also be ignoring the signal that the timeout command sends.
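To illustrate the mechanism I am describing (this is only a sketch, not the contents of Nextflow's .command.run), here is a child shell that either installs a no-op handler for USR1 or leaves the default disposition in place. The function name run_child and the timings are made up for the example.

```shell
# Illustration only: this is NOT what Nextflow writes into .command.run.
# Start a short-lived child shell, send it USR1, and print its exit status.
# $1 = "trap" installs a no-op USR1 handler; anything else keeps the default.
run_child() {
    local mode=$1
    bash -c '
        [ "$1" = trap ] && trap ":" USR1
        # Sleep in short steps so a pending handler can run between them.
        for i in 1 2 3 4 5 6 7 8 9 10; do sleep 0.2; done
    ' _ "$mode" &
    local pid=$!
    sleep 0.5               # give the child time to install its handler
    kill -USR1 "$pid"
    wait "$pid"
    echo $?
}
```

With the handler installed the child should run to completion and report 0; without it the child is terminated by the signal and the reported status is above 128.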

I wonder whether we could use our own "trap" command in the process code block to work around this behavior.

Suggestion from the meeting: edit pkgtest.nf so that, after the test.qsub file is copied to the working directory, it appends a line to the end of the file, something like:

echo " killall Xvfb &> /dev/null" >> test.qsub

If Xvfb is not running, this causes no harm; if it is running, it will be shut down. This seems easier than trying to fix the Nextflow trap of the USR1 signal.
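If the append could run more than once on the same file, a small guard keeps the line from piling up. This is only a sketch: the helper name append_xvfb_cleanup and the duplicate check are my assumptions, not something that exists in pkgtest.nf.

```shell
# Hypothetical helper: append the Xvfb cleanup line to a qsub script,
# but only if that exact line is not already present.
append_xvfb_cleanup() {
    local qsub_file=$1
    local cleanup='killall Xvfb &> /dev/null'
    # -x matches the whole line, -F treats the pattern as a fixed string.
    grep -qxF -- "$cleanup" "$qsub_file" 2>/dev/null \
        || printf '%s\n' "$cleanup" >> "$qsub_file"
}
```

Calling append_xvfb_cleanup test.qsub twice leaves a single cleanup line at the end of the file.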

I ran another test and the results can be found here:
/projectnb/rcstest/milechin/PkgAutoTest/test

It seems the killall Xvfb command works for most modules; the exceptions are grass/7.8.3 and openvsp/3.36.0.

This time only 40 modules failed in the report (report_module8_list.csv).

I investigated the grass/7.8.3 and openvsp/3.36.0 failures further.

For openvsp/3.36.0, the test.qsub file (/share/pkg.8/openvsp/3.36.0/test/test.qsub) contains the following code block:

if [ $TEST_COMPLETE -eq -1 ]
then 
       echo "No test has been performed. TEST_COMPLETE=$TEST_COMPLETE" >&2
       exit 255  # the bash exit code range
else 
       echo "Tests have been performed. TEST_COMPLETE=$TEST_COMPLETE" >&2
       exit 0  # the bash exit code range
fi 

Both branches of this block call exit, and Nextflow puts the kill command after the block, so the kill command is never executed and the process runs until the job terminates. @bu-bgregor this is your module.
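One general way to make cleanup independent of where a script exits is an EXIT trap registered near the top. The sketch below only shows the idea and is not necessarily how the module should be changed: run_with_cleanup and the marker file are hypothetical stand-ins, and in a real test.qsub the trap body would be the Xvfb kill command.

```shell
# Sketch only: the marker file stands in for "killall Xvfb".
run_with_cleanup() {
    local marker=$1 test_complete=$2
    (
        # Registered first, so it fires on every exit path below.
        trap 'echo cleaned > "$marker"' EXIT
        if [ "$test_complete" -eq -1 ]; then
            exit 255
        else
            exit 0
        fi
        # Anything appended here is never reached.
        echo "unreachable"
    )
}
```

The subshell's exit status is still the one passed to exit, so the pass/fail result of the test is unchanged while the cleanup runs in both branches.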

For grass/7.8.3, it seems that during the test one of the Python child processes ends up in the Z (zombie) state and the test hangs. Since it is my module, I will re-evaluate the test and maybe simplify it.
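For anyone who wants to check for this state themselves, here is a rough sketch that lists the direct children of a given PID that are zombies. It assumes the procps version of ps (for --ppid), and the function name zombie_children is made up. Note that a zombie has already exited and cannot be killed; it stays until its parent reaps it, so the process to look at is usually the parent.

```shell
# Print the PIDs of direct children of PID $1 whose state starts with Z.
# Assumes procps ps, which supports --ppid and header-less "name=" columns.
zombie_children() {
    ps -o pid=,stat= --ppid "$1" | awk '$2 ~ /^Z/ { print $1 }'
}
```

For example, zombie_children "$PARENT_PID" prints nothing when that process has no zombie children.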

openvsp/3.36.0 has been fixed.

Xvfb kill command added by Nextflow:

pgrep -P $$ -f Xvfb | while read line ; do kill -9 \$line; done

@milechin - for grass you can add this to your test.qsub:

pkill -P $$

or to use the equivalent of "kill -9":

pkill -P $$ --signal 9
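A quick way to see what these two commands do, as a sketch rather than something to paste into test.qsub: the hypothetical function below starts a throwaway shell with one long-running child, runs pkill -P $$ inside it (plus any extra options you pass), and prints the child's exit status. It assumes the procps-ng pkill.

```shell
# Demo only: extra arguments are passed straight to pkill,
# e.g. demo_pkill_children --signal 9
demo_pkill_children() {
    bash -c '
        sleep 60 &              # stand-in for a stuck child process
        child=$!
        pkill -P $$ "$@"        # signal every direct child of this shell
        wait "$child"
        echo $?
    ' _ "$@"
}
```

With no arguments the child is ended by SIGTERM and the status is 143 (128 + 15); with --signal 9 it is 137 (128 + 9).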