Workers interrupted prematurely with CLI Stata on Linux
leitec opened this issue · 9 comments
Preliminaries
Before submitting an issue, please check (with x in brackets) that you:
- Are using the newest release (see here for latest release version number).
- Have checked that the examples in the help work.
- Have read the help (HTML version) and the gallery of examples.
- Have checked that there is not already an existing issue for what you are reporting.
Expected behavior and actual behavior
On a Linux system, running parallel from an interactive Stata CLI session causes the worker processes to return immediately after parallel_run spawns them. Whatever work is in progress is interrupted when the main process attempts to clean up the workers and their temp directories.
This does not happen if using XStata or when running CLI Stata noninteractively.
The problem can be mitigated by adding wait as the last line of the script produced by parallel_run. This will wait for the backgrounded processes to fully complete. I do not know if this is the correct behavior, though. I do know that the timings reported in timers 97, 98 and 99 match before and after the change with XStata, so it does not appear to harm what was already working, at least in this simple example.
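A minimal sketch of the mitigation described above, assuming a script that backgrounds its workers the way the one produced by parallel_run is described as doing. The worker function, the marker file names, and the temp directory are hypothetical stand-ins; the real script launches Stata instances.

```shell
#!/bin/sh
# Hypothetical stand-in for the generated script: four placeholder workers
# are backgrounded, then `wait` blocks until all of them have exited.
workdir=$(mktemp -d)

worker() {
    # Simulate a short job that leaves a marker file when it completes.
    sleep 1
    echo done > "$workdir/finito_$1"
}

for i in 1 2 3 4; do
    worker "$i" &
done

# Without this line the script returns as soon as the workers are spawned.
wait

finished=$(ls "$workdir" | wc -l | tr -d ' ')
echo "workers finished: $finished"
rm -rf "$workdir"
```

With wait in place, all four marker files exist by the time the count is taken; without it, the count would normally be taken before any worker has finished.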
Steps to reproduce the problem
- Run stata in an interactive terminal on Linux
- Run the bootstrapping example in the README
- Observe that the four processes exit with error 700, and the finito files report an error 693
System information
Some relevant information
- Stata version and flavor (e.g. v14 MP): fails with 15.1 SE/MP and 16.1 SE/MP
- OS type and version (e.g. Windows 10): Ubuntu 18.04
- Parallel version: Git master as of 11/25/2020
Output from creturn list:
Is it possible to narrow this down to specific sections or sets of values you might need?
Thanks for submitting this and tracking down a likely fix. Are you able to fix the code so that it works for you? (For building the package see compile.do, and if you want to install there's also compile_and_install.do.) If so, could you submit a PR? The code currently writes a script for both Mac + Linux, so we'll have to think about whether the fix should apply to Mac also. Maybe for now, just make it Linux specific. For Mac, we might have the related issue #85.
Yes, I have tested it successfully. I'll change it to apply only to Linux and then submit a PR.
I just want to make sure I understand it correctly: parallel's expectation is that the backgrounded child processes will run and complete before returning control to Stata, right? That seems to be effectively what is happening when it works correctly, but it might not be what you want if e.g. one of the processes fails but the others continue running.
Thanks!
Thanks, George. That makes more sense.
I saw some of that while troubleshooting but I noticed that when it worked, the behavior appeared to be the same as if I added wait to the script. I had it print out the PIDs (in this loop: parallel/ado/parallel_run.mata, line 99 in e8de0a9).
But given that waiting to fully complete is not the expected behavior, wait is definitely not the solution. I will have a look again tomorrow, perhaps with a longer-running script and with a timer around the shell step.
It looks like there are two different issues.
The original issue I had about the workers terminating prematurely is due to procwait not working properly with a csh family shell. The 2>/dev/null syntax to redirect stderr does not work in that case, and parallel interprets the result as the processes being done (which might itself be a bug). One solution is to switch to the >& syntax, which works on at least bash, zsh, and tcsh:
shell `connection' kill -0 `anything' >&/dev/null; echo \$? > "`kill_exit_code'"
This syntax does not work with ksh and plainer sh varieties, though. I am not sure what the right approach here would be. I would imagine bash and zsh cover the vast majority of users, and most C shell users today would be using tcsh. Debian/Ubuntu does use dash as its default /bin/sh, but user shells are bash. It doesn't seem like a huge source of concern. What do you think?
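For reference, here is a rough sketch of what the kill -0 probe reports, written in the sh-family redirect form (the one a csh family shell rejects) so it runs under a plain POSIX shell. The temp file and variable names are hypothetical; only the kill -0 ...; echo $? > file pattern comes from the command quoted above.

```shell
#!/bin/sh
# kill -0 sends no signal; it only reports whether the PID can be signalled.
code_file=$(mktemp)

sleep 30 &
pid=$!

# Probe while the process is running: exit code 0 is written to the file.
kill -0 "$pid" >/dev/null 2>&1; echo $? > "$code_file"
alive_code=$(cat "$code_file")

# Stop the process and reap it so the PID no longer exists.
kill "$pid"
wait "$pid" 2>/dev/null || true

# Probe again: a non-zero exit code now means "no such process".
kill -0 "$pid" >/dev/null 2>&1; echo $? > "$code_file"
dead_code=$(cat "$code_file")

echo "while running: $alive_code, after exit: $dead_code"
rm -f "$code_file"
```

If the redirect itself is a syntax error in the user's shell, the probe never runs properly and whatever lands in the file is not a real liveness result, which is consistent with the premature "done" verdict described above.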
I tried calling a second shell (i.e. running the whole thing through /bin/sh -c) but that noticeably increased the execution time of the bootstrap. It actually increased a little even with the single shell, compared to when it was waiting for the child processes to exit. Ideally, there would be a way to check process status without the overhead of spawning a shell and using a temp file. How onerous is it to use a plug-in on Linux/Mac, too?
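The second-shell variant might look roughly like the sketch below. It is an illustration under assumptions, not the code that was tested: the probe label, the temp file, and the use of the current shell's own PID as the target are all hypothetical.

```shell
#!/bin/sh
# Run the probe inside /bin/sh so the redirect syntax no longer depends on
# the user's login shell. The single-quoted body is passed through verbatim.
code_file=$(mktemp)
pid=$$   # probe this script's own PID, which is certainly alive

/bin/sh -c 'kill -0 "$1" 2>/dev/null; echo $?' probe "$pid" > "$code_file"
wrapped_code=$(cat "$code_file")

echo "exit code via /bin/sh -c: $wrapped_code"
rm -f "$code_file"
```

Every probe now costs one extra process on top of the shell that Stata already spawns, which fits the slowdown reported above.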
The second issue is that I confirmed the current code (without wait) does wait for the child processes to complete before returning to Stata if you are running it noninteractively or in XStata. I added a timer around the shell command and it takes up the vast majority of execution time even on long-running scripts. The reason for this is that the stderr of the Stata child processes remains open, and Stata in these modes does not return until it is closed. The solution here is to just add 2>&1 to the Stata commands in the shell script to redirect stderr to the existing log files.
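The effect can be reproduced outside Stata with a small sketch. It assumes the caller reads the child's output through a pipe and returns only once every writer has closed it; that is an assumption about how Stata behaves in these modes, not something confirmed here. The elapsed helper, the log file, and the sleep stand-in for a Stata worker are hypothetical.

```shell
#!/bin/sh
# Time, in whole seconds, how long a captured command takes to return.
# The capture reads both stdout and stderr of the command through one pipe.
elapsed() {
    start=$(date +%s)
    out=$(sh -c "$1" 2>&1)
    end=$(date +%s)
    echo $((end - start))
}

log=$(mktemp)

# stdout goes to the log, but the backgrounded child still holds the
# caller's stderr pipe open, so the capture blocks until it exits.
held=$(elapsed "sleep 3 >>'$log' &")

# With 2>&1, stderr follows stdout into the log and the capture returns
# as soon as the launching shell exits.
freed=$(elapsed "sleep 3 >>'$log' 2>&1 &")

echo "stderr left open: ${held}s, stderr redirected: ${freed}s"
rm -f "$log"
```

The first call should take about as long as the background job itself, while the second should return almost immediately.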
I can submit a PR with both fixes, but I'll wait for your take on the shell compatibility limitation. Thanks.
Thanks for looking into these. As for shell compatibility, I do wish there was a better way, but don't know of one. If your solution covers a superset of shells, then I'd say that's good. We could see how easy it is to switch shells to run this command. If all it takes is changing the environment variable, then one can do SHELL=/bin/bash xstata ... at launch. To modify the env variable once we're running, we can use procenv on Windows, but we'd need a Mac/Linux solution. But I'd say let's stick with what's easiest, which is likely your fix.
Sure, happy to help. Thank you for the very useful package.
Yes, all that it takes is changing the SHELL variable prior to running Stata. I used that to test the different shells.
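As a quick illustration of the per-launch override, the sketch below uses sh -c as a stand-in for the Stata binary and shows that a variable assignment placed before a command is visible only to that command. The variable name seen is hypothetical.

```shell
#!/bin/sh
# A VAR=value prefix applies to the launched process only; the calling
# shell's own SHELL variable is left untouched.
seen=$(SHELL=/bin/bash sh -c 'echo "$SHELL"')
echo "child process saw SHELL=$seen"
```

Replacing the sh -c stand-in with stata or xstata gives the SHELL=/bin/bash xstata ... form mentioned earlier in the thread.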