gvegayon/parallel

Child process exited with error 700 when using 2 nodes

mangelett opened this issue · 4 comments

Preliminaries

Before submitting an issue, please check (with x in brackets) that you:

  • Are using the newest release (see here for latest release version number).
  • Have checked that the examples in the help work.
  • Have read the help (HTML version) and the gallery of examples.
  • Have checked that there is not already an existing issue for what you are reporting.

Expected behavior and actual behavior

I'm trying to run the parallel command on two nodes of an HPC cluster using the hostnames option of parallel initialize. When I specify the hostnames, I get the error "child process 0002 Exited with error -700- while running the command/dofile (view log)...". The log file __pll[pll_id]_do0002.log is empty.

The command works fine without the hostnames option (working only on one node).

Steps to reproduce the problem

The following code is saved in the file test_parallel.do:

parallel initialize 2, f h("localhost cn07") 
sysuse auto
parallel, by(foreign) : egen maxp = max(price)

The code is launched with the command stata test_parallel.do inside a SLURM batch file (which requests the node cn07).

System information

  • Stata version and flavor (e.g. v14 MP): Stata16-MP
  • OS type and version (e.g. Windows 10): CentOS Linux release 7.5.1804
  • Parallel version: 1.20.0 19mar2019

Output from creturn list:

Working with Slurm can be tricky sometimes. One key issue I've seen in the past is the nodes' access to filesystems. For parallel to work, all nodes need to have I/O access to the data and tempfiles. This issue seems to be a bug. Thanks for reporting.

Normally, the nodes have I/O access to the data and tempfiles: the data are on a file system shared among the nodes, and I set the TMPDIR variable to a folder on this shared file system (originally to avoid saturating the node's disk space).

Sorry for the late reply. Can you verify that Stata recognizes the TMPDIR variable as the shared path you specified when submitting the jobs?

The command tempfile junk; display "`junk'" prints a tempfile path in the shared folder that I specified in the TMPDIR variable, so it seems Stata recognizes the shared path. Besides, the log files __pllul97ezlin1__do0001.log and __pllul97ezlin1__do0002.log are in this folder.
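
For anyone reproducing the check described above, a minimal sketch follows. It assumes TMPDIR was exported in the SLURM batch file before Stata started. It uses only built-in Stata features (the `: environment` extended macro function, `c(tmpdir)`, and `tempfile`) and is not part of parallel itself.

```stata
* Sketch: compare the TMPDIR environment variable with the directory
* Stata actually uses for temporary files.
local envtmp : environment TMPDIR
display "TMPDIR (environment): `envtmp'"
display "c(tmpdir):            `c(tmpdir)'"

* Create a tempfile name and print its full path; it should sit
* inside the shared folder if Stata picked up TMPDIR.
tempfile junk
display "tempfile path:        `junk'"
```

Note that this only reports what the Stata session on the submitting node sees. It does not by itself confirm that the second node (cn07 here) can read and write the same path.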