Child process exited with error 700 when using 2 nodes #93

Open · mangelett opened this issue Jul 28, 2021 · 4 comments

@mangelett

Preliminaries

Before submitting an issue, please check (with x in brackets) that you:

  • Are using the newest release (see here for latest release version number).
  • Have checked that the examples in the help work.
  • Have read the help (HTML version) and the gallery of examples.
  • Have checked that there is not already an existing issue for what you are reporting.

Expected behavior and actual behavior

I'm trying to run the parallel command on two nodes of an HPC cluster using the hostnames option (h()) of parallel initialize. When I specify the hostnames, I obtain the error "child process 0002 Exited with error -700- while running the command/dofile (view log)...". The logfile __pll[pll_id]_do0002.log is empty.

The command works fine without the hostnames option (i.e., when working on one node only).

Steps to reproduce the problem

The following code is saved in the file test_parallel.do:

parallel initialize 2, f h("localhost cn07") 
sysuse auto
parallel, by(foreign) : egen maxp = max(price)

The code is launched with the command stata test_parallel.do inside a SLURM batch file (which requests the node cn07).
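
For context, a minimal sketch of such a batch script (only the stata test_parallel.do command and the cn07 node come from the report above; the job name, node count, and time limit are hypothetical):

#!/bin/bash
#SBATCH --job-name=test_parallel   # hypothetical job name
#SBATCH --nodes=2                  # one node for the master Stata instance plus cn07
#SBATCH --nodelist=cn07            # ensure cn07 (the host passed to h() above) is allocated
#SBATCH --time=00:10:00            # hypothetical time limit

stata test_parallel.do             # the launch command described above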

System information

  • Stata version and flavor (e.g. v14 MP): Stata16-MP
  • OS type and version (e.g. Windows 10): CentOS Linux release 7.5.1804
  • Parallel version: 1.20.0 19mar2019

Output from creturn list:

@gvegayon gvegayon self-assigned this Sep 22, 2021
@gvegayon gvegayon added the bug label Sep 22, 2021
@gvegayon (Owner)

Working with Slurm can be tricky sometimes. One key issue I've seen in the past is nodes' access to filesystems. For parallel to work, all nodes need to have I/O access to the data and tempfiles. This issue seems to be a bug. Thanks for reporting.
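
A quick way to see which locations are involved, using only built-in creturn values (a sketch; it does not by itself prove that the other node can reach them):

display "`c(pwd)'"      // working directory of the master Stata instance
display "`c(tmpdir)'"   // directory Stata uses for temporary files
* Both paths should point to storage that every node running a child instance can read and write.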

@mangelett (Author)

Normally, the nodes do have I/O access to the data and tempfiles: the data are on a file system shared among the nodes, and I set the TMPDIR variable to a folder on this shared file system (originally to avoid saturating a node's local disk space).

@gvegayon (Owner) commented Nov 8, 2021

Sorry for the late reply. Can you verify that Stata recognizes the TMPDIR variable as the shared path you specified when submitting the jobs?
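
For example, something along these lines from within a submitted job (a sketch; junk is just a scratch macro name):

display "`c(tmpdir)'"   // should echo the shared path set via TMPDIR
tempfile junk           // ask Stata for a temporary filename
display "`junk'"        // the name should fall under that same shared path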

@mangelett (Author)

The command tempfile junk followed by display "`junk'" prints a tempfile that is in the shared folder I specified in the TMPDIR variable, so it seems Stata recognizes the shared path. Besides, the logfiles __pllul97ezlin1__do0001.log and __pllul97ezlin1__do0002.log are in this folder.
