Child process exited with error 700 when using 2 nodes #93

Open · mangelett opened this issue Jul 28, 2021 · 4 comments

@mangelett

Preliminaries

Before submitting an issue, please check (with x in brackets) that you:

  • Are using the newest release (see here for latest release version number).
  • Have checked that the examples in the help work.
  • Have read the help (HTML version) and the gallery of examples.
  • Have checked that there is not already an existing issue for what you are reporting.

Expected behavior and actual behavior

I'm trying to run the parallel command on two nodes of an HPC cluster using the hostnames option (h()) of parallel initialize. When I specify the hostnames, I obtain the error "child process 0002 Exited with error -700- while running the command/dofile (view log)...". The logfile __pll[pll_id]_do0002.log is empty.

The command works fine without the hostnames option (i.e., when working on one node only).

Steps to reproduce the problem

The following code is saved in the file test_parallel.do:

parallel initialize 2, f h("localhost cn07") 
sysuse auto
parallel, by(foreign) : egen maxp = max(price)

The code is launched with the command stata test_parallel.do inside a SLURM batch file (which requests the node cn07).
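
For context, a minimal sketch of such a batch script (only the stata test_parallel.do command and the cn07 node come from the report above; the job name, node count, and time limit are hypothetical):

#!/bin/bash
#SBATCH --job-name=test_parallel   # hypothetical job name
#SBATCH --nodes=2                  # one node for the master Stata instance plus cn07
#SBATCH --nodelist=cn07            # ensure cn07 (the host passed to h() above) is allocated
#SBATCH --time=00:10:00            # hypothetical time limit

stata test_parallel.do             # the launch command described above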

System information

  • Stata version and flavor (e.g. v14 MP): Stata16-MP
  • OS type and version (e.g. Windows 10): CentOS Linux release 7.5.1804
  • Parallel version: 1.20.0 19mar2019

Output from creturn list:

@gvegayon gvegayon self-assigned this Sep 22, 2021
@gvegayon gvegayon added the bug label Sep 22, 2021
@gvegayon (Owner)

Working with Slurm can be tricky sometimes. One key issue I've seen in the past is nodes' access to filesystems. For parallel to work, all nodes need to have I/O access to the data and tempfiles. This issue seems to be a bug. Thanks for reporting.
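
A quick way to see which locations are involved, using only built-in creturn values (a sketch; it does not by itself prove that the other node can reach them):

display "`c(pwd)'"      // working directory of the master Stata instance
display "`c(tmpdir)'"   // directory Stata uses for temporary files
* Both paths should point to storage that every node running a child instance can read and write.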

@mangelett (Author)

Normally, the nodes do have I/O access to the data and tempfiles: the data are on a file system shared among the nodes, and I set the TMPDIR variable to a folder on this shared file system (originally to avoid saturating a node's local disk space).

@gvegayon (Owner) commented Nov 8, 2021

Sorry for the late reply. Can you verify that Stata recognizes the TMPDIR variable as the shared path you specified when submitting the jobs?
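
For example, something along these lines from within a submitted job (a sketch; junk is just a scratch macro name):

display "`c(tmpdir)'"   // should echo the shared path set via TMPDIR
tempfile junk           // ask Stata for a temporary filename
display "`junk'"        // the name should fall under that same shared path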

@mangelett (Author)

The command tempfile junk followed by display "`junk'" prints a tempfile that is in the shared folder I specified in the TMPDIR variable, so it seems Stata recognizes the shared path. Besides, the logfiles __pllul97ezlin1__do0001.log and __pllul97ezlin1__do0002.log are in this folder.
