Using docker image on cluster gives error #3292
Labels
I: No breaking change (previously written code will work as before; no one should notice anything changing, aside from the fix)
P: Won't fix (no one will work on this in the near future; see comments for details)
S: Normal (handle this with default priority)
stale (automatic marker for inactivity, please have another look here)
T: External bug (not an issue that can be solved here; may need documentation, though)
Describe the bug
I ran a Python script on our university HPC cluster using the NEST 3.8 Docker image with OpenMPI and Singularity. When I submitted the job through Slurm, it produced the errors shown below.
To Reproduce
Steps to reproduce the behavior:
The main commands in the Slurm batch file were:
module load openmpi.gcc/4.0.3
module load singularity
srun --mpi=pmix singularity run ./nest.sif python3 simulation.py
[or]
mpirun -n 8 singularity run ./nest.sif python3 simulation.py
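For completeness, here is a minimal sketch of the batch file; the #SBATCH directives (job name, task count, walltime) are placeholders rather than the exact values used on our cluster, while the module loads and launch commands are the ones shown above.

#!/bin/bash
#SBATCH --job-name=nest-sim    # placeholder job name
#SBATCH --ntasks=8             # total MPI ranks, matching mpirun -n 8
#SBATCH --time=01:00:00        # placeholder walltime

# load the host-side MPI stack and Singularity
module load openmpi.gcc/4.0.3
module load singularity

# one NEST container instance per MPI rank
srun --mpi=pmix singularity run ./nest.sif python3 simulation.py
# alternative launch that produces the second error below:
# mpirun -n 8 singularity run ./nest.sif python3 simulation.py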
Screenshots
[when using srun --mpi=pmix]: PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
[when using mpirun -n 8]:
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "No permission" (-17) instead of "Success" (0)
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[28962,1],0]
Exit code: 1