Using docker image on cluster gives error #3292
Labels
I: No breaking change (previously written code will work as before; no one should notice anything changing, aside from the fix)
P: Won't fix (no one will work on this in the near future; see comments for details)
S: Normal (handle this with default priority)
stale (automatic marker for inactivity, please have another look here)
T: External bug (not an issue that can be solved here; may need documentation, though)
Describe the bug
I ran a Python script on our university HPC cluster using the NEST 3.8 Docker image with OpenMPI and Singularity. When I submitted the job through Slurm, it produced the errors shown below.
To Reproduce
Steps to reproduce the behavior:
The main commands in the Slurm batch file were:
module load openmpi.gcc/4.0.3
module load singularity
srun --mpi=pmix singularity run ./nest.sif python3 simulation.py
[or]
mpirun -n 8 singularity run ./nest.sif python3 simulation.py
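For completeness, here is a minimal sketch of the batch file; the #SBATCH directives (job name, task count, walltime) are placeholders rather than the exact values used on our cluster, while the module loads and launch commands are the ones shown above.

#!/bin/bash
#SBATCH --job-name=nest-sim    # placeholder job name
#SBATCH --ntasks=8             # total MPI ranks, matching mpirun -n 8
#SBATCH --time=01:00:00        # placeholder walltime

# load the host-side MPI stack and Singularity
module load openmpi.gcc/4.0.3
module load singularity

# one NEST container instance per MPI rank
srun --mpi=pmix singularity run ./nest.sif python3 simulation.py
# alternative launch that produces the second error below:
# mpirun -n 8 singularity run ./nest.sif python3 simulation.py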
Screenshots
[when using srun --mpi=pmix]: PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
[when using mpirun -n 8]:
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "No permission" (-17) instead of "Success" (0)
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[28962,1],0]
Exit code: 1