Container does not start on a small set on cluster #122

lidavid88 · 2023-08-24T12:42:16Z

I have run into a problem running a container on a cluster.

I am using an NVIDIA PyTorch container created with enroot in the following submit script:

#!/usr/bin/env bash

#SBATCH --time=03:00:00
#SBATCH --gres=gpu:4
#SBATCH --nodes=8
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=501600mb

ml purge

run_num='0'

SRUN_PARAMS=(
  --mpi="pmi2"
  --gpus-per-task=1
  --gpu-bind="closest"
  --label
  --container-name=fcn
  --container-mounts=/etc/slurm/task_prolog.hk:/etc/slurm/task_prolog.hk,/scratch:/scratch,/hkfs/work/workspace/scratch/usr1234,/tmp,/usr/bin/srun:/usr/bin/srun
  --container-mount-home
  --container-writable
  --no-container-entrypoint
)

srun "${SRUN_PARAMS[@]}" bash -c "
  echo $run_num
"

On most nodes srun is executed and 0 is printed to the log.

On the remaining nodes I get two types of errors:

22: slurmstepd: error: pyxis: container start failed with error code: 1
22: slurmstepd: error: pyxis: printing enroot log file:
22: slurmstepd: error: pyxis:     /etc/enroot/hooks.d/10-shadow.sh: line 70: 3474706 Broken pipe             yes 2> /dev/null
22: slurmstepd: error: pyxis:          3474707 Segmentation fault      (core dumped) | pwck -R "${ENROOT_ROOTFS}" "${pwddb#${ENROOT_ROOTFS}}" /etc/shadow > /dev/null 2>&1
22: slurmstepd: error: pyxis:     nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig failed with error code: 1
22: slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
22: slurmstepd: error: pyxis: couldn't start container
22: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
22: slurmstepd: error: Failed to invoke spank plugin stack

21: slurmstepd: error: pyxis: couldn't start container
21: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
21: slurmstepd: error: Failed to invoke spank plugin stack
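Because srun runs with --label, each log line starts with the task rank (21 and 22 above), so the failing ranks can be mapped back to a node. The following is a minimal sketch, not part of the original script. It assumes Slurm's default block distribution, where rank divided by tasks per node gives the node index, and the function name and log file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list the task ranks that reported a pyxis start failure in a
# labelled srun log, together with the node index each rank falls on.
# Assumption: default block distribution (rank / ntasks-per-node = node index).

failed_node_indices() {
  local log_file=$1 tasks_per_node=$2
  # Matching lines look like:
  #   22: slurmstepd: error: pyxis: couldn't start container
  grep -E "^[0-9]+: slurmstepd: error: pyxis: couldn't start container" "$log_file" |
    cut -d: -f1 |
    sort -n -u |
    while read -r rank; do
      echo "rank=$rank node_index=$(( rank / tasks_per_node ))"
    done
}

# Usage (hypothetical log file name, 4 tasks per node as in the submit script):
#   failed_node_indices slurm-12345.out 4
```

The node index is zero-based, so index 5 is the sixth host printed by scontrol show hostnames "$SLURM_JOB_NODELIST" inside the job. If a different task distribution is in use, the mapping would need to be adjusted.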

This error does not appear if I use at most 4 nodes.

With 8 nodes the job works if I am lucky, but most of the time I get errors on some nodes.

My guess is that the inter-node communication is having trouble with pyxis.

Can someone help me with that?

Regards


flx42 · 2023-08-24T16:30:03Z

If the issue happens 0% of the time on some nodes and 100% of the time on others, I suggest you start by investigating the differences between the good nodes and the bad nodes:

Is it the same distro, Linux version, NVIDIA driver version?
Is it the same enroot version? Perhaps try to reinstall enroot on the bad nodes.
Check dmesg and the slurmd log on the bad nodes for any clue.
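The version checks above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are made up for illustration, it assumes enroot and nvidia-smi are on the PATH of the host (it is meant to run without any --container-* options), and it treats the most common combination of versions as the "good" one.

```shell
#!/usr/bin/env bash
# Sketch: collect one line of version info per node, then print the nodes
# whose versions differ from the most common combination.

# Prints "hostname|kernel|enroot|driver" for the node it runs on.
node_fingerprint() {
  local kernel enroot_ver driver
  kernel=$(uname -r)
  enroot_ver=$(enroot version 2>/dev/null || echo "unknown")
  driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
  echo "$(hostname)|${kernel}|${enroot_ver}|${driver:-unknown}"
}

# Reads fingerprint lines on stdin and prints those that differ from the majority.
report_outliers() {
  awk -F'|' '
    { key = $2 "|" $3 "|" $4; count[key]++; line[NR] = $0; keys[NR] = key }
    END {
      for (k in count) if (count[k] > best) { best = count[k]; common = k }
      for (i = 1; i <= NR; i++) if (keys[i] != common) print line[i]
    }'
}

# Usage inside an 8-node allocation, one task per node, no container:
#   srun --nodes=8 --ntasks-per-node=1 \
#     bash -c "$(declare -f node_fingerprint); node_fingerprint" | report_outliers
```

No output means every node reported the same kernel, enroot, and driver versions. In that case the dmesg and slurmd log check is the remaining lead, since identical versions do not rule out a per-node hardware or filesystem problem.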

lidavid88 · 2023-08-25T11:31:26Z

They seem to have the same versions and drivers.
