Container does not start on a small subset of cluster nodes #122

Open
lidavid88 opened this issue Aug 24, 2023 · 2 comments

Comments

@lidavid88

I have run into a problem when running a container on a cluster.

I am using an NVIDIA PyTorch container created with enroot, launched from the following submit script:

#!/usr/bin/env bash

#SBATCH --time=03:00:00
#SBATCH --gres=gpu:4
#SBATCH --nodes=8
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=501600mb

ml purge

run_num='0'

SRUN_PARAMS=(
  --mpi="pmi2"
  --gpus-per-task=1
  --gpu-bind="closest"
  --label
  --container-name=fcn
  --container-mounts=/etc/slurm/task_prolog.hk:/etc/slurm/task_prolog.hk,/scratch:/scratch,/hkfs/work/workspace/scratch/usr1234,/tmp,/usr/bin/srun:/usr/bin/srun
  --container-mount-home
  --container-writable
  --no-container-entrypoint
)

srun "${SRUN_PARAMS[@]}" bash -c "
  echo $run_num
"

On most nodes, srun is executed and I get 0 printed to the log.

On the remaining nodes, I get one of two types of errors:

22: slurmstepd: error: pyxis: container start failed with error code: 1
22: slurmstepd: error: pyxis: printing enroot log file:
22: slurmstepd: error: pyxis:     /etc/enroot/hooks.d/10-shadow.sh: line 70: 3474706 Broken pipe             yes 2> /dev/null
22: slurmstepd: error: pyxis:          3474707 Segmentation fault      (core dumped) | pwck -R "${ENROOT_ROOTFS}" "${pwddb#${ENROOT_ROOTFS}}" /etc/shadow > /dev/null 2>&1
22: slurmstepd: error: pyxis:     nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig failed with error code: 1
22: slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
22: slurmstepd: error: pyxis: couldn't start container
22: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
22: slurmstepd: error: Failed to invoke spank plugin stack

21: slurmstepd: error: pyxis: couldn't start container
21: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
21: slurmstepd: error: Failed to invoke spank plugin stack

This error does not appear if I use at most 4 nodes.

With 8 nodes, the job occasionally works, but most of the time I get errors on some nodes.
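Because srun runs with --label, every log line starts with the task rank, so the failing ranks can be pulled out of the log and compared across runs to see whether the same nodes fail each time. The following is a minimal sketch, not part of the original job. The function names are hypothetical, and the rank-to-node mapping assumes block distribution with 4 tasks per node as in the script above.

```shell
#!/usr/bin/env bash
# Sketch: list the task ranks whose container failed to start, based on the
# "<rank>: ..." prefix that `srun --label` adds to every output line.

# failed_ranks (hypothetical helper): read a labeled job log on stdin and
# print each failing rank once, in numeric order.
failed_ranks() {
  grep "pyxis: couldn't start container" \
    | sed -E 's/^[[:space:]]*([0-9]+):.*/\1/' \
    | sort -n -u
}

# rank_to_node_index (hypothetical helper): map a rank to a node index.
# ASSUMPTION: tasks are distributed block-wise with a fixed number of tasks
# per node (second argument, default 4 to match --ntasks-per-node=4).
rank_to_node_index() {
  echo $(( $1 / ${2:-4} ))
}
```

Under that assumption, ranks 21 and 22 from the log above both map to node index 5, which would point at a single node. The index can be matched to a host name by counting into the output of `scontrol show hostnames "$SLURM_JOB_NODELIST"`.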

My guess is that inter-node communication is having trouble with pyxis.

Can someone help me with that?

Regards

@flx42
Member

flx42 commented Aug 24, 2023

If the issue happens 0% of the time on some nodes and 100% of the time on other nodes, I suggest you start by investigating the differences between the good nodes and the bad nodes:

  • Is it the same distro, Linux version, NVIDIA driver version?
  • Is it the same enroot version? Perhaps try to reinstall enroot on the bad nodes.
  • Check dmesg and the slurmd log on the bad nodes for any clue.

@lidavid88
Author

They seem to have the same versions and drivers.
