If the issue happens 0% of the time on some nodes and 100% of the time on others, I suggest you start by investigating the differences between the good nodes and the bad nodes:
Is it the same distro, Linux version, NVIDIA driver version?
Is it the same enroot version? Perhaps try to reinstall enroot on the bad nodes.
Check dmesg and the slurmd log on the bad nodes for any clue.
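To make the comparison above less manual, here is a minimal Python sketch, not an official enroot or pyxis tool. It assumes `uname`, `nvidia-smi` and `enroot` are on the PATH of every node. The names `COMMANDS`, `collect_local_facts` and `find_mismatches`, and the node names and version strings in the example, are hypothetical and only for illustration.

```python
import subprocess
from collections import defaultdict

# Commands whose output is expected to be identical on good and bad nodes.
# Assumption: all three tools are installed and on PATH on every node.
COMMANDS = {
    "kernel": ["uname", "-r"],
    "nvidia_driver": ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    "enroot": ["enroot", "version"],
}


def collect_local_facts():
    """Run each command on the current node and return {fact_name: output}."""
    facts = {}
    for name, cmd in COMMANDS.items():
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            facts[name] = proc.stdout.strip() or proc.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting, so a missing tool shows up as a difference.
            facts[name] = f"unavailable: {exc}"
    return facts


def find_mismatches(facts_by_node):
    """Return {fact_name: {value: [nodes]}} for every fact that differs between nodes."""
    mismatches = {}
    names = sorted({name for facts in facts_by_node.values() for name in facts})
    for name in names:
        groups = defaultdict(list)
        for node, facts in sorted(facts_by_node.items()):
            groups[facts.get(name, "missing")].append(node)
        if len(groups) > 1:
            mismatches[name] = dict(groups)
    return mismatches


if __name__ == "__main__":
    # Hypothetical data: in practice, gather collect_local_facts() from each node.
    example = {
        "node01": {"kernel": "5.4.0-100", "enroot": "3.4.0"},
        "node02": {"kernel": "5.4.0-100", "enroot": "3.2.0"},
    }
    print(find_mismatches(example))
```

One way to use it is to run `collect_local_facts()` once on each node (for example, one task per node, outside the container), save the output keyed by hostname, and pass the combined dictionary to `find_mismatches`. Any fact that splits the nodes into the same groups as the good/bad split is a good place to look first.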
I have run into a problem when running a container on a cluster.
I am using an NVIDIA PyTorch container created with enroot in the following submit script:
On most nodes srun is executed and I get 0 printed to the log.
But on the other nodes I get 2 types of errors:
These errors do not appear if I use 4 nodes or fewer.
With 8 nodes the job works if I am lucky, but most of the time I get errors on some nodes.
My guess is that inter-node communication is having trouble with pyxis.
Can someone help me with that?
Regards