SLURM gpus-per-task issue #1102
I don't have any immediate insights into this, though I'll try to experiment with some parameters and see if I can duplicate the issue in the next few days. However, because this is a Slurm-specific issue, I'd definitely recommend posting to the slurm-users mailing list. That's the most likely path to get insights from experts specifically in Slurm.
I will post there as well, thank you for the hint. FYI, since the initial post I have upgraded hwloc (now 2.7.0), pmix (now 4.1.0), and OpenMPI (now 4.1.2), and I changed it because I had noticed errors of the form [not preserved] in [not preserved].

The earlier mentioned [not preserved]. What has changed with the above mentioned upgrades is the output in the case [not preserved], which now gives the following output: [output not preserved]
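As a side note, a quick way to confirm which versions of these components the compute nodes actually pick up is a check along these lines (a sketch only; it assumes mpirun, hwloc-info and pmix_info are on the node's PATH):

```bash
# Print the OpenMPI, hwloc and PMIx versions as seen from a compute node.
srun --nodes=1 --ntasks=1 bash -c '
  mpirun --version | head -n 1   # e.g. "mpirun (Open MPI) 4.1.2"
  hwloc-info --version           # e.g. "hwloc-info 2.7.0"
  pmix_info | head -n 5          # the leading lines include the PMIx version
'
# List the MPI plugin types this Slurm build knows about (e.g. pmix, pmi2, none):
srun --mpi=list
```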
@ajdecon have you perhaps had a chance to check whether you can duplicate the issue? I tried to post to the slurm-users mailing list, but it is no longer active; it is a read-only archive. I have posted to the Slurm Bugzilla, but so far I have not received any reply, since the bug is not backed by funding.
FWIW, while the Google group is read-only, the slurm-users list is still active and available via SchedMD's mailman interface: https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users

Unfortunately, I can't duplicate your issue in my local testing. Running Slurm 21.08.5 on my development cluster, I see this: [output not preserved]

Note the different UUIDs, indicating that each task is getting its own GPU (though each shows up as GPU 0 within its own task). Similarly, [not preserved].

Some config files that may be relevant... [contents not preserved]

And from [not preserved]: [contents not preserved]
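For context, the reproduction on a working setup looks roughly like this (a sketch based on the commands shown later in this thread; the exact command and output above were not preserved):

```bash
# Two tasks on one node, one GPU per task; each task prints the GPU(s) it sees.
# On a correctly working setup each task reports a different UUID, and each GPU
# appears as "GPU 0" inside its own task because of per-task device filtering.
srun --nodes=1 --ntasks=2 --gpus-per-task=1 \
  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
```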
@ajdecon thank you for your help; in the end the issue seems to have been caused by a misinterpretation of the parameter name and [not preserved]. The equivalent call for the DGX A100, with Cores specified in gres.conf: [output not preserved]

without it: [output not preserved]

System with all GPUs at Cores=0-15: [output not preserved]
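For readers without the original snippets: Cores in gres.conf ties each GPU to the CPU cores it is local to, and it appears to be the parameter whose name was misread here. A minimal sketch of the two variants being compared (device paths and core ranges are illustrative only, not the actual DGX A100 topology):

```bash
# Hypothetical gres.conf for a 2-GPU node (paths and core ranges are illustrative).
# "Cores" lists the CPU cores local to each GPU; Slurm uses it when matching
# GPUs to the CPUs allocated to each task.
cat <<'EOF' | sudo tee /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia0 Cores=0-7
Name=gpu File=/dev/nvidia1 Cores=8-15
EOF
# The other variant simply omits "Cores=...", removing CPU/GPU locality from the
# allocation decision. After changing gres.conf, restart slurmd on the node:
sudo systemctl restart slurmd
```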
It's not solved after all, it was just hiding :( It seems to be related to running via a container (enroot v3.2.0, pyxis v0.11.1):

$ srun --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvcr.io#nvidia/pytorch:22.02-py3 bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
pyxis: importing docker image ...
PROC_ID=0 No devices found.
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-43f1fb2e-11d8-30c4-1b4a-70bd77fc7e54)

vs. no container:

$ srun --nodes=1 --tasks=2 --gpus-per-task=1 bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
PROC_ID=0 GPU 0: NVIDIA RTX A5000 (UUID: GPU-80b6f32b-92a5-8495-5438-993f0d99d14b)
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-43f1fb2e-11d8-30c4-1b4a-70bd77fc7e54)
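One way to narrow down whether the filtering happens in Slurm or in the container runtime is to print the device-visibility variables in both cases. A sketch (my assumption here is that CUDA_VISIBLE_DEVICES reflects Slurm's per-task GPU selection, while NVIDIA_VISIBLE_DEVICES is what enroot/libnvidia-container consults when exposing GPUs inside the container):

```bash
# Bare metal: what does each task get from Slurm?
srun --nodes=1 --ntasks=2 --gpus-per-task=1 \
  bash -c 'echo "PROC_ID=$SLURM_PROCID CUDA=$CUDA_VISIBLE_DEVICES NVIDIA=$NVIDIA_VISIBLE_DEVICES"'

# Same thing inside the container image used above:
srun --nodes=1 --ntasks=2 --gpus-per-task=1 --container-image nvcr.io#nvidia/pytorch:22.02-py3 \
  bash -c 'echo "PROC_ID=$SLURM_PROCID CUDA=$CUDA_VISIBLE_DEVICES NVIDIA=$NVIDIA_VISIBLE_DEVICES"'
```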
Just for completeness, the issue is present even when running with --gres=gpu:2 and explicit GPU binding:

$ srun --nodes=1 --tasks=2 --gres=gpu:2 --gpu-bind=verbose,per_task:1 --container-image nvcr.io#nvidia/pytorch:22.02-py3 bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
gpu-bind: usable_gres=0x1; bit_alloc=0x5; local_inx=2; global_list=0; local_list=0
pyxis: importing docker image ...
gpu-bind: usable_gres=0x1; bit_alloc=0x5; local_inx=2; global_list=0; local_list=0
PROC_ID=0 No devices found.
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-7fd8c9f3-0360-7e15-cf0c-49d2fc79a046)
$ srun --nodes=1 --tasks=2 --gres=gpu:2 --gpu-bind=verbose,single:1 --container-image nvcr.io#nvidia/pytorch:22.02-py3 bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
gpu-bind: usable_gres=0x2; bit_alloc=0x5; local_inx=2; global_list=2; local_list=1
gpu-bind: usable_gres=0x1; bit_alloc=0x5; local_inx=2; global_list=0; local_list=0
pyxis: importing docker image ...
PROC_ID=0 No devices found.
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-7fd8c9f3-0360-7e15-cf0c-49d2fc79a046)
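For completeness, an explicit GPU map is another binding variant that can help localize the problem (a sketch; --gpu-bind=map_gpu is a standard srun option, but its result on this setup is not shown in the thread):

```bash
# Explicitly map task 0 -> GPU 0 and task 1 -> GPU 1, bypassing Slurm's automatic
# per-task selection. If this also fails only under pyxis, the problem is in the
# container hook rather than in Slurm's binding logic.
srun --nodes=1 --ntasks=2 --gres=gpu:2 --gpu-bind=verbose,map_gpu:0,1 \
  --container-image nvcr.io#nvidia/pytorch:22.02-py3 \
  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
```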
The [not preserved] reads: [not preserved], and without it reads: [not preserved]
@itzsimpl: Based on reading through NVIDIA/pyxis#73, it looks like your issue is addressed by setting [the option discussed there; not preserved]. Are you good to close the DeepOps issue? Or do you think there's still something to address here?
Yes, we can close this issue. Thank you for helping me track it down to the source.
I am writing this here as my Slurm cluster was initially deployed with DeepOps. It used to work as it should. After an upgrade to Ubuntu 20.04 and DGX OS 5 I also had to upgrade Slurm from 20.02 to 21.08.5; this is when certain sbatch parameters started misbehaving.
More details of my case.

Running with [command not preserved] will result in the processes sharing a single GPU: [output not preserved]. This, I suppose, is correct.

Running with [command not preserved] will result in only the last process receiving a GPU (note the different IDs in consecutive runs): [output not preserved]. However, in this case each process should receive its own GPU.

Running with [command not preserved] will result in each process having access to all 4 GPUs: [output not preserved]. This also is, I suppose, correct.

Running with [command not preserved] will again result in only the last process receiving a GPU: [output not preserved]. In this case again each task should receive its own GPU, but it does not.

Running with verbose binding information, [command not preserved], will give the following output: [output not preserved]
I have found a related nine-month-old Stack Overflow question, but apart from my own post there has been no response. Any ideas as to what could be causing this issue?
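The exact srun invocations above were not preserved in this excerpt; based on the commands shown in the comments, the comparison was roughly along the following lines. This is a sketch only: the task count, GPU counts, and the flags used for the "shared GPU" and "all 4 GPUs" cases are assumptions, not the original commands.

```bash
# Case 1 (assumed): one GPU shared by both tasks -- reported as behaving correctly.
srun --nodes=1 --ntasks=2 --gres=gpu:1 bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'

# Case 2: one GPU per task -- reported as misbehaving (only the last task sees a GPU).
srun --nodes=1 --ntasks=2 --gpus-per-task=1 bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'

# Case 3 (assumed): all four GPUs visible to every task -- reported as behaving correctly.
srun --nodes=1 --ntasks=2 --gres=gpu:4 bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'

# Case 4: per-task binding over a multi-GPU allocation -- also reported as misbehaving,
# with the verbose binding output shown in the comments above.
srun --nodes=1 --ntasks=2 --gres=gpu:2 --gpu-bind=verbose,per_task:1 \
  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
```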