SLURM gpus-per-task issue #1102

Closed
itzsimpl opened this issue Feb 3, 2022 · 10 comments

itzsimpl commented Feb 3, 2022

I am writing this here as my SLURM cluster was initially deployed with deepops. It used to work as it should. After an upgrade to Ubuntu 20.04 and DGX OS 5, I also had to upgrade SLURM from 20.02 to 21.08.5. This is when certain SBATCH parameters started misbehaving.

Here are more details of my case. Running with

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:1

will result in the processes sharing a single GPU

"PROCID=2: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=0: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=3: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"

This, I suppose, is correct.
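
(For reference, the PROCID outputs quoted in this issue come from a batch script along the following lines. This is only a sketch, since the exact script is not reproduced here; the srun/echo line is inferred from the output format and from the srun commands shown later in this thread.)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:1

# Each task reports its rank and the GPUs it can see.
srun bash -c 'echo "PROCID=$SLURM_PROCID: $(nvidia-smi -L)"'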

Running with

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1

will result in only a single process receiving a GPU

"PROCID=2: No devices found."
"PROCID=3: No devices found."
"PROCID=0: No devices found."
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2)"

Note that consecutive runs differ in which task gets the GPU and in the GPU's UUID

"PROCID=2: No devices found."
"PROCID=1: No devices found."
"PROCID=3: No devices found."
"PROCID=0: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"

However, in this case each process should receive its own GPU.

Running with

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:4

will result in each process having access to all 4 GPUs

"PROCID=3: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"
"PROCID=2: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"
"PROCID=0: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"

This also is, I suppose, correct.

Running with

#SBATCH --ntasks=4
#SBATCH --gres=gpu:4
#SBATCH --gpu-bind=single:1

will again result in only a single process receiving a GPU

"PROCID=1: No devices found."
"PROCID=0: No devices found."
"PROCID=3: No devices found."
"PROCID=2: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0)"

Again, each task should receive its own GPU in this case, but does not.

Running with verbose binding information

#SBATCH --ntasks=4
#SBATCH --gres=gpu:4
#SBATCH --gpu-bind=verbose,single:1

will give the following output

gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
gpu-bind: usable_gres=0x2; bit_alloc=0xF; local_inx=4; global_list=1; local_list=1
gpu-bind: usable_gres=0x4; bit_alloc=0xF; local_inx=4; global_list=2; local_list=2
gpu-bind: usable_gres=0x8; bit_alloc=0xF; local_inx=4; global_list=3; local_list=3
"PROCID=2: No devices found."
"PROCID=0: No devices found."
"PROCID=3: No devices found."
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2)"

I have found a related, 9-month-old Stack Overflow question, but apart from my post there is no response. Any ideas as to what could be causing this issue?

ajdecon commented Feb 3, 2022

I don't have any immediate insights into this, though I'll try to experiment with some parameters and see if I can duplicate the issue in the next few days.

However, because this is a Slurm-specific issue, I'd definitely recommend posting to the slurm-users mailing list. That's the most likely path to get insights from experts specifically in Slurm.

itzsimpl commented Feb 3, 2022

I will post there as well. Thank you for the hint.

FYI, since the initial post I have upgraded hwloc (now 2.7.0), pmix (now 4.1.0), and openmpi (now 4.1.2), and changed gres.conf so that it now reads:

Name=gpu File=/dev/nvidia0 Cores=48-63 Links=-1,0,0,0,0,0,0,0
Name=gpu File=/dev/nvidia1 Cores=48-63 Links=0,-1,0,0,0,0,0,0
Name=gpu File=/dev/nvidia2 Cores=16-31 Links=0,0,-1,0,0,0,0,0
Name=gpu File=/dev/nvidia3 Cores=16-31 Links=0,0,0,-1,0,0,0,0
Name=gpu File=/dev/nvidia4 Cores=112-127 Links=0,0,0,0,-1,0,0,0
Name=gpu File=/dev/nvidia5 Cores=112-127 Links=0,0,0,0,0,-1,0,0
Name=gpu File=/dev/nvidia6 Cores=80-95 Links=0,0,0,0,0,0,-1,0
Name=gpu File=/dev/nvidia7 Cores=80-95 Links=0,0,0,0,0,0,0,-1

I changed it because I had noticed errors of the form

error: xcpuinfo_abs_to_mac: failed
error: Invalid GRES data for gpu, CPUs=48-63,176-191

in slurmd.log. I even tested with AutoDetect=nvml in gres.conf, but that led to yet another type of error

gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `nvidia_a100-sxm4-40gb`. Setting system GRES type to NULL
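
As I read that message, none of my configured GRES records has a Type that is a substring of the detected device name nvidia_a100-sxm4-40gb. Purely as an illustration (not a configuration I have adopted), a gres.conf that should satisfy that check would add a matching Type to each record, for example:

AutoDetect=nvml
Name=gpu Type=a100 File=/dev/nvidia0 Cores=48-63
Name=gpu Type=a100 File=/dev/nvidia1 Cores=48-63
# ... one such line per device, with the node's Gres entry in slurm.conf adjusted to match (e.g. Gres=gpu:a100:8)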

The gres.conf listed above produces no errors in the log.

What has changed with the above-mentioned upgrades is the output for the case

#SBATCH --ntasks=4
#SBATCH --gres=gpu:4
#SBATCH --gpu-bind=verbose,single:1

which now gives the following output

slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 0. Binding to the first device in the allocation instead.
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 1. Binding to the first device in the allocation instead.
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 2. Binding to the first device in the allocation instead.
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 3. Binding to the first device in the allocation instead.
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 1. Binding to the first device in the allocation instead.
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 2. Binding to the first device in the allocation instead.
gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 3. Binding to the first device in the allocation instead.
gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 0. Binding to the first device in the allocation instead.
gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
"PROCID=3: CUDA_VISIBLE_DEVICES=0 GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=2: CUDA_VISIBLE_DEVICES=0 GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=0: CUDA_VISIBLE_DEVICES=0 GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=1: CUDA_VISIBLE_DEVICES=0 GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"

@itzsimpl (Contributor Author)

@ajdecon have you perhaps had a chance to check whether you can duplicate the issue? I have tried to post to the slurm-users mailing list, but it is no longer active; it is a read-only archive. I have posted to the Slurm Bugzilla, but so far I have not received any reply, since the bug is not backed by funding.

ajdecon commented Feb 22, 2022

@itzsimpl :

FWIW, while the Google group is read-only, the slurm-users group is still active and available via SchedMD's mailman interface: https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users

Unfortunately, I can't duplicate your issue in my local testing. Running Slurm 21.08.5 on my development cluster, I see this:

vagrant@virtual-login01:~$ srun --nodes=1 --ntasks=2 --gpus-per-task=1 nvidia-smi -L
GPU 0: Tesla P100-SXM2-16GB (UUID: GPU-b668a901-6b19-5c0c-554c-cce49d58af0a)
GPU 0: Tesla P100-SXM2-16GB (UUID: GPU-6fb6d3db-aafb-f8da-a089-89b8077580fe)

Note the different UUIDs, indicating that each task is getting its own GPU (but they show up as GPU 0 within each task's cgroup).
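
(To make the physical mapping explicit, the same check can also be run with the task rank echoed next to the device list, similar to the commands that appear later in this thread, for example:

srun --nodes=1 --ntasks=2 --gpus-per-task=1 bash -c 'echo "PROCID=$SLURM_PROCID: $(nvidia-smi -L)"'

which prints each rank alongside the UUID of the GPU it was bound to.)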

Similarly,

vagrant@virtual-login01:~$ srun --ntasks=2 --gres=gpu:2 --gpu-bind=verbose,single:1 nvidia-smi -L
gpu-bind: usable_gres=0x2; bit_alloc=0x3; local_inx=2; global_list=1; local_list=1
gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
GPU 0: Tesla P100-SXM2-16GB (UUID: GPU-6fb6d3db-aafb-f8da-a089-89b8077580fe)
GPU 0: Tesla P100-SXM2-16GB (UUID: GPU-b668a901-6b19-5c0c-554c-cce49d58af0a)

Some config files that may be relevant...

vagrant@virtual-gpu01:~$ cat /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

vagrant@virtual-gpu01:~$ cat /etc/slurm/cgroup.conf
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
#TaskAffinity=yes

vagrant@virtual-gpu01:~$ grep gpu /etc/slurm/slurm.conf
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=virtual-gpu01  Gres=gpu:2     CPUs=8 Sockets=8 CoresPerSocket=1 ThreadsPerCore=1 Procs=8 RealMemory=15211 State=UNKNOWN
NodeName=virtual-gpu02  Gres=gpu:2     CPUs=8 Sockets=8 CoresPerSocket=1 ThreadsPerCore=1 Procs=8 RealMemory=15211 State=UNKNOWN

vagrant@virtual-gpu01:~$ grep cgroup /etc/slurm/slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=affinity,cgroup
JobAcctGatherType=jobacct_gather/cgroup

And from slurmd.log:

[2022-02-22T19:25:51.350] slurmd started on Tue, 22 Feb 2022 19:25:51 +0000
[2022-02-22T19:25:51.662] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=16012 TmpDisk=126099 Uptime=3420 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2022-02-22T19:25:57.151] [5.extern] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=16012MB memsw.limit=unlimited
[2022-02-22T19:25:57.151] [5.extern] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=16012MB memsw.limit=unlimited
[2022-02-22T19:25:57.230] launch task StepId=5.0 request from UID:1000 GID:1000 HOST:10.0.0.5 PORT:39866
[2022-02-22T19:25:57.231] task/affinity: lllp_distribution: JobId=5 implicit auto binding: sockets,one_thread, dist 8192
[2022-02-22T19:25:57.231] task/affinity: _task_layout_lllp_block: _task_layout_lllp_block
[2022-02-22T19:25:57.231] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [5]: mask_cpu,one_thread, 0x01,0x02
[2022-02-22T19:25:57.315] [5.0] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=16012MB memsw.limit=unlimited
[2022-02-22T19:25:57.315] [5.0] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=16012MB memsw.limit=unlimited
[2022-02-22T19:25:57.472] [5.0] done with job
[2022-02-22T19:25:57.525] [5.extern] done with job

@itzsimpl (Contributor Author)

@ajdecon thank you for your help; in the end, the issue seems to have been caused by my misinterpretation of the parameter name and by gres.conf specifics.

The equivalent of --gpus-per-task is --gpu-bind=per_task, where each task is bound to the specified number of GPUs. According to the Slurm documentation, --gpu-bind=single is essentially a block distribution of tasks onto the available GPUs, where "available" is determined by the socket affinity of the task and the socket affinity of the GPUs as given in the Cores parameter of gres.conf. This is also why I was seeing the slurmstepd bind request error and you were not: my gres.conf contained the Cores column to specify affinity (copied from nvidia-smi topo -m). As soon as I remove that column, the error disappears and --gpu-bind=single becomes equivalent to --gpu-bind=per_task. Interestingly, the error is present only on a system with proper affinity (DGX-A100); on another dual-processor system, where all GPUs are bound to Cores 0-15 on NUMA node 0, there is no error.
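
Concretely, with the Cores column removed the gres.conf reduces to plain device entries along these lines (a sketch showing only the Name and File fields; the Links column from the file quoted above is omitted here for brevity):

Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=gpu File=/dev/nvidia4
Name=gpu File=/dev/nvidia5
Name=gpu File=/dev/nvidia6
Name=gpu File=/dev/nvidia7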

DGX-A100 with Cores specified in gres.conf:

$ srun --nodes=1 --tasks=2 --gres=gpu:2 --gpu-bind=verbose,single:1  nvidia-smi -L
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 0. Binding to the first device in the allocation instead.
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 1. Binding to the first device in the allocation instead.
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 0. Binding to the first device in the allocation instead.
gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
slurmstepd: error: Bind request gpu:verbose,single:1 does not specify any devices within the allocation for task 1. Binding to the first device in the allocation instead.
gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)

Without the Cores column:

$ srun --nodes=1 --tasks=2 --gres=gpu:2 --gpu-bind=verbose,single:1  nvidia-smi -L
gpu-bind: usable_gres=0x1; bit_alloc=0x3; local_inx=2; global_list=0; local_list=0
gpu-bind: usable_gres=0x2; bit_alloc=0x3; local_inx=2; global_list=1; local_list=1
PROC_ID=0 GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)
PROC_ID=1 GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2)

System with all GPUs at Cores=0-15:

$ srun --nodes=1 --tasks=2 --gres=gpu:2 --gpu-bind=verbose,single:1  nvidia-smi -L
gpu-bind: usable_gres=0x2; bit_alloc=0x5; local_inx=2; global_list=2; local_list=1
gpu-bind: usable_gres=0x1; bit_alloc=0x5; local_inx=2; global_list=0; local_list=0
PROC_ID=0 GPU 0: NVIDIA RTX A5000 (UUID: GPU-80b6f32b-92a5-8495-5438-993f0d99d14b)
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-7fd8c9f3-0360-7e15-cf0c-49d2fc79a046)

@itzsimpl (Contributor Author)

It's not solved after all; it was just hiding :( The problem seems to be related to running via a container (enroot v3.2.0, pyxis v0.11.1):

$ srun --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvcr.io#nvidia/pytorch:22.02-py3  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
pyxis: importing docker image ...
PROC_ID=0 No devices found.
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-43f1fb2e-11d8-30c4-1b4a-70bd77fc7e54)

vs no container

$ srun --nodes=1 --tasks=2 --gpus-per-task=1  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
PROC_ID=0 GPU 0: NVIDIA RTX A5000 (UUID: GPU-80b6f32b-92a5-8495-5438-993f0d99d14b)
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-43f1fb2e-11d8-30c4-1b4a-70bd77fc7e54)

itzsimpl reopened this Feb 26, 2022

@itzsimpl (Contributor Author)

Just for completeness, the issue is present even when running with --gres and --gpu-bind

$ srun --nodes=1 --tasks=2 --gres=gpu:2 --gpu-bind=verbose,per_task:1  --container-image nvcr.io#nvidia/pytorch:22.02-py3  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
gpu-bind: usable_gres=0x1; bit_alloc=0x5; local_inx=2; global_list=0; local_list=0
pyxis: importing docker image ...
gpu-bind: usable_gres=0x1; bit_alloc=0x5; local_inx=2; global_list=0; local_list=0
PROC_ID=0 No devices found.
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-7fd8c9f3-0360-7e15-cf0c-49d2fc79a046)

$ srun --nodes=1 --tasks=2 --gres=gpu:2 --gpu-bind=verbose,single:1  --container-image nvcr.io#nvidia/pytorch:22.02-py3  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
gpu-bind: usable_gres=0x2; bit_alloc=0x5; local_inx=2; global_list=2; local_list=1
gpu-bind: usable_gres=0x1; bit_alloc=0x5; local_inx=2; global_list=0; local_list=0
pyxis: importing docker image ...
PROC_ID=0 No devices found.
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-7fd8c9f3-0360-7e15-cf0c-49d2fc79a046)

@itzsimpl (Contributor Author)

The slurmd.log does not show any indication of a problem; when using a container it reads

[2022-02-26T15:03:00.612] [6333.extern] gres_job_state gres:gpu(7696487) type:RTX_A5000(2408090051) job:6333 flags:
[2022-02-26T15:03:00.612] [6333.extern]   total_gres:2
[2022-02-26T15:03:00.612] [6333.extern]   node_cnt:1
[2022-02-26T15:03:00.612] [6333.extern]   gres_cnt_node_alloc[0]:2
[2022-02-26T15:03:00.612] [6333.extern]   gres_bit_alloc[0]:0,2 of 4
[2022-02-26T15:03:00.661] [6333.extern] task/cgroup: _memcg_initialize: job: alloc=131072MB mem.limit=131072MB memsw.limit=unlimited
[2022-02-26T15:03:00.661] [6333.extern] task/cgroup: _memcg_initialize: step: alloc=131072MB mem.limit=131072MB memsw.limit=unlimited
[2022-02-26T15:03:00.718] launch task StepId=6333.0 request from UID:3000 GID:3000 HOST:10.0.0.5 PORT:51962
[2022-02-26T15:03:00.719] task/affinity: lllp_distribution: JobId=6333 auto binding off: mask_cpu
[2022-02-26T15:03:00.736] [6333.0] gres_job_state gres:gpu(7696487) type:RTX_A5000(2408090051) job:6333 flags:
[2022-02-26T15:03:00.736] [6333.0]   total_gres:2
[2022-02-26T15:03:00.736] [6333.0]   node_cnt:1
[2022-02-26T15:03:00.736] [6333.0]   gres_cnt_node_alloc[0]:2
[2022-02-26T15:03:00.736] [6333.0]   gres_bit_alloc[0]:0,2 of 4
[2022-02-26T15:03:00.736] [6333.0] gres:gpu type:(null)(0) StepId=6333.0 flags: state
[2022-02-26T15:03:00.736] [6333.0]   gres_bit_alloc[0]:0,2 of 4
[2022-02-26T15:03:01.431] [6333.0] task/cgroup: _memcg_initialize: job: alloc=131072MB mem.limit=131072MB memsw.limit=unlimited
[2022-02-26T15:03:01.432] [6333.0] task/cgroup: _memcg_initialize: step: alloc=131072MB mem.limit=131072MB memsw.limit=unlimited
[2022-02-26T15:03:01.481] [6333.0] pyxis: importing docker image ...
[2022-02-26T15:03:29.957] [6333.0] pyxis: creating container filesystem ...
[2022-02-26T15:03:38.801] [6333.0] pyxis: starting container ...
[2022-02-26T15:03:39.760] [6333.0] pyxis: removing container filesystem ...
[2022-02-26T15:03:39.827] [6333.extern] done with job
[2022-02-26T15:03:41.826] [6333.0] done with job

and without a container it reads

[2022-02-26T14:59:37.420] [6331.extern] gres_job_state gres:gpu(7696487) type:RTX_A5000(2408090051) job:6331 flags:
[2022-02-26T14:59:37.420] [6331.extern]   total_gres:2
[2022-02-26T14:59:37.420] [6331.extern]   node_cnt:1
[2022-02-26T14:59:37.420] [6331.extern]   gres_cnt_node_alloc[0]:2
[2022-02-26T14:59:37.420] [6331.extern]   gres_bit_alloc[0]:0,2 of 4
[2022-02-26T14:59:37.457] [6331.extern] task/cgroup: _memcg_initialize: job: alloc=32768MB mem.limit=32768MB memsw.limit=unlimited
[2022-02-26T14:59:37.457] [6331.extern] task/cgroup: _memcg_initialize: step: alloc=32768MB mem.limit=32768MB memsw.limit=unlimited
[2022-02-26T14:59:37.525] launch task StepId=6331.0 request from UID:3000 GID:3000 HOST:10.0.0.5 PORT:51268
[2022-02-26T14:59:37.525] task/affinity: lllp_distribution: JobId=6331 auto binding off: mask_cpu
[2022-02-26T14:59:37.543] [6331.0] gres_job_state gres:gpu(7696487) type:RTX_A5000(2408090051) job:6331 flags:
[2022-02-26T14:59:37.543] [6331.0]   total_gres:2
[2022-02-26T14:59:37.543] [6331.0]   node_cnt:1
[2022-02-26T14:59:37.543] [6331.0]   gres_cnt_node_alloc[0]:2
[2022-02-26T14:59:37.543] [6331.0]   gres_bit_alloc[0]:0,2 of 4
[2022-02-26T14:59:37.544] [6331.0] gres:gpu type:(null)(0) StepId=6331.0 flags: state
[2022-02-26T14:59:37.544] [6331.0]   gres_bit_alloc[0]:0,2 of 4
[2022-02-26T14:59:38.175] [6331.0] task/cgroup: _memcg_initialize: job: alloc=32768MB mem.limit=32768MB memsw.limit=unlimited
[2022-02-26T14:59:38.175] [6331.0] task/cgroup: _memcg_initialize: step: alloc=32768MB mem.limit=32768MB memsw.limit=unlimited
[2022-02-26T14:59:38.355] [6331.0] done with job
[2022-02-26T14:59:38.372] [6331.extern] done with job

ajdecon commented Mar 2, 2022

@itzsimpl : Based on reading through NVIDIA/pyxis#73, it looks like your issue is addressed by setting ENROOT_RESTRICT_DEV=n. And in any case, it's a Pyxis issue.
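
One way to apply that setting, assuming the stock enroot configuration location and its KEY VALUE format, is a line like the following in /etc/enroot/enroot.conf on the compute nodes (a sketch only; the exact place to set it depends on how enroot is deployed):

# do not restrict /dev inside the container, so every GPU granted to the task stays visible
ENROOT_RESTRICT_DEV n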

Are you good to close the DeepOps issue? Or do you think there's still something to address here?

itzsimpl commented Mar 3, 2022

Yes, we can close this issue. Thank you for helping me track it down to the source.

itzsimpl closed this as completed Mar 3, 2022