Both Slurm and fw-cast must be configured appropriately to enable GPU execution of gears on a Slurm cluster. Your system administrator will most likely manage these settings. Below are examples taken from a working configuration; if you do not see settings like these on the nodes of the Slurm cluster, it is likely not set up for GPU execution.

Detailed instructions for configuring Slurm, including a Slurm Configuration Tool, can be found at https://slurm.schedmd.com/.
The `slurm.conf` file is typically found in `/etc/slurm/`. Below is an example of a node definition that enables GPUs to be scheduled:
```
NodeName=scien-hpc-gpu Gres=gpu:tesla:1 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=14978
```
The Generic RESource (GRES) flag (`Gres=gpu:tesla:1`) must be present to indicate the resource type (e.g. "gpu"), the resource class (e.g. "tesla"), and the number of resources present ("1"). Execution on more than one GPU per node has not yet been explored.
If desired, the remainder of the node configuration (e.g. CPUs, RealMemory) can be obtained by running the following command on the node:
```
slurmd -C
```
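After updating `slurm.conf` and restarting the Slurm daemons, you can confirm that the scheduler sees the GPU resource. This is a minimal check using standard Slurm commands; the node name below matches the example above, so substitute your own:

```bash
# Show the configured generic resources (Gres) for the example node.
scontrol show node scien-hpc-gpu | grep -i gres

# Or list node names alongside their GRES for the whole cluster.
sinfo -o "%N %G"
```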
The Generic RESource (GRES) configuration, `gres.conf`, needs to have an entry for each resource named in `slurm.conf`:
```
NodeName=scien-hpc-gpu Name=gpu Type=tesla File=/dev/nvidia0
```
Here `File=/dev/nvidia0` is a reference to the device file through which the GPU is exposed.
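To confirm the device path before adding it to `gres.conf`, you can check for the NVIDIA device files directly on the GPU node. This is a quick sanity check, not part of the fw-cast setup itself:

```bash
# Confirm the device file referenced in gres.conf exists on the GPU node.
ls -l /dev/nvidia*

# List the GPUs the NVIDIA driver reports (requires nvidia-smi on the node).
nvidia-smi -L
```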
It is recommended that you replace the `script` section of your `settings/cast.yml` file with the `script` section of the `examples/settings/gpu_cast.yml` file. This is also shown below.
```yml
script: |+
  #!/bin/bash
  #SBATCH --job-name=fw-{{job.fw_id}}
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task={{job.cpu}}
  #SBATCH --mem-per-cpu={{job.ram}}
  {% if job.gpu %}#SBATCH --gpus-per-node={{job.gpu}}{% endif %}
  #SBATCH --output {{script_log_path}}
  set -euo pipefail
  source "{{cast_path}}/settings/credentials.sh"
  cd "{{engine_run_path}}"
  set -x
  srun ./engine run --single-job {{job.fw_id}}
```
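For reference, the template above might render to something like the following for a single-GPU job. The job ID, resource values, and paths are purely illustrative, not values generated by fw-cast:

```bash
#!/bin/bash
#SBATCH --job-name=fw-1234                      # from {{job.fw_id}}
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4                       # from {{job.cpu}}
#SBATCH --mem-per-cpu=4096                      # from {{job.ram}}
#SBATCH --gpus-per-node=1                       # emitted because job.gpu was set
#SBATCH --output /opt/fw-cast/logs/fw-1234.log  # from {{script_log_path}}
set -euo pipefail
source "/opt/fw-cast/settings/credentials.sh"   # from {{cast_path}}
cd "/opt/fw-cast/engine"                        # from {{engine_run_path}}
set -x
srun ./engine run --single-job 1234
```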
With the rest of the workflow configured, adding a `gpu` tag (in addition to the `hpc` tag) to the launch of the gear will schedule a GPU to execute the gear on the Slurm cluster.
- Without the `gpu` tag present on gear launch, any node meeting the criteria will be scheduled.
- If the `cast.yml` does not have the line with `--gpus-per-node`, only CPU nodes will be scheduled.
- If GPU nodes are not available on the cluster, the job will be put in a waiting state until one becomes available (a way to check this is sketched below).
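If a job seems stuck waiting, one way to confirm whether the cluster can satisfy a GPU request at all, independent of fw-cast, is to submit a trivial GPU job and watch its state. The commands below are a minimal sketch using standard Slurm tools; the job name is arbitrary:

```bash
# Submit a minimal one-GPU test job that just reports the GPU it was given.
cat <<'EOF' > gpu-test.sbatch
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --ntasks=1
#SBATCH --gpus-per-node=1
srun nvidia-smi -L
EOF
sbatch gpu-test.sbatch

# A PENDING state with reason "Resources" means no GPU node is currently free.
squeue --name=gpu-test -o "%.10i %.12j %.8T %R"
```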