Before you can run a job, a suitable environment has to be prepared.
It is highly recommended to run your job inside a Docker container. You can find more information on the Docker tutorial page.
Before running your jobs, you should also check the status of the system. Doing so helps answer the following questions:
* How many resources should/can I use for my job?
* Which GPU should I run my job on?
* Can I run my job on the HPC right now?
The following commands are useful for monitoring the status of the system and of the running jobs: `nvidia-smi`, `htop`, and `df -lh`.
The `nvidia-smi` command shows the current usage of the GPUs as well as the jobs running on each GPU. Below is an example of the output of `nvidia-smi`:
From this output, we can see which GPU has enough free memory to run a job. In the example, only GPU 2 has about 4000 MB of free memory.
{% hint style="danger" %} Do not start a new job on a GPU whose memory is already full; otherwise, the jobs already running on that GPU might be affected or stopped. {% endhint %}
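The check described above can also be scripted. The following is a minimal sketch, not an official tool: it assumes `nvidia-smi` is on the `PATH` and uses its standard `--query-gpu` and `--format` options. The function names and the sample values in the comment are made up for illustration.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU: "<index>, <free memory in MiB>".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_free_memory(csv_text):
    """Return {gpu_index: free_MiB} parsed from the CSV text."""
    free = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, free_mib = (int(field) for field in line.split(","))
        free[index] = free_mib
    return free


def gpus_with_free_memory(required_mib, csv_text=None):
    """List the GPU indices that have at least `required_mib` MiB free.

    If `csv_text` is None, nvidia-smi is called; this only works on a
    machine where the NVIDIA driver is installed.
    """
    if csv_text is None:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        csv_text = result.stdout
    free = parse_free_memory(csv_text)
    return [index for index, mib in sorted(free.items()) if mib >= required_mib]


# Hypothetical output for a 4-GPU machine (values invented for this sketch):
#   0, 120
#   1, 35
#   2, 4012
#   3, 8
```

On the machine itself, calling `gpus_with_free_memory(4000)` returns the indices of all GPUs that currently have at least 4000 MiB free. The result is only a snapshot, since another user may start a job right after the check.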
The `htop` command shows all of the processes running on the machine as well as the usage of CPU and system memory:
For more details about the `htop` command, please refer to the official website.
The `df -lh` command helps you check the usage of the disks:
{% hint style="danger" %}
If the disk mounted on `/` or `/raid` is full, it will not be possible to run jobs or commands on the machine. Be sure to regularly remove data you no longer need or move unused data to the backup disk.
{% endhint %}
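If you prefer to check disk usage from a script, the Python standard library offers `shutil.disk_usage`. The sketch below is only an illustration: the function names and the 90% threshold are assumptions, and the percentage can differ slightly from the `Use%` column of `df` because `df` accounts for reserved blocks.

```python
import shutil


def disk_usage_percent(path):
    """Return the used fraction of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a size of zero
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold_percent=90.0):
    """Return the paths whose usage is at or above the threshold."""
    return [p for p in paths if disk_usage_percent(p) >= threshold_percent]
```

For example, `nearly_full(["/", "/raid"])` returns the mount points that are at least 90% full. A path that does not exist on the machine raises `FileNotFoundError`.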
For GPU jobs, the bottleneck is usually the GPU memory. If the GPU does not have enough memory to run the job, an error such as `RuntimeError: CUDA error: out of memory` will be raised.
Since each GPU has 16 GB of memory on the DGX and 32 GB on the DLS, a single GPU is enough in most cases.
For example, you can select the GPU for your Python job with the following command:
CUDA_VISIBLE_DEVICES=N python XXXX.py
Here, `N` is the number of the GPU on which `XXXX.py` should run (`N` can be 0 to 3 on the DGX and 0 to 7 on the DLS). Before running your job, check the status of the GPUs with `nvidia-smi` to decide on a value for `N`.
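The same selection can be made when launching a job from another Python script. The sketch below is one possible approach, with hypothetical helper names: it copies the current environment, sets `CUDA_VISIBLE_DEVICES`, and passes the result to the child process.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU."""
    if gpu_index < 0:
        raise ValueError("GPU index must be non-negative")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(script, gpu_index):
    """Run `python <script>` so that it sees only the chosen GPU."""
    return subprocess.run(
        [sys.executable, script],
        env=env_for_gpu(gpu_index),
        check=True,
    )
```

Inside the job, the selected GPU then appears as device 0, because CUDA renumbers the visible devices starting from zero.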
After starting a job, you can use `nvidia-smi` and `htop` to check the GPU and CPU status of the job.
If `nvidia-smi` shows that your GPU job uses very little of the GPU's compute capacity (e.g. `Volatile GPU-Util` stays close to 0% all the time), there may be issues in your code.
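A single look at `nvidia-smi` can be misleading, so it may help to sample the utilization over a short period. The following is a rough sketch under stated assumptions: the function names, the 5% threshold, and the 90% idle fraction are arbitrary choices, and `utilization.gpu` is reported per GPU, so it also includes other jobs sharing the same device.

```python
import subprocess
import time


def sample_gpu_util(gpu_index):
    """Read the current utilization (in percent) of one GPU via nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            "--query-gpu=utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def mostly_idle(samples, threshold_percent=5, idle_fraction=0.9):
    """True if at least `idle_fraction` of the samples are below the threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    idle = sum(1 for s in samples if s < threshold_percent)
    return idle / len(samples) >= idle_fraction


def watch(gpu_index, count=30, interval_s=2.0):
    """Sample one GPU `count` times and report whether it was mostly idle."""
    samples = []
    for _ in range(count):
        samples.append(sample_gpu_util(gpu_index))
        time.sleep(interval_s)
    return mostly_idle(samples)
```

If `watch(N)` returns `True` while your job is the only one on GPU `N`, the job is probably spending most of its time outside the GPU, for example on data loading or on CPU-only code.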