Chap 2 - Translation until nvidia-smi
plstonge committed Feb 26, 2024
1 parent 7049665 commit 7b1decb
Showing 1 changed file with 48 additions and 51 deletions.
99 changes: 48 additions & 51 deletions 2-resources.ipynb
@@ -42,10 +42,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Par exemple : [`scripts/mpi-allo.sh`](https://github.com/calculquebec/cip201-serveurs-calcul/blob/main/scripts/mpi-allo.sh)\n",
"For example: [`scripts/mpi-hello.sh`](https://github.com/calculquebec/cip201-compute-systems/blob/main/scripts/mpi-hello.sh)\n",
"\n",
"```Bash\n",
"cat scripts/mpi-allo.sh\n",
"cat scripts/mpi-hello.sh\n",
"```\n",
"```\n",
"#!/bin/bash\n",
@@ -58,8 +58,8 @@
"mpirun printenv HOSTNAME OMPI_COMM_WORLD_RANK OMPI_COMM_WORLD_SIZE\n",
"```\n",
"\n",
"Notre documentation à cet effet débute à la page :\n",
"[Exécuter des tâches](https://docs.alliancecan.ca/wiki/Running_jobs/fr)"
"Our documentation about job scripts starts at this page:\n",
"[Running jobs](https://docs.alliancecan.ca/wiki/Running_jobs)"
]
},
{
@@ -146,14 +146,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Pour soumettre un script de tâche, on utilise la\n",
"[commande `sbatch`](https://slurm.schedmd.com/sbatch.html) :\n",
"To submit a job script, we use the\n",
"[`sbatch` command](https://slurm.schedmd.com/sbatch.html):\n",
"```Bash\n",
"sbatch scripts/blastn-gen-seq.sh\n",
"```\n",
"\n",
"Et pour voir l'état de la tâche, on utilise la\n",
"[commande `squeue`](https://slurm.schedmd.com/squeue.html) :\n",
"And to monitor the status of a job, we use the\n",
"[`squeue` command](https://slurm.schedmd.com/squeue.html):\n",
"```Bash\n",
"squeue -u $USER # or 'sq'\n",
"```"
@@ -164,9 +164,8 @@
"metadata": {},
"source": [
"### Resources Used by a Completed Job\n",
"Avec la [commande `sacct`](https://slurm.schedmd.com/sacct.html),\n",
"on peut obtenir un tableau détaillé de nos tâches exécutées\n",
"depuis minuit :\n",
"With the [`sacct` command](https://slurm.schedmd.com/sacct.html),\n",
"we can get a detailed table of completed jobs since midnight:\n",
"```Bash\n",
"sacct\n",
"```"
@@ -176,14 +175,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Avec la [commande `seff`](https://docs.alliancecan.ca/wiki/Running_jobs/fr#T.C3.A2ches_termin.C3.A9es),\n",
"on peut obtenir un court rapport d'exécution d'une tâche.\n",
"Ce rapport inclut une mesure du temps écoulé, une mesure du temps CPU\n",
"et une mesure de consommation maximale de la mémoire-vive.\n",
"Des valeurs d'efficacité en pourcentages sont données pour les cycles CPU\n",
"et la mémoire-vives en fonction des quantités réservées.\n",
"With the [`seff` command](https://docs.alliancecan.ca/wiki/Running_jobs#Completed_jobs),\n",
"we can get a short report about a single completed job.\n",
"This report includes the elapsed time, the total\n",
"CPU time and the maximum amount of memory used.\n",
"Two efficiency values are given as percentages: CPU usage and\n",
"peak memory usage, each relative to the requested amount.\n",
"```Bash\n",
"seff <No_tâche>\n",
"seff <Job_ID>\n",
"```"
]
},
@@ -192,18 +191,19 @@
"metadata": {},
"source": [
"### Resources Used by a Running Job\n",
"Étant donné un certain calcul matriciel dans le script Python \n",
"[`scripts/crunch.py`](https://github.com/calculquebec/cip201-serveurs-calcul/blob/main/scripts/crunch.py) :\n",
"Given some operations on a 3D matrix in the Python script\n",
"[`scripts/crunch.py`](https://github.com/calculquebec/cip201-compute-systems/blob/main/scripts/crunch.py):\n",
"\n",
"```Bash\n",
"cat scripts/crunch.py\n",
"```\n",
"\n",
"Lors d'une tâche interactive, on peut utiliser `top` et `htop` pour surveiller les ressources utilisées :\n",
"While an interactive job is running, we can use the `top`\n",
"and `htop` commands to monitor resources being used:\n",
"\n",
"```Bash\n",
"# Interactive job\n",
"salloc --ntasks-per-node=4 --mem=8000M --time=0:15:0\n",
"salloc --cpus-per-task=4 --mem=8000M --time=0:15:0\n",
"\n",
"cat scripts/crunch.sh\n",
"\n",
@@ -226,41 +226,39 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Si vous utilisez\n",
"[JupyterHub](https://docs.alliancecan.ca/wiki/JupyterHub/fr)\n",
"pour profiler votre programme, vous pouvez visualiser en temps\n",
"réel la consommation de ressources dans l'onglet NV Dashboard:\n",
"If you use\n",
"[JupyterHub](https://docs.alliancecan.ca/wiki/JupyterHub)\n",
"to profile your code, you can view the use of the\n",
"_Machine Resources_ in real time in the _GPU Dashboards_ tab:\n",
"\n",
"![NV Dashboard CPU](images/nv-dashboard_cpu.png)"
"![GPU Dashboard for CPU](images/nv-dashboard_cpu.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Exercise** - Checking Resources Used by a Running Job\n",
"Pendant que vos tâches sont actives, vous pouvez vous connecter par\n",
"SSH aux noeuds de calcul correspondants afin de valider que\n",
"l'exécution se passe bien :\n",
"While your job is running, you are allowed to connect by SSH to\n",
"the corresponding compute node in order to monitor your processes:\n",
"```Bash\n",
"cat scripts/inv-mat.sh\n",
"sbatch scripts/inv-mat.sh\n",
"```\n",
"\n",
"Voici les étapes de validation (à adapter pour l'exercice) :\n",
"* Identification du ou des noeud(s) avec : `squeue -u $USER`\n",
"* Connexion avec : `ssh <nom_noeud>`\n",
"* Inspection avec `top` et/ou `htop` :\n",
" * Est-ce que vos processus s'exécutent avec un **pourcentage de 100%?**\n",
" * Est-ce que vos processus parallèles s'exécutent avec un\n",
" **pourcentage de $n$ * 100%**, où $n$ est le nombre de processeurs\n",
" par tâche Slurm?\n",
" * Est-ce que le **noeud de calcul** semble pleinement utilisé?\n",
"* **Inspection des résultats**\n",
" * Identifier tout problème, s'il y a lieu; trouver la cause\n",
" * Corriger le code source, la compilation, le script ou les\n",
" paramètres de la tâche de calcul\n",
" * Relancer la tâche de calcul et refaire les précédentes étapes"
"Here are some general steps for job monitoring and validation:\n",
"* Identify on which node your job is running: `squeue -u $USER`\n",
"* Connect to that node with: `ssh <node_name>`\n",
"* Monitor the job execution with `top` or `htop`:\n",
" * Are your processes running at **near 100%**?\n",
" * Are your parallel processes running at **near $n$ * 100%**,\n",
" where $n$ is the number of reserved CPU cores for the job?\n",
" * Does the **compute node** seem fully utilized?\n",
"* **Inspect results** in `time_inv.csv`\n",
" * Identify any problems and, if there are any, find their cause\n",
" * Correct the code, the compilation, the script\n",
" or the parameters used for the compute task\n",
" * Resubmit the compute job and redo the above validation steps"
]
},
{
@@ -270,15 +268,14 @@
"### (Demo) Checking Resources Used by a Running GPU Job\n",
"```Bash\n",
"# Interactive job\n",
"salloc --ntasks-per-node=4 --mem=8000M --time=0:15:0 --gres=gpu:1\n",
"salloc --cpus-per-task=4 --mem=8000M --time=0:15:0 --gres=gpu:1\n",
"```\n",
"\n",
"* Pour Windows et Mac OS, il existe des outils propriétaires\n",
" permettant de visualiser en temps réel l'utilisation du GPU.\n",
" Veuillez vous référer au site Web du manufacturier de votre GPU\n",
" pour les détails\n",
"* Sous Linux, il y a d'abord la\n",
" [commande `nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface)\n",
"* On Windows and macOS, you can install proprietary software\n",
" that shows GPU utilization in real time.\n",
" Please check your GPU manufacturer's documentation for details\n",
"* On Linux, with an NVIDIA GPU, we first have the\n",
" [`nvidia-smi` command](https://developer.nvidia.com/nvidia-system-management-interface)\n",
"\n",
"```Bash\n",
"nvidia-smi\n",
Expand Down
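
The translated `seff` cell in this diff describes two efficiency percentages, each relative to the requested amounts. The sketch below shows one way such percentages could be computed. It is an illustration only, not the `seff` implementation: the function names and the sample job figures are hypothetical, and it assumes CPU efficiency means total CPU time divided by elapsed time times reserved cores.

```python
def cpu_efficiency(cpu_seconds: float, elapsed_seconds: float, cores: int) -> float:
    """Percentage of the reserved core-time that was spent computing.

    Assumption: reserved core-time = elapsed wall-clock time * reserved cores.
    """
    reserved_core_seconds = elapsed_seconds * cores
    if reserved_core_seconds <= 0:
        raise ValueError("elapsed time and core count must be positive")
    return 100.0 * cpu_seconds / reserved_core_seconds


def memory_efficiency(max_used_mb: float, requested_mb: float) -> float:
    """Percentage of the requested memory that the job used at its peak."""
    if requested_mb <= 0:
        raise ValueError("requested memory must be positive")
    return 100.0 * max_used_mb / requested_mb


# Hypothetical job: 4 cores reserved for 10 minutes of wall-clock time,
# 30 minutes of total CPU time, 2000M peak memory out of 8000M requested.
print(f"CPU efficiency: {cpu_efficiency(1800, 600, 4):.1f}%")
print(f"Memory efficiency: {memory_efficiency(2000, 8000):.1f}%")
```

A CPU efficiency well below 100% on a multi-core reservation is consistent with the `top`/`htop` check in the exercise, where parallel processes are expected to run at near $n$ * 100%.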
