diff --git a/2-resources.ipynb b/2-resources.ipynb
index 68f0ed1..5868da8 100644
--- a/2-resources.ipynb
+++ b/2-resources.ipynb
@@ -42,10 +42,10 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Par exemple : [`scripts/mpi-allo.sh`](https://github.com/calculquebec/cip201-serveurs-calcul/blob/main/scripts/mpi-allo.sh)\n",
+ "For example: [`scripts/mpi-hello.sh`](https://github.com/calculquebec/cip201-compute-systems/blob/main/scripts/mpi-hello.sh)\n",
  "\n",
  "```Bash\n",
- "cat scripts/mpi-allo.sh\n",
+ "cat scripts/mpi-hello.sh\n",
  "```\n",
  "```\n",
  "#!/bin/bash\n",
@@ -58,8 +58,8 @@
  "mpirun printenv HOSTNAME OMPI_COMM_WORLD_RANK OMPI_COMM_WORLD_SIZE\n",
  "```\n",
  "\n",
- "Notre documentation à cet effet débute à la page :\n",
- "[Exécuter des tâches](https://docs.alliancecan.ca/wiki/Running_jobs/fr)"
+ "Our documentation about job scripts starts at this page:\n",
+ "[Running jobs](https://docs.alliancecan.ca/wiki/Running_jobs)"
  ]
  },
  {
@@ -146,14 +146,14 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Pour soumettre un script de tâche, on utilise la\n",
- "[commande `sbatch`](https://slurm.schedmd.com/sbatch.html) :\n",
+ "To submit a job script, we use the\n",
+ "[`sbatch` command](https://slurm.schedmd.com/sbatch.html):\n",
  "```Bash\n",
  "sbatch scripts/blastn-gen-seq.sh\n",
  "```\n",
  "\n",
- "Et pour voir l'état de la tâche, on utilise la\n",
- "[commande `squeue`](https://slurm.schedmd.com/squeue.html) :\n",
+ "And to monitor the status of a job, we use the\n",
+ "[`squeue` command](https://slurm.schedmd.com/squeue.html):\n",
  "```Bash\n",
  "squeue -u $USER # or 'sq'\n",
  "```"
@@ -164,9 +164,8 @@
  "metadata": {},
  "source": [
  "### Resources Used by a Completed Job\n",
- "Avec la [commande `sacct`](https://slurm.schedmd.com/sacct.html),\n",
- "on peut obtenir un tableau détaillé de nos tâches exécutées\n",
- "depuis minuit :\n",
+ "With the [`sacct` command](https://slurm.schedmd.com/sacct.html),\n",
+ "we can get a detailed table of completed jobs since midnight:\n",
  "```Bash\n",
  "sacct\n",
  "```"
@@ -176,14 +175,14 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Avec la [commande `seff`](https://docs.alliancecan.ca/wiki/Running_jobs/fr#T.C3.A2ches_termin.C3.A9es),\n",
- "on peut obtenir un court rapport d'exécution d'une tâche.\n",
- "Ce rapport inclut une mesure du temps écoulé, une mesure du temps CPU\n",
- "et une mesure de consommation maximale de la mémoire-vive.\n",
- "Des valeurs d'efficacité en pourcentages sont données pour les cycles CPU\n",
- "et la mémoire-vives en fonction des quantités réservées.\n",
+ "With the [`seff` command](https://docs.alliancecan.ca/wiki/Running_jobs#Completed_jobs),\n",
+ "we can get a short report about a single completed job.\n",
+ "This report includes the elapsed time, the total\n",
+ "CPU time and the maximum amount of memory used.\n",
+ "Two efficiency values are given as percentages: CPU usage and\n",
+ "peak memory usage, each relative to the requested amount.\n",
  "```Bash\n",
- "seff \n",
+ "seff \n",
  "```"
  ]
  },
@@ -192,18 +191,19 @@
  "metadata": {},
  "source": [
  "### Resources Used by a Running Job\n",
- "Étant donné un certain calcul matriciel dans le script Python \n",
- "[`scripts/crunch.py`](https://github.com/calculquebec/cip201-serveurs-calcul/blob/main/scripts/crunch.py) :\n",
+ "Given some operations on a 3D matrix in the Python script\n",
+ "[`scripts/crunch.py`](https://github.com/calculquebec/cip201-compute-systems/blob/main/scripts/crunch.py):\n",
  "\n",
  "```Bash\n",
  "cat scripts/crunch.py\n",
  "```\n",
  "\n",
- "Lors d'une tâche interactive, on peut utiliser `top` et `htop` pour surveiller les ressources utilisées :\n",
+ "While an interactive job is running, we can use the `top`\n",
+ "and `htop` commands to monitor resources being used:\n",
  "\n",
  "```Bash\n",
  "# Interactive job\n",
- "salloc --ntasks-per-node=4 --mem=8000M --time=0:15:0\n",
+ "salloc --cpus-per-task=4 --mem=8000M --time=0:15:0\n",
  "\n",
  "cat scripts/crunch.sh\n",
  "\n",
@@ -226,12 +226,12 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Si vous utilisez\n",
- "[JupyterHub](https://docs.alliancecan.ca/wiki/JupyterHub/fr)\n",
- "pour profiler votre programme, vous pouvez visualiser en temps\n",
- "réel la consommation de ressources dans l'onglet NV Dashboard:\n",
+ "If you use\n",
+ "[JupyterHub](https://docs.alliancecan.ca/wiki/JupyterHub)\n",
+ "to profile your code, you can view _Machine Resources_ usage\n",
+ "in real time in the _GPU Dashboards_ tab:\n",
  "\n",
- "![NV Dashboard CPU](images/nv-dashboard_cpu.png)"
+ "![GPU Dashboard for CPU](images/nv-dashboard_cpu.png)"
  ]
  },
  {
@@ -239,28 +239,26 @@
  "metadata": {},
  "source": [
  "#### **Exercise** - Checking Resources Used by a Running Job\n",
- "Pendant que vos tâches sont actives, vous pouvez vous connecter par\n",
- "SSH aux noeuds de calcul correspondants afin de valider que\n",
- "l'exécution se passe bien :\n",
+ "While your job is running, you are allowed to connect by SSH to\n",
+ "the corresponding compute node in order to monitor your processes:\n",
  "```Bash\n",
  "cat scripts/inv-mat.sh\n",
  "sbatch scripts/inv-mat.sh\n",
  "```\n",
  "\n",
- "Voici les étapes de validation (à adapter pour l'exercice) :\n",
- "* Identification du ou des noeud(s) avec : `squeue -u $USER`\n",
- "* Connexion avec : `ssh `\n",
- "* Inspection avec `top` et/ou `htop` :\n",
- " * Est-ce que vos processus s'exécutent avec un **pourcentage de 100%?**\n",
- " * Est-ce que vos processus parallèles s'exécutent avec un\n",
- " **pourcentage de $n$ * 100%**, où $n$ est le nombre de processeurs\n",
- " par tâche Slurm?\n",
- " * Est-ce que le **noeud de calcul** semble pleinement utilisé?\n",
- "* **Inspection des résultats**\n",
- " * Identifier tout problème, s'il y a lieu; trouver la cause\n",
- " * Corriger le code source, la compilation, le script ou les\n",
- " paramètres de la tâche de calcul\n",
- " * Relancer la tâche de calcul et refaire les précédentes étapes"
+ "Here are some general steps for job monitoring and validation:\n",
+ "* Identify on which node your job is running: `squeue -u $USER`\n",
+ "* Connect to that node with: `ssh `\n",
+ "* Monitor the job execution with `top` or `htop`:\n",
+ " * Are your processes running at **near 100%**?\n",
+ " * Are your parallel processes running at **near $n$ * 100%**,\n",
+ " where $n$ is the number of CPU cores reserved for the job?\n",
+ " * Does the **compute node** seem fully utilized?\n",
+ "* **Inspect results** in `time_inv.csv`\n",
+ " * Identify any problem and, if there is one, find its cause\n",
+ " * Correct the code, the compilation, the script\n",
+ " or the parameters used for the compute task\n",
+ " * Resubmit the compute job and redo the above validation steps"
  ]
  },
  {
@@ -270,15 +268,14 @@
  "### (Demo) Checking Resources Used by a Running GPU Job\n",
  "```Bash\n",
  "# Interactive job\n",
- "salloc --ntasks-per-node=4 --mem=8000M --time=0:15:0 --gres=gpu:1\n",
+ "salloc --cpus-per-task=4 --mem=8000M --time=0:15:0 --gres=gpu:1\n",
  "```\n",
  "\n",
- "* Pour Windows et Mac OS, il existe des outils propriétaires\n",
- " permettant de visualiser en temps réel l'utilisation du GPU.\n",
- " Veuillez vous référer au site Web du manufacturier de votre GPU\n",
- " pour les détails\n",
- "* Sous Linux, il y a d'abord la\n",
- " [commande `nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface)\n",
+ "* For Windows and Mac OS, you can install proprietary software\n",
+ " that allows real-time visualization of GPU utilization.\n",
+ " Please check the documentation of the GPU manufacturer for details\n",
+ "* On Linux, with an NVIDIA GPU, we first have the\n",
+ " [`nvidia-smi` command](https://developer.nvidia.com/nvidia-system-management-interface)\n",
  "\n",
  "```Bash\n",
  "nvidia-smi\n",