---
title: "psych-hpc Workshop (2024)"
format:
  revealjs:
    embed-resources: true
    width: 1280
  gfm:
    toc: false
    output-file: "README"
    output-ext: "md"
---
## psych-hpc Workshop
### October 2024
This workshop will get you started with High-Performance Computing (HPC) using Klone, part of the University of Washington's Hyak supercomputing cluster.
You will learn to:
- Log in to Klone
- Use a compute node interactively
- Run a batch job to convert hundreds of brain imaging files in parallel
## More resources
- <https://uw-psych.github.io/compute_docs>
- <https://hyak.uw.edu/docs>
- <https://www.hpc-carpentry.org>
- UW Research Computing Club: <http://depts.washington.edu/uwrcc/>
## What Is High-Performance Computing (HPC)?
::::: columns
::: {.column width="50%"}
- High-performance computing (HPC) combines computing resources to run computations and process data at a very high rate
- An HPC **cluster** is a group of computers configured to work together for these tasks
- Tasks run in **parallel** – multiple computations at the same time
:::
::: {.column width="50%"}
![](images/parallelism.png)
:::
:::::
## Why use HPC?
To get things done faster! (and maybe more cheaply)
- A typical laptop usually has 2–16 CPU cores and 4–64 GB RAM 💻
- HPC clusters often have thousands of CPU cores and thousands of GB of RAM 🍇
- Many clusters today have graphics processing units (GPUs) that can provide orders-of-magnitude speedups 🏎️
- HPC systems let many parties pool their resources to access more powerful computing than any one of them could afford alone 💪
## HPC at UW
::::: columns
::: {.column width="50%"}
- *Hyak*[^1] is the name of the cluster project at the University of Washington
- The *Klone*[^2] cluster represents the current (3rd) generation of Hyak
- Departments purchase *slices* 🍕 of hardware that become part of the cluster
:::
::: {.column width="50%"}
![](images/klone-irl.png){width="512"}
:::
:::::
[^1]: "fast" in the local Chinook trade language
[^2]: "three" in the local Chinook trade language
------------------------------------------------------------------------
## HPC at UW Psych
UW Psychology researchers have access to the `psych` account with the `cpu-g2-mem2x` partition.
- 32 CPU cores 🍏
- 490 GB RAM 🐏
The partition is very new; hopefully more CPUs and GPUs[^3] will be added in the near future.
[^3]: very high demand and cost due to AI boom 💸🚀📈
------------------------------------------------------------------------
## Getting on Klone
Before you can log in to Klone, you will need:
- A UW NetID 🪪
- An SSH client (e.g. PuTTY, MobaXterm, or the terminal on Mac/Linux) on your laptop 💻 (see <https://uw-psych.github.io/compute_docs/docs/start/connect-ssh.html>)
## Logging In
Open your terminal program or SSH client and connect to `klone.hyak.uw.edu` with your UW NetID as the username.
On macOS Terminal.app, Windows WSL2, git bash, etc., type:
``` bash
ssh your-uw-netid@klone.hyak.uw.edu
```
You will then be required to authenticate.
::: callout-caution
Type your password carefully! You will be LOCKED OUT 🔐 of Klone for a short time if you enter it wrong 3 times.
:::
## The file system
The file system on Hyak is organized as a hierarchy:
- `/` - the root directory
- `/mmfs1` - the main user file system for Hyak
- `/mmfs1/home` - the root of the home directory for all users
- `/mmfs1/home/your-uw-netid` - your home directory (only 10 GB!)
- `/gscratch` - data directory for Hyak users (aka `/mmfs1/gscratch`)
- `/gscratch/scrubbed` - a directory for temporary files that are periodically deleted
## The file system
![](images/klone-fs.png)
## Your home directory (`~`)
Once you have logged in, you will be in your **home directory** on the Hyak cluster.
- Analogous to `C:\Users\You` on Windows or `/Users/You` on macOS
- Stored under `/mmfs1/home/your-uw-netid`
- Can use `~` for short in the command prompt
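For example, to see where `~` points:
``` bash
echo ~   # prints /mmfs1/home/your-uw-netid
```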
## Listing files
List the contents of your `~` by typing `ls` into the command prompt and pressing `Enter`:
``` bash
ls
```
Move to another directory with `cd`, e.g.:
``` bash
cd /gscratch # [C]hange [d]irectory to /gscratch
ls # List contents
cd ~ # Move back to home
pwd # Display current directory
```
## Storage limits
- Your home directory is limited to 10 GB. Do not store large files here! Use the `/gscratch` directory instead.
- Use the `hyakstorage` command to see how much space you have available.
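For example:
``` bash
hyakstorage   # show how much storage you have used and have left
```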
![](images/hyakstorage.png)
## The shell
- `bash` is the command-line interface we have been using
- Both an interpreter and a programming language
- Use interactively to run commands
- Write scripts to automate tasks
## bash
### Comments
In `bash`, anything after `#` is a comment (like Python, R).
### Environment variables
**Environment variables** help pass inputs to a script. Set environment variables with:
``` bash
# Use quotes, and do not put spaces around '=':
VARIABLE="Something like this"
# Print the value to the screen:
echo "$VARIABLE"
# More precise syntax - helps avoid some wild issues:
echo "${VARIABLE}"
# Make VARIABLE available to subsequent external commands:
export VARIABLE
```
## bash: History and Completion
- Use `↑` and `↓` keys to recall the text of the commands you have run before
- Use `Tab` to complete file names, commands etc.
- e.g., `cd /gscr` + `Tab` → `cd /gscratch`
## Editing files
Use a text editor to edit text files (scripts, etc.)
::::: columns
::: {.column width="40%"}
- `nano` is a good one if you've never tried one before
:::
::: {.column width="60%"}
![](images/nano.png)
:::
:::::
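To open (or create) a file in `nano`:
``` bash
nano notes.txt   # the file name here is just an example
```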
::: notes
- To start `nano`, type in the command `nano`. You can then start typing text into the editor.
- To save your work, press `Ctrl` + `O`, type in a file name, and press `Enter`.
- To exit `nano`, press `Ctrl` + `X`.
:::
## Viewing files
Use:
- `cat` to view short files
- `more` to view long files
- ...or `less`
``` bash
cat /etc/os-release # A short file
more /etc/slurm.conf # A long file
less /etc/slurm.conf # Another way for a long file
```
## Getting help with commands
- `man` displays the manual page ("manpage") for a command
- `man ls` - displays the manual page for the `ls` command.
- "manpages" tend to be exhaustive and overwhelming!
- Many commands also accept `--help` for a shorter, more user-friendly help message
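For example:
``` bash
man ls      # full manual page for ls (press "q" to exit)
ls --help   # shorter summary of the available options
```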
## `tldr` for quick command reference
`tldr`: A supplement to `man` pages providing practical examples:
``` bash
pip3.9 install --user tldr
tldr ls
```
![](images/tldr.png)
## `ranger`
`ranger` is an easy-to-use program to navigate the file system:
``` bash
pip3.9 install --user ranger-fm
```
::::: columns
::: {.column width="30%"}
- Navigate using arrow keys
- Open files with `Enter`
- Exit with `q`
:::
::: {.column width="70%"}
![](images/ranger-fm.png)
:::
:::::
## SLURM
- Jobs: programs/scripts you want to run + resources allocated for them
- Jobs on Hyak are scheduled using the SLURM **workload manager**
- Specify the resources you need when submitting a job
- The job runs on the cluster when SLURM determines that enough resources are available
::: notes
Jobs are programs or scripts that you want to run on the cluster. You submit jobs to SLURM, and it schedules and runs them on the cluster when resources are available. Resource allocation depends on the amount of resources you request, the resources available on the cluster, and the resources available to the SLURM account you are using.
:::
## Login & Compute Nodes
- The **login node** is the computer you are connected to after running `ssh`
- Use the login node for submitting and managing jobs, minor tasks like editing a script or copying a handful of files
- Do **not** use it to run your computations
- **Compute nodes** are where your jobs will run
- The scheduler will allocate resources on the compute nodes to your jobs
- Jobs can be run in parallel on multiple compute nodes
## Resource availability
The main resources you will be concerned with are:
- **CPUs** - the number of CPU cores you can use
- **Memory** - the amount of RAM you can use
- **GPUs** - the number of GPUs you can use
## hyakalloc
- `hyakalloc` shows the resources available to you across all the nodes on the cluster
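Run it from the login node:
``` bash
hyakalloc   # show the resources available to your accounts and partitions
```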
::::: columns
::: {.column width="40%"}
- A job that runs on a single node must wait until all the resources it needs are free on one node at the same time
:::
::: {.column width="60%"}
![](images/hyakalloc.png)
:::
:::::
## The queue
- Jobs are submitted to a **queue** in SLURM
- Use `squeue` to see the jobs in the queue that are running or waiting to run
``` bash
squeue
```
- Use `squeue --me` to see only your jobs in the queue
``` bash
squeue --me
```
## Interactive session
- An **interactive session** is a way to get access to a compute node for a short period of time
- Use an interactive session to test your code, run small jobs, or debug problems
- Use the `salloc` command to start an interactive session
## Launching an interactive session
To launch a job, you will need to specify the resources you need, the account to charge them to, and the partition (a group of resources) to run the job on.
For a session using the `psych` account, the `cpu-g2-mem2x` partition, 1 hour of time, 1G of memory, 1 CPU:
``` bash
salloc \
--account psych \
--partition cpu-g2-mem2x \
--time 1:00:00 \
--mem 1G \
--cpus-per-task 1
```
You may have to wait for resources to become available -- use `squeue` to check the status of your request.
## Running commands in an interactive session
When your interactive session starts, you will be given a prompt on a compute node where you can run commands and test your code. For example, you can run the `hostname` command to see the name of the compute node you are on:
``` bash
hostname
```
Any commands you run in the interactive session will be run on the compute node you are on and will not affect the login node.
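When you are finished, end the session so the resources are released:
``` bash
exit   # leave the compute node shell; the allocation ends and you return to the login node
```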
## Loading software
Several methods exist to install and load software on Klone. Chiefly:
- Modules (via Lmod) -- easiest to use, but harder to get new software installed
- Containers (via Apptainer, which can also run Docker containers) -- recommended for reproducibility (see the sketch below)
- Conda -- mostly for Python, but can be used for other languages. Doesn't perform well on Klone.
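As a hedged example of the container route (assuming Apptainer is available in your session; the exact setup on Klone may differ), a minimal sketch:
``` bash
# Run a command inside a Docker image through Apptainer (the image here is only an example):
apptainer exec docker://ubuntu:22.04 cat /etc/os-release
```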
## Modules
To load software installed on Hyak, use `module load`
e.g.,
``` bash
module load escience/gdu # Load the GDU disk usage visualizer
gdu # Run GDU -- press "q" to exit
```
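Lmod also provides commands to discover available modules, e.g.:
``` bash
module avail        # list modules that can be loaded right now
module spider gdu   # search the full module tree for "gdu"
```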
## Questions?
Do you have any questions about what we've covered so far?
------------------------------------------------------------------------
## Running an example batch script
Now we will try to orchestrate a data processing task in parallel.
The objective of this task will be to convert several directories of DICOM brain imaging files (`.dcm`) into NIfTI (`.nii`) format. This is a common type of task that can take quite some time, but it can be sped up considerably by running in parallel.
## Input files
The data files are located under 8 directories (numbered 0–7) in `/gscratch/psych/hpc-workshop-01/datafiles`.
Each directory contains up to 100 data files. For this example, we will use 4 of them.
``` bash
ls /gscratch/psych/hpc-workshop-01/datafiles
# 0 1 2 3 4 5 6 7
ls /gscratch/psych/hpc-workshop-01/datafiles/0
# I0.dcm I14.dcm I19.dcm I23.dcm I28.dcm
```
## Setup
Create a new directory for your output under `/gscratch/scrubbed/INSERT_YOUR_UW_NETID_HERE`
``` sh
mkdir -pv /gscratch/scrubbed/INSERT_YOUR_UW_NETID_HERE
# Copy the batch script to the created directory:
cp -v /gscratch/psych/hpc-workshop-01/dcm2niix.slurm /gscratch/scrubbed/INSERT_YOUR_UW_NETID_HERE
# Go to the new directory:
cd /gscratch/scrubbed/INSERT_YOUR_UW_NETID_HERE
# List the contents:
ls
```
## Batch jobs
- `sbatch` command submits a batch job to SLURM
- Commands to run the job are specified in a job script
- Job script specifies the resources to request, the commands to run, and the environment variables to set
## The job script
Have a look at the job script in `dcm2niix.slurm`. We will submit this to `sbatch`, which will schedule and launch our tasks.
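You can view it with `less` (or `cat`) from the directory you copied it to:
``` bash
less dcm2niix.slurm   # press "q" to exit
```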
## Job metadata and resources
The following lines define the job's metadata and the resources to request. These parameters are set with `#SBATCH` directives:
``` sh
#SBATCH --account=psych
#SBATCH --partition=cpu-g2-mem2x
#SBATCH --job-name=hpc-dcm2niix
#SBATCH --mem=8G
#SBATCH --time=1:00:00
#SBATCH --array=0-3
```
The `--array` argument tells `sbatch` that we want to run an **array task** whose elements execute in parallel. This lets us process several directories at once. Here, we are processing directories 0, 1, 2, and 3, so each of the following directories gets its own array task:
```
/mmfs1/gscratch/psych/hpc-workshop-01/datafiles/0
/mmfs1/gscratch/psych/hpc-workshop-01/datafiles/1
/mmfs1/gscratch/psych/hpc-workshop-01/datafiles/2
/mmfs1/gscratch/psych/hpc-workshop-01/datafiles/3
```
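As a rough sketch of how the script body might use the array index (the actual `dcm2niix.slurm` on the cluster may differ, for example in how it makes `dcm2niix` available and where it writes output):
``` bash
# Each array task gets its own value of SLURM_ARRAY_TASK_ID (0, 1, 2, or 3 here):
INPUT_DIR="/gscratch/psych/hpc-workshop-01/datafiles/${SLURM_ARRAY_TASK_ID}"
OUTPUT_DIR="${PWD}/hpc-workshop-01-output/${SLURM_ARRAY_TASK_ID}"
mkdir -p "${OUTPUT_DIR}"

# Convert this task's directory of DICOM files to NIfTI
# (assumes dcm2niix is on the PATH, e.g. via a module or container):
dcm2niix -o "${OUTPUT_DIR}" "${INPUT_DIR}"
```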
## Running the job
Here, we submit the job; SLURM launches one array task per directory we want to process:
``` bash
sbatch dcm2niix.slurm # Launch the task
```
To monitor the job, run:
``` bash
squeue --me
```
To monitor the output from the script, run:
``` bash
tail -f *.out
```
Press `Ctrl` + `C` to stop following the output.
## Getting the results
Use `scp` to copy the results from the cluster to your local machine. If you run the following on your own machine, `scp` will copy the output directory on the cluster to the current working directory on your machine:
``` bash
scp -r your-uw-netid@klone.hyak.uw.edu:/gscratch/scrubbed/your-uw-netid/hpc-workshop-01-output .
```
## Q&A
Questions?