Commit

35: add TPS slides and something on local storage
blackwer committed Oct 1, 2024
1 parent 1f313c7 commit 42fef05
Showing 1 changed file with 31 additions and 0 deletions.
31 changes: 31 additions & 0 deletions 35_IntroToHPC/main.md
@@ -102,6 +102,12 @@ Activities where participants all actively work to foster an environment which e
</div>


### Which statement is generally true about what we mean by a "cluster"?
- A. One very powerful computer
- B. Many geographically dispersed computers connected via the internet
- C. A collection of compute nodes, each with CPUs and lots of RAM, linked together by some kind of network


### Network/fabric
- Network/fabric - the means of communication between nodes
- Communication lines usually fiber/copper/wireless
@@ -112,6 +118,11 @@ Activities where participants all actively work to foster an environment which e
- Ethernet -- 0.1ms -- \~1-40 Gbit/s -- network
- Infiniband -- 0.001ms -- \~100-800 Gbit/s -- fabric
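
To see why both numbers matter, a rough model of the time to send one message is latency plus size divided by bandwidth. The sketch below plugs in the approximate figures from the list above (taking 40 Gbit/s for Ethernet and 100 Gbit/s for Infiniband). The function name `transfer_time` and the message sizes are illustrative assumptions, and real networks add protocol overhead that this ignores.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bits_per_s):
    """Rough time to send one message: fixed latency plus size / bandwidth."""
    return latency_s + (size_bytes * 8) / bandwidth_bits_per_s


# Approximate figures from the slide (upper Ethernet / lower Infiniband bandwidth).
LINKS = {
    "ethernet": {"latency_s": 0.1e-3, "bandwidth_bits_per_s": 40e9},
    "infiniband": {"latency_s": 0.001e-3, "bandwidth_bits_per_s": 100e9},
}

if __name__ == "__main__":
    # Illustrative message sizes: one double, 1 MB, 1 GB.
    for size in (8, 1_000_000, 1_000_000_000):
        for name, link in LINKS.items():
            t = transfer_time(size, **link)
            print(f"{size:>13,d} bytes over {name:<10s}: {t * 1e3:.4f} ms")
```

For the 8-byte message the time is essentially the latency, so the gap between the two links is close to the gap in their latencies. For the 1 GB message the bandwidth term dominates instead.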

### Which statement is false?
- A. Latency is the time between sending and receiving messages
- B. Bandwidth is the rate at which messages can be sent
- C. Infiniband fabric has relatively 'high' latency and 'low' bandwidth


### Compute nodes
<div style="display: flex;">
@@ -137,6 +148,12 @@ Activities where participants all actively work to foster an environment which e
- Cores typically slower than laptop/workstation cores, but more of them and more cache/RAM


### Which statement is true about nodes and cores?
- A. There is one node per supercomputer
- B. Each node has multiple CPU cores
- C. Cores in supercomputers are typically faster than laptop cores and have less RAM


### Compute node architecture -- `lstopo`
- Cores also sometimes have extra groupings in `NUMA` (non-uniform memory access) domains
- Tells what hardware has direct access to what memory
@@ -205,6 +222,12 @@ Activities where participants all actively work to foster an environment which e
- A handful of GPUs for special purposes


### Rusty/popeye storage -- local
- All worker nodes have fast `NVMe` storage local to the machine
- Usually about 2 terabytes in the `/tmp` path
- Automatically deleted at job completion!
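
One pattern that follows from this is to do heavy I/O in node-local scratch and copy only the results you need to persistent storage before the job ends. This is a minimal sketch using Python's standard `tempfile` and `shutil` modules. The function name `run_in_local_scratch`, the file name `result.txt`, and the destination directory are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_local_scratch(dest_dir, scratch_root=None):
    """Write output in node-local scratch, then copy it to persistent storage.

    scratch_root=None lets tempfile pick the default temp location
    (normally /tmp, or $TMPDIR if it is set).
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # The scratch directory is removed when the with-block exits.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        out = Path(scratch) / "result.txt"    # hypothetical output file
        out.write_text("expensive result\n")  # stand-in for real work

        # Copy what must survive before the scratch space disappears.
        return Path(shutil.copy2(out, dest_dir / out.name))
```

The copy happens inside the `with` block on purpose: once the block exits (or, on the cluster, once the job completes) the scratch files are gone.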


### Rusty/popeye storage -- home
- `/mnt/home/$USER` AKA `$HOME` -- default path
- Put your source code and software installs here!
@@ -244,6 +267,13 @@ Activities where participants all actively work to foster an environment which e
https://wiki.flatironinstitute.org/SCC/Hardware/Storage


### Which statement is true about file systems at FI?
- A. I should put many small files in a single directory on ceph
- B. I should put large files in my home directory
- C. Home and ceph are the only options for storing data during a job
- D. Files stored in my home directory are backed up while ones on ceph are not



## Environment management

@@ -302,6 +332,7 @@ python pi.py 100000 0
- We could make our code more efficient...
- But let's throw some power at it, some options are:
- `MPI` (message passing interface) using `openmpi`
- `srun` to run multiple copies
- multiple serial jobs via `disBatch`
- could loop through calls to python in sbatch script, but hard to balance and error prone
- could use small jobs or job array with slurm, but this angers the compute gods
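
For the `disBatch` route, the task file is plain text with one shell command per line. This is a minimal sketch of generating one. It assumes that the second argument to `pi.py` is a seed, which the `python pi.py 100000 0` call suggests but does not confirm. The function name `make_task_lines` and the file name `tasks.txt` are hypothetical.

```python
def make_task_lines(n_tasks, n_samples=100000, script="pi.py"):
    """Return one shell command per task, each with a different second argument."""
    # Assumption: the second argument to pi.py is a seed, so varying it
    # gives independent runs.
    return [f"python {script} {n_samples} {seed}" for seed in range(n_tasks)]


if __name__ == "__main__":
    # Hypothetical file name for the disBatch task file.
    with open("tasks.txt", "w") as f:
        f.write("\n".join(make_task_lines(8)) + "\n")
```

Generating the list programmatically keeps each task a single serial command, so the scheduler-side tool can balance them instead of a hand-written loop in the sbatch script.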