---
title: "Using R on HPC Clusters      Part 2"
author: "George Ostrouchov"
institute: "Oak Ridge National Laboratory"
date: "<br><br><br><br><br><br> August 19, 2022 <br><br><span style = 'font-size: 50%;'> Background Image: FRONTIER, First Top500 exascale system, announced June 2022</span>"
output:
  xaringan::moon_reader:
    css: ["default", "default-fonts", "my-theme.css", "widths.css"]
    lib_dir: libs
    includes:
      after_body: insert-logo.html
    nature:
      titleSlideClass: ["right", "inverse"]
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
    self_contained: true
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
xaringanExtra::use_xaringan_extra(c("tile_view", "animate_css", "tachyons"))
hook_source <- knitr::knit_hooks$get('source')
knitr::knit_hooks$set(source = function(x, options) {
  x <- stringr::str_replace(x, "^[[:blank:]]?([^*].+?)[[:blank:]]*#<<[[:blank:]]*$", "*\\1")
  hook_source(x, options)
})
```
# Get this presentation:
`git clone https://github.com/RBigData/R4HPC.git`
* Open `R4HPC_Part2.html` in your web browser
* `?` toggles help
<br>
Slack workspace link for this workshop was emailed to you.
<br><br><br>
<small>
*Many thanks to my colleagues and former colleagues who contributed to the software and ideas presented here. See the RBigData organization on GitHub: https://github.com/RBigData. Also, many thanks to all R developers of packages used in this presentation.*
*This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).*</small>
---
# Today's Package Installs
## See `R4HPC/code_4` R script and shell scripts for your machine
---
## Running MPI on a Laptop
macOS in a Terminal window:
* `brew install openmpi`
* `mpirun -np 4 Rscript your_spmd_code.R`

Windows:
* Web page: https://docs.microsoft.com/en-us/message-passing-interface/microsoft-mpi
* Download: https://www.microsoft.com/en-us/download/details.aspx?id=100593
* `pbdMPI` has a Windows binary on CRAN
---
## Revisit `hello_balance.R` in `code_1`
```{r eval = FALSE, code = readr::read_lines("code_1/hello_balance.R", skip = 8)}
```
---
## Working with a remote cluster using R
```{r echo=FALSE, out.height=500}
knitr::include_graphics("pics/01-intro/Workflow.jpg")
```
---
background-image: url(pics/01-intro/WorkflowRunning.jpg)
background-position: top right
background-size: 20%
## Running Distributed on a Cluster
```{r echo=FALSE, out.height=500}
knitr::include_graphics("pics/01-intro/BatchRonCluster.jpg")
```
---
## Section I: Environment and Workflow
## Section II: Parallel Hardware and Software Overview
## Section III: Shared Memory Tools
## <mark>Section IV:</mark> **Distributed Memory Tools**
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7distributed.jpg)
background-position: bottom
background-size: 90%
# Distributed Memory Tools
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: bottom
background-size: 90%
# Message Passing Interface (MPI)
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
## Single Program Multiple Data (SPMD)
* N instances of the same code cooperate
* Each of the N instances has a `rank` in {0, ..., N-1}
* The `rank` determines any differences in work
* Instances run asynchronously
* SPMD parallelization is a generalization of the serial code
* Many rank-aware operations are automated
* Collective operations are high level and easy to learn
* Explicit point-to-point communications are an advanced topic
* Multilevel parallelism is possible
* Typically there is no manager; the ranks all cooperate, as in the minimal sketch below
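A minimal SPMD sketch (an assumed illustration, not part of the workshop code; it presumes `pbdMPI` is installed and the script is launched with something like `mpirun -np 4 Rscript hello_spmd.R`, a hypothetical file name):

```{r eval = FALSE}
# Every rank runs this same script; the rank decides any differences in work.
library(pbdMPI, quietly = TRUE)
init()
me <- comm.rank()                        # this instance's rank: 0, ..., N-1
if (me == 0) {
  cat("Rank 0 could read inputs here and bcast() them to the others\n")
}
comm.cat("Hello from rank", me, "of", comm.size(), "\n", all.rank = TRUE)
finalize()
```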
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
# pbdR Project
```{r echo=FALSE, out.height=180, out.width=800, fig.align='left'}
knitr::include_graphics("pics/01-intro/pbdRlib.jpg")
```
* Bridge HPC with the high productivity of R: expressive for data and modern statistics
* Keep syntax identical to R, when possible
* Software reuse philosophy:
* Don't reinvent the wheel when possible
* Introduce HPC standards with R flavor
* Use scalable HPC libraries with R convenience
* Simplify and use R intelligence where possible
???
Using HPC concepts and libraries
* Benefits the R user by exposure to standard HPC components
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
```{r echo=FALSE, out.height=80, out.width=300, fig.align='left'}
knitr::include_graphics("pics/01-intro/pbdRlib.jpg")
```
# Package `pbdMPI`
* Specializes in SPMD programming for HPC clusters
* Manages printing from ranks
* Provides chunking options
* Provides communicator splits for multilevel parallelism
* In situ capability to process data from other MPI codes without a copy
* A derivation and rethinking of the `Rmpi` package aimed at HPC clusters
* Simplified interface with fewer parameters (using R's S4 methods)
* Faster for matrix and array data - no serialization
???
* Prefer pbdMPI over Rmpi due to simplification and speed
* No serialization for arrays and vectors
* Drops spawning a cluster
* Because spawning suits a client-server relationship rather than batch SPMD on clusters
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
## `pbdMPI`: High-level Collective Communications
Each of these operations is performed across a `communicator` of ranks. The simplest communicator is all ranks, but arrays of ranks can be used for multilevel collectives. A short usage sketch follows the list.
* **`reduce()`** Reduces a set of same-size distributed vectors or arrays with an operation (+ is default). Fast because both communication and reduction are parallel and no serialization is needed.
* **`allreduce()`** Same as `reduce()` except all ranks in a `comm` get the result
* **`gather()`** Gathers a set of distributed objects
* **`allgather()`** Same as `gather()` except all ranks in a `comm` get the result
* **`bcast()`** Broadcasts an object from one rank to all in its `comm`
* **`scatter()`** Sends a different piece of an object from one rank to each rank in its `comm`
* **`barrier()`** Waits on all ranks in a `comm` before proceeding
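A minimal usage sketch of these collectives (an assumed example, not from the workshop code; run with something like `mpirun -np 4 Rscript collectives.R`):

```{r eval = FALSE}
library(pbdMPI, quietly = TRUE)
init()
x <- rep(comm.rank() + 1, 3)               # each rank owns a same-size vector
comm.print(allreduce(x))                   # elementwise sum; every rank gets the result
comm.print(gather(x))                      # every rank's x, collected on rank 0
y <- bcast(paste("greetings from rank", comm.rank()))  # rank 0's value wins
comm.cat(y, "\n", all.rank = TRUE)         # all ranks now hold rank 0's string
finalize()
```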
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
## `pbdMPI`: High-level Collective Operations
$\small \bf A = \sum_{i=1}^nX_i$ $\quad$ $\qquad$ $\qquad$ **`A = reduce(X)`** $\qquad$ $\qquad$ **`A = allreduce(X)`**
$\small \bf A = \left[ X_1 | X_2 | \cdots | X_n \right]$ $\qquad$ **`A = gather(X)`** $\qquad$ $\qquad$ **`A = allgather(X)`**
```{r echo=FALSE, out.height=250, fig.align='right'}
knitr::include_graphics("pics/01-intro/RHistory3sub.png")
```
???
* Powerful: communication and reduction are highly parallel
* That is why it beats Spark/MapReduce
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
## `pbdMPI`: Functions to Facilitate SPMD Programming
* **`comm.chunk()`** splits a number into chunks in various ways and various formats. Tailored for SPMD programming, returning rank-specific results.
* **`comm.set.seed()`** sets the seed of a parallel RNG. If diff = FALSE, then all ranks generate the same stream. Otherwise, ranks generate different streams.
* **`comm.print()`** and **`comm.cat()`** print by default from rank 0 only, with options to print from any or all ranks.
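A small sketch combining the three helpers (an assumed illustration; run with `mpirun -np 4 Rscript ...`):

```{r eval = FALSE}
library(pbdMPI, quietly = TRUE)
init()
my_rows <- comm.chunk(1000, form = "vector")       # this rank's share of 1000 row indices
comm.set.seed(1234, diff = TRUE)                   # reproducible, rank-distinct RNG streams
x <- matrix(rnorm(length(my_rows) * 5), ncol = 5)  # each rank simulates only its chunk
comm.print(dim(x), all.rank = TRUE)                # prints from every rank
comm.cat("total rows:", allreduce(nrow(x)), "\n")  # printed from rank 0 only
finalize()
```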
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide6.png)
background-position: bottom
background-size: 90%
## Distributed Programming Works in Shared Memory
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
## Hands on Session 5: Hello MPI Ranks
`code_5/hello_world.R`
```{r eval = FALSE, code = readLines("code_5/hello_world.R")}
```
### **Rank** distinguishes the parallel copies of the same code
---
### Hands-on Session 5 - `R4HPC/code_2/rf_serial.R`
```{r eval=FALSE, code = readLines("code_2/rf_serial.R")}
```
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
## Hands on Session 5: Random Forest with MPI
`code_5/rf_mpi.R`
```{r eval = FALSE, code = readLines("code_5/rf_mpi.R")}
```
---
## Hands on Session 5: `comm.chunk()`
`mpi_shorts/chunk.r`
```{r eval = FALSE, code = readLines("mpi_shorts/chunk.r")}
```
---
## Hands on Session 5: other short MPI codes
bcast.r chunk.r comm_split.R cov.r gather-named.r gather.r gather-unequal.r hello-p.r hello.r map-reduce.r mcsim.r ols.r qr-cop.r rank.r reduce-mat.r timer.r
* These short codes only use `pbdMPI` and can run on a laptop in a terminal window if you installed OpenMPI
* On the clusters these can run on a login node with a small $^*$ number of ranks
* While in the `mpi_shorts` directory, run the following:
* `source ../code_4/modules_MACHINE.sh`
* `mpirun -np 4 Rscript your_script.r`
.footnote[
$^*$ Note that running long or large jobs on login nodes is strongly discouraged
]
---
## Shared Memory - MPI or fork?
.w80.pull-left[
* fork via `mclapply()` + `do.call()` combine
```{r echo=FALSE, out.height=150, out.width=130}
knitr::include_graphics("pics/mpi/fork.jpg")
```
* MPI replicated data + `allreduce()`
```{r echo=FALSE, out.height=150}
knitr::include_graphics("pics/mpi/mpi-replicate.png")
```
* MPI chunked data + `allreduce()`
```{r echo=FALSE, out.height=58}
knitr::include_graphics("pics/mpi/mpi-partition.png")
```
]
.w20.pull-right[
`do.call()` is serial

`allreduce()` is parallel
]
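A compact sketch of the two combines (an assumed example, not from the workshop code; the fork version wants 4 cores, the MPI version 4 ranks):

```{r eval = FALSE}
## Two ways to sum sqrt(1:1e6) in parallel -- shown together, in practice pick one.

## Fork: parallel map, then a serial combine on the parent process
library(parallel)
chunks  <- split(1:1e6, rep(1:4, length.out = 1e6))
partial <- mclapply(chunks, function(i) sum(sqrt(i)), mc.cores = 4)
total   <- do.call(sum, partial)            # do.call() combine runs serially

## MPI: chunked data, then a parallel combine (mpirun -np 4 Rscript sums.R)
library(pbdMPI, quietly = TRUE)
init()
my_i  <- comm.chunk(1e6, form = "vector")   # this rank's chunk of indices
total <- allreduce(sum(sqrt(my_i)))         # allreduce() combine runs in parallel
comm.print(total)
finalize()
```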
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
```{r echo=FALSE, out.height=100, out.width=300, fig.align='left'}
knitr::include_graphics("pics/01-intro/pbdRlib.jpg")
```
# Package `pbdDMAT`
* ScaLAPACK: Distributed version of LAPACK (uses PBLAS/BLAS but not LAPACK)
* 2d Block-Cyclic data layout - mostly automated in `pbdDMAT` package
* BLACS: Communication collectives for distributed matrix computation
* PBLAS: Distributed BLAS (uses standard BLAS within blocks)
* R code is identical to serial for most matrix operations, via operator overloading on the `ddmatrix` class (see the sketch below)
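A minimal `pbdDMAT` sketch of the identical-syntax point (an assumed illustration, not from the workshop code):

```{r eval = FALSE}
library(pbdDMAT, quietly = TRUE)
init.grid()                              # set up the 2d process grid (BLACS)
x  <- matrix(rnorm(16 * 4), nrow = 16)   # same small matrix on every rank (demo only)
dx <- as.ddmatrix(x)                     # distribute into a 2d block-cyclic layout
cp <- t(dx) %*% dx                       # same syntax as serial R; PBLAS underneath
comm.print(as.matrix(cp))                # gather back to an ordinary matrix and print
finalize()
```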
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
```{r echo=FALSE, out.height=100, out.width=300, fig.align='left'}
knitr::include_graphics("pics/01-intro/pbdRlib.jpg")
```
# Package `pbdML`
* A demonstration of `pbdDMAT` package capabilities
* Includes
  * Randomized SVD
  * Randomized principal components analysis
  * Robust principal component analysis, from "Robust Principal Component Analysis?" (https://arxiv.org/pdf/0912.3599.pdf)
---
## Hands on Session $\quad$ rsvd:
#### Singular value decomposition via randomized sketching
<br><br>
Randomized sketching produces fast new alternatives to classical numerical linear algebra computations.
<br>
Guarantees are given with probability statements instead of classical error analysis.
<br> <br>
Martinsson, P., & Tropp, J. (2020). Randomized numerical linear algebra: Foundations and algorithms. Acta Numerica, 29, 403-572. [https://doi.org/10.48550/arXiv.2002.01387](https://doi.org/10.48550/arXiv.2002.01387)
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
## Hands on Session $\quad$ rsvd:
#### Randomized SVD via subspace embedding
Given an $n\times p$ matrix $X$ and $k = r + 10$, where $r$ is the *effective rank* of $X$:
1. Construct a $p\times k$ random matrix $\Omega$
2. Form $Y = X \Omega$
3. Decompose $Y = QR$
$Q$ is an orthogonal basis for the column space of $Y$, which with high probability captures the column space of $X$. To get the SVD of $X$:
1. Compute $C= Q^TX$
2. Decompose $C = \hat{U}\Sigma V^T$
3. Compute $U = Q\hat{U}$
4. Truncate the factorization to $r$ components (a serial sketch of these steps follows)
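A serial base-R sketch of these steps, using a hypothetical helper `rsvd_sketch()` (illustration only; the hands-on code uses the distributed implementation in `pbdML`):

```{r eval = FALSE}
rsvd_sketch <- function(X, r, oversample = 10) {
  k <- r + oversample
  Omega <- matrix(rnorm(ncol(X) * k), ncol(X), k)   # 1. random p x k test matrix
  Y <- X %*% Omega                                  # 2. sample the column space of X
  Q <- qr.Q(qr(Y))                                  # 3. orthogonal basis for Y
  C <- crossprod(Q, X)                              #    C = t(Q) %*% X
  s <- svd(C)                                       #    C = U_hat Sigma V^T
  U <- Q %*% s$u                                    #    U = Q %*% U_hat
  list(d = s$d[1:r], u = U[, 1:r], v = s$v[, 1:r])  # 4. truncate to r components
}
```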
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
## MNIST Data
```{r echo=FALSE, out.height=500, fig.align='left'}
knitr::include_graphics("pics/mnist/Rplots.png")
```
---
`mnist_rsvd.R`
```{r eval=FALSE, code = readLines("rsvd/mnist_rsvd.R")}
```
---
### Hands-on Session rsvd
```{r eval=FALSE, code = readLines("rsvd/rsvd_snippet.donotrun")}
```
---
background-image: url(pics/Mangalore/ParallelSoftware/Slide7mpi.jpg)
background-position: top right
background-size: 20%
```{r echo=FALSE, out.height=100, out.width=300, fig.align='left'}
knitr::include_graphics("pics/01-intro/pbdRlib.jpg")
```
# Package `kazaam`
* Distributed methods for tall matrices (and some for wide matrices) that exploit the short dimension for speed and long dimension for parallelism
* Tall matrices, `shaq` class, are chunked by blocks of rows
* Wide matrices, `tshaq` class, are chunked by blocks of columns
* Much like `pbdDMAT`, most matrix operations in R code are identical to serial, through operator overloading and the `shaq` S4 class
.footnote[
Naming is a tongue-in-cheek play on Shaquille O'Neal ('Shaq') and the film 'Kazaam'
]
---
# Hands on Session `kazaam`
---
# Acknowledgments
*This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.*
*This research used resources of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.*
*Slides are made with the xaringan R package. It is an R Markdown extension based on the JavaScript library remark.js.*