-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtransform.qmd
662 lines (412 loc) · 29.9 KB
/
transform.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
---
format:
revealjs:
theme: [serif, custom.scss]
scrollable: true
logo: media/chopr.png
footer: Arcus Education / Children's Hospital of Philadelphia (CHOP) R User Group
css: styles.css
---
- Use keyboard arrow keys to
- advance ( → ) and
- go back ( ← )
- Type "s" to see speaker notes
- Type "?" to see other keyboard shortcuts
```{r echo = FALSE}
library(countdown)
```
## Part III: Transform {.section-break background-image="media/cork-board.png" background-size="400px" background-repeat="repeat"}
::: notes
Alright, now let's go into transforming data!
:::
## The Data Analysis Pipeline
![](media/tidy_transform_phase.png){.two-thirds fig-align="center"}
::: notes
In this section, we'll build skills that deal with tidying and transforming data by reshaping data to meet your needs.
:::
------------------------------------------------------------------------
::: {.columns .v-center-container}
::: {.column width="50%"}
![](media/dplyr_logo.png)
:::
::: {.column width="50%"}
a grammar for transforming data frames
`library(dplyr)` OR `library(tidyverse)`
:::
:::
::: notes
`dplyr` (pronounced dee-ply-er, a play on words with "data" and "pliers") is a useful R package we'll discuss in this section. The various functions we'll use, like `select`, `filter`, and `mutate` are all functions that belong to the `dplyr` package.
Just as a reminder, in R, we bring in the functionality of a package by using the `library()` command. Because `dplyr` forms part of the `tidyverse` suite of packages, we can bring in the useful functions of `dplyr` by either using the `library(dplyr)` command or the `library(tidyverse)` command.
The idea behind dplyr is that any data analytic task can be broken down into a small number of basic or atomic tasks, and there should be a consistent way to specify each atomic task - a grammar.
As we will see in this last section, each dplyr function takes a data frame, does something with it, and then returns a modified data frame as its output. Dplyr functions can be strung together to create powerful data analysis pipelines in just a few lines of code.
:::
## Subsetting Data {.section-break background-image="media/cork-board.png" background-size="400px" background-repeat="repeat"}
::: notes
Often, you have a large data frame but want to create a graph or analyze data from only a small part of it. The `dplyr` package, part of the larger [tidyverse](https://www.tidyverse.org/) set of packages, works great for this purpose.
Let's look at how you can subset a data frame (choose only certain columns and/or rows) by using `dplyr`.
:::
## Subsetting Columns vs Rows
::: columns
::: {.column width="50%"}
`select()`
:::
::: {.column width="50%"}
![](media/dplyr_select.png){fig-align="center"}
:::
:::
::: columns
::: {.column width="50%"}
`filter()`
:::
::: {.column width="50%"}
![](media/dplyr_filter.png){fig-align="center"}
:::
:::
::: notes
`dplyr` provides two functions for subsetting data frames: `select()` for subsetting columns, and `filter()` for subsetting rows:
`select()` reshapes data so that it includes only the columns you specify.
`filter()` reshapes data so that it includes only the rows that meet your conditions.
:::
## select()
select(data_frame, ...)
::: notes
Let's look at `select()` first. The `select()` function extracts **columns** from a data frame, using the column **name(s)** as argument(s).
`select()` takes a data frame as its first argument. After that it takes any number of additional arguments that specify the names of the columns that you want to pick.
We extract columns by name with code that looks like this, and we replace the three dots with the names of the columns we want to keep.
:::
## select()
select(covid_testing, mrn, last_name)
![](media/select_covid_example.png){fig-align="center"}
::: notes
Let's examine the code on this slide.
This `select` statement will take the data frame `covid_testing`, and return a new data frame that only has the columns `mrn` and `last_name`, shown here in blue to help you visualize this transformation.
An important point to note here is that `select` **will not modify the original data frame** but simply returns the altered data frame you asked for, without saving it automatically.
If you write the `select` statement like this it will simply print out the result in the console or in your R Markdown document. If you want to **capture** the modified data frame you need to **assign** it to a named object.
:::
## Your Turn #1 {.smaller background-color="#e3faf1"}
Which of the following will select the first_name column from the covid_testing data frame and capture the result in a data frame named newdata?
**A)** newdata = select(first_name, covid_testing)
**B)** newdata \<- select(covid_testing, first_name)
**C)** select(newdata, covid_testing, first_name)
**D)** newdata \<- select(covid_testing, First_Name)
**E)** Both B and D
A poll will come up for you to put in your answer in Teams!
```{r}
countdown(minutes=1)
```
::: notes
Great, we have some folks saying [whatever], others are suggesting [whatever]. The answer is B.
A isn't correct, because the arguments are in the wrong order. The first argument in the tidyverse functions we're studying today is always going to be the data frame you want to work with. That means the first argument should be covid_testing.
C isn't correct because you have to use the assignment arrow to save the new, one-column-only data frame to an object called newdata. You don't pass the name you want to apply to the object as an argument.
D isn't right because capitalization matters! First name with a capital F is not the same as first name with a lower case f.
So E is also clearly incorrect.
:::
## filter()
filter(data_frame, ...)
::: notes
One of the most important `dplyr` functions to know about is `filter()`. `filter()` extracts rows, and it does that based on **logical criteria**, or a **condition** that can be evaluated to be true (keep that row as part of our subset) or false (don't keep that row).
Like `select()`, `filter()` takes a data frame as its first argument. The second argument is a condition or logical test. R then performs that logical test on each row of the dataset and returns all rows in which the logical test was true.
To extract rows that meet logical criteria, we write code that looks like this, and we replace the three dots with the condition we want to test for each row.
:::
## filter()
filter(covid_testing, mrn == 5000083)
![](media/filter_covid_example.png){fig-align="center"}
::: notes
To give you an example: the logical test here is whether or not the `mrn` value is equal to 5 00 00 83. This is **false** for the first three rows. In these rows, the `mrn` value is something else. For the 4th row, however, it is **true** that the `mrn` value is equal to 5 00 00 83.
This filter statement will return a data frame that only contains the 4th row, in which the logical condition is **true**, as shown on the right.
Notice that we're using a double equals here. That's very important!
:::
## A Potential Pitfall! {.smaller}
::: {.code-block .warning-red}
Error: Problem with `filter()` input `..1`. x Input `..1` is named. ℹ This usually means that you've used `=` instead of `==`.
:::
OR
::: {.code-block .warning-red}
Error: unexpected '='
:::
OR
::: {.code-block .warning-red}
invalid (do_set) left-hand side to assignment
:::
::: notes
One common issue to be aware of is the difference between the single equals and the double equals operators.
In R, using a single equals sign assigns a value. It demands, "make these things equal."
The double equals sign does not assign, but compares. It asks "**are** these things equal?".
That's why we use double equals in the context of a logical test that compares the left hand side, e.g. `mrn`, with the right hand side, e.g. 5000083, to check whether or not they are the same.
If you use the wrong kind of equals, you'll get an error. This is a very common mistake, and one you're almost guaranteed to accidentally commit at one point or another! This is what some of those scary errors look like:
:::
## Logical Operators {.smaller}
| logical expression | means | example |
|:------------------:|--------------------------|---------------------------|
| `x < y` | less than | `pan_day < 10` |
| `x > y` | greater than | `mrn > 5001000` |
| `x == y` | equal to | `first_name == last_name` |
| `x <= y` | less than or equal to | `mrn <= 5000000` |
| `x >= y` | greater than or equal to | `pan_day >= 30` |
| `x != y` | not equal to | `test_id != "covid"` |
| `is.na(x)` | a missing value | `is.na(clinic_name)` |
| `!is.na(x)` | not a missing value | `!is.na(pan_day)` |
::: notes
Here are some important logical operators to know about. They will all come in handy when you're filtering rows of a data frame. `x` and `y` each represent expressions, which could be column names or constant values or a combination thereof.
We've already seen the double equals. There are also the less than or and greater than operators. These operators also come as "or equal to" versions.
Use exclamation point equals (some people say "bang equals") if you want to select rows in which a value is **not** equal to another value.
`is.na()` is how you can test for missing values (`NA` in R). This comes in handy when you want to remove missing values from your data, which we'll see later.
:::
## Your Turn #2 {.smaller background-color="#e3faf1"}
Write a `filter()` statement that returns a data frame containing only the rows from `covid_testing` in which the `last_name` column is NOT equal to "stark".
(You don't have to capture the returned data frame)
Type your response in the chat!
```{r}
countdown(minutes = 1)
```
::: notes
For this, I just want you to think about how to write this, you don't have to test it in R. Write a filter statement that will give us a data frame with no starks in it. Read this over, and give it a shot!.
:::
------------------------------------------------------------------------
filter(covid_testing, last_name != "stark")
![](media/covid_filter_example.png)
::: notes
Great, so [name] was on the right track, and a few others I see also posted the right answer. Yep, the answer is filter, open parenthesis, covid_testing bang equal stark, in quotes, close parenthesis.
The logical operator for NOT equals is exclamation point equals. When we go through the covid_testing data frame row by row, we see this expression will be false for the first two rows where the last_name is "stark" and true for the last two rows where the last_name isn't "stark". The second point is that when you're doing a comparison with a literal character string, such as "stark", that needs to go into quotes.
:::
## Your Turn #3 {.smaller background-color="#e3faf1"}
Which of these would successfully filter the covid_testing data frame to only tests with positive results?
**A)** filter(covid_testing, result == positive)
**B)** filter(covid_testing, result = "positive")
**C)** filter(covid_testing, result == "positive")
**D)** filter(covid_testing, positive == "result")
```{r}
countdown(minutes = 1)
```
::: notes
Here we have another multiple choice to see if you're on your toes. Only one of these is correct? Which one? Post what you think in chat.
------------------------------------------------------------------------
A is wrong because "positive" is a character string (it's not a number or a logical value such as TRUE/FALSE). B is wrong because you're trying to do a comparison with a single equals. C is correct! D is wrong because it flips the positions of the comparison; the column name goes to the left and the comparator on the right.
:::
## The Pipe Operator %\>% {.section-break background-image="media/cork-board.png" background-size="400px" background-repeat="repeat"}
::: notes
Let's talk about the pipe operator, which we can use to build pipelines!
:::
## The Pipe Operator %\>%
The pipe operator we'll use is `%>%`
(You can also use `|>`, in R 4.1.0 forward)
::: notes
One of the most powerful concepts in the `tidyverse` suite of packages is the pipe operator, which is written in two possible ways:
- percent, greater than, percent (`%>%`) (this is the original pipe which gets included as part of `dplyr` and `tidyverse`)
- vertical pipe, greater than (`|>`) (this is a newer option, and is now "native", meaning it comes from base R, if you're using R version 4.1.0 or later)
We're going to use the original pipe, for two reasons:
1) There are very specific occasions, which we won't run into today, in which the older pipe and the newer pipe do different things.
2) We think the older pipe is still going to be what you see most, at least for maybe another year.
Still, we want you to know that a newer version of the pipe exists and you might see it or use it in the future! It works in an almost identical way.
:::
## The Pipe Operator {.smaller}
![](media/pipe_mini.png){.half fig-align="center"}
::: small-text
Passes the **object on the left** as the **first argument** to the **function on the right**
`covid_testing %>% filter(pan_day <= 10)` is equivalent to `filter(covid_testing, pan_day <= 10)`
OR, if you in the future use the "new" pipe:
`covid_testing |> filter(pan_day <= 10)` is equivalent to `filter(covid_testing, pan_day <= 10)`
:::
::: notes
Both pipe operators pass the **object on its left** as the **first argument** to the **function on its right**.
In this workshop, **we'll use the "original" pipe** (that's the one that has percent greater than percent) in code examples and quiz questions, because we think this is the one you'll see the most in code that your coworkers share with you or you find in online examples. We're also running on the latest stable version of R that ships with our server software, which is 3.6. This will gradually change, and when we get 4.1.0 as the default R version, we'll likely change these materials to reflect that.
That means, and I'm going to read the top line of code in blue, that the statement "covid_testing, pipe, filter such that pan_day is less than or equal to ten" is equivalent to "filter the covid_testing data frame such that pan day is less than or equal to ten". Those two lines of code are equivalent.
In both cases we're taking the `covid_testing` data frame, passing it as the first argument to the `filter()` function, and adding a condition that we're filtering by. In our case that condition is `pan_day` less than or equal to 10.
We could say the same thing of the second line of blue code on your screen which uses the newer pipe.
This is the last time you'll see that new pipe today... from here out we're going to use the old favorite percent greater than percent.
:::
------------------------------------------------------------------------
![](media/covid_pipeline_example.png){.two-thirds fig-align="center"}
- Start with the `covid_testing` data frame. **THEN**
- Select so that we get only certain columns. **THEN**
- Filter so that we get only certain rows.
::: notes
Here's why the pipe (`%>%` or `|>`) is so useful.
"Tidy" functions like `select()`, `filter()`, and others we'll see later always have as first argument a data frame, and they always return a data frame as well. Data frame in, data frame out.
This makes it possible to create a **pipeline** in which a data frame object is handed from one `dplyr` function to the next. The data frame result of step 1 becomes the data frame starting point for step 2, then the result of step 2 becomes the starting point for step 3, and so on.
For example, here we start with `covid_testing`, then `select` the `last_name` and `result` columns, then `filter` to get rows where `result` is equal to "positive".
You might wonder why we've put each step in its own line. Is this a requirement? No, it's not. Many R users like to use **whitespace** (new lines, tabs, spaces, indents) to make their code more human readable.
By connecting logical steps, you can get a **pipeline** of data analysis steps which are concise and also fairly human readable. You can think of the pipe symbol as the word "then...", describing the steps in order.
This approach to coding is powerful because it makes it much easier for someone who doesn't know R well to read and understand your code as a series of instructions.
:::
## Your Turn #4 {.smaller background-color="#e3faf1"}
Rewrite the following statement with a pipe:
`select(mydata, first_name, last_name)`
Type the answer in the chat!
```{r}
countdown(minutes = 1)
```
::: notes
OK, I want to see if you grasp this concept, as it's pretty important, moving forward. How would you rewrite the statement on your screen, select mydata comma first name comma last name, and use the pipe syntax instead? Share what you think the answer is.
...
Yep, that's exactly right! You'd write mydata, the pipe symbol, and then select first name comma last name. Any questions on that?
:::
## Create or Update Columns {.section-break background-image="media/cork-board.png" background-size="400px" background-repeat="repeat"}
::: notes
Let's say you want to add a new column to your data frame, or update a column by changing it in some way (say, convert kilograms to pounds). dplyr has a function for that, too!
:::
## mutate()
Create new or updated, optionally calculated columns.
![](media/mutate_0.png)
::: notes
`mutate()` is an extremely useful `dplyr` function, and you can use it to make new variables / columns. That's what we'll use it for here. You can also use `mutate()` to change existing columns (say, turn an entire column lowercase or round or scale a numeric value).
Like all `dplyr` functions, `mutate()` takes a data frame as its first argument. After that, you tell it what to name the new column and what should be in it. This is done using **name-value expressions**.
In **name-value expression**, you have:
- a name
- an equals sign (`=`), and
- a value
:::
## mutate()
Create new or updated, optionally calculated columns.
![](media/mutate_1.png)
::: notes
The **name** is the name of the column that you'd like to create or update.
:::
## mutate()
Create new or updated, optionally calculated columns.
![](media/mutate_2.png)
::: notes
Then you have a **single equals sign** - because you're assigning a value (`=`), you're not asking whether two things are equal (`==`).
:::
## mutate()
Create new or updated, optionally calculated columns.
![](media/mutate_3.png)
::: notes
Then you have **value**. This can be a constant, e.g. 100, or a calculation that involves data from already existing columns.
:::
## mutate()
mutate(covid_testing,
col_rec_tat_mins = col_rec_tat * 60)
![](media/mutate_covid_example.png)
::: notes
For example, let's take a look at one of the columns of `covid_testing` that we haven't looked at yet in this workshop: `col_rec_tat`.
This column contains the specimen collection ("col") to received-in-lab ("rec") turn around time ("tat"), in hours. Let's create a new column, that contains the same data, but in minutes instead of hours.
To do so, you write `mutate(covid_testing,` followed by a name-value expression. The left part is the new column name, which we could choose to be `col_rec_tat_mins`. Then we have a single equals sign. Then the calculation, which is `col_rec_tat` times 60.
:::
## mutate()
mutate(covid_testing,
ct_value = round(ct_value))
::: notes
If, on the other hand, you wanted to change an **existing** column using `mutate()`, you could do it like this. This command takes the column ct_value, which currently holds decimal values, rounds it to the nearest whole number, and then uses that as the new set of values for ct_value.
:::
## Your Turn #5 {.smaller background-color="#e3faf1"}
Open `03 – Transform.qmd` and work through the exercises for the section that says "Your Turn #5."
Click "thumbs up" when you are finished.
```{r}
countdown(minutes = 5)
```
::: notes
Now let's do some hands-on work. Please go to your "exercises" folder and open the 03 transform file. You'll have five minutes to go through the first part of that file. There's a place where it says "stop here...", so please do!
...
Everyone ready? I'm going to go through the solutions. In this first exercise, I'll start with covid_testing, then add a pipe, and then use my filter, making sure I use the double equal. So clinic_name == "picu". Finally, I'll add another pipe and then keep only the columns I care about, using select(rec_ver_tat, pan_day).
Then I'll use mutate without a pipe and make a new column composed of the sum of two existing columns. I'll do it like this: mutate covid_testing comma total_tat equals col_rec_tat plus rec_ver_tat.
And finally, I'll take the data frame name out of that mutate and use it as the start of a pipeline. So I have covid_testing, then, mutate, total_tat equals col_rec_tat plus rec_ver_tat.
:::
## Group By and Summarize
A very common use case is to divide your data into groups, and get information about each group.
For this, we'll use `group_by` and `summarize`.
![](media/group_by_summarise_mini.png){.half fig-align="center"}
::: notes
Group by combined with summarize is a way for us to lump cases together and then get a statistic for each group. For example, maybe you want the median blood sugar for girls and the median blood sugar for boys in your study, or the maximum wait time for King of Prussia emergency department patients and the maximum wait time for University City emergency department patients.
When you use group by, you have to tell R how to separate your cases into groups. In the image here, there are three groups, each of which is represented by a different shade of gold. Any variable that is categorical data can be used to group. For example, you can group by sex, or race, or zip code. Maybe these three groups are three states, like New Jersey, Pennsylvania, and Delaware!
Once you have your data in groups, you can then use the summarize command to get summary statistics for each group. The summary for each group is represented in blue in this small image.
Summarizing can take lots of different forms! Sometimes you want to know how big the group is, how many members it has. Sometimes you want to know what the average value of something is per group, or what the maximum value is. You can also summarize and give several different measures for each group, like maximum, minimum, mean, and median. It looks like in this image there are two values given for each group. Maybe we have two values for New Jersey, Pennsylvania and Delaware, like the number of patients we have in each state and the number of patients in each state using Medicaid.
:::
## Additional Practice (Time Permitting)
If time permits:
Open `03 – Transform.qmd` and work through the exercises for the section that says "Your Turn #6. We'll do this together!
::: notes
Say one of these, depending on your time frame:
This is an optional exercise, and I think we have time to do it!
OR
We don't have time to do this optional exercise right now, but you have access to your project and these slides, and this might be fun for you to do in your own time later).
Please go to your "exercises" folder and open the 03 transform file. Give me that thumbs up when you're in that file, and you are looking at the section called "Your Turn 6." We haven't had a chance to talk about group by and aggregation yet, so this is a section of code we'll work on together.
Everyone ready?
Great, let's begin by reading the instructions, which tell us we want to compare different clinics.
> You are interested in understanding the relative utilization of COVID tests and the range of turnaround times across locations, which are captured with the clinic_name variable. Use group_by and summarize to calculate:
> a\) The number of orders ordered by each clinic/unit, creating a new summary variable num_orders
> b\) The median total turnaround time for each clinic/unit (using the total_tat variable you created in the previous exercise), using a new summary variable median_tat.
We're also given a hint in the exercise file. It tells us, "The function to calculate a median is (predictably) median(...)."
Let's start by adding our starting data frame. What's the name of our starting data frame again? The one with all the data in it? That's right, covid\_testing. I'm adding that to the first line of this chunk.
And we want to use group\_by.
When you use group by, you need to tell R what variable you're using to separate data into groups. Here, we want to get a statistic for each clinic, so we'll add `group_by(clinic_name)` in the second part of this pipeline. That will form one group for each clinic name.
What were we asked to do? Let's look at letter (a) again. We were asked to find the number of orders ordered by each clinic/unit, creating a new summary variable `num_orders`.
We'll want to put some kind of function here that counts the number of members in each group. In R, a very easy way to do this is to use the `n` function. Here I'm just going to put n, followed by an open and closed parenthesis.
But we were also asked to get a second summarizing variable for each group. What was that? Well, in letter (b) we read that we're asked to calculate median total turnaround time for each clinic/unit (using the total_tat variable you created in the previous exercise), using a new summary variable median_tat.
We were also told that we could use the function `median()` to calculate a median value. So how do you think we can calculate the median total TAT for each group? Type into chat if you think you know!
...
That's right! We're going to add `median(total_tat)` as the value of our new variable, `median_tat`.
Now let's run this chunk and see if I typed everything correctly! Great! Now we have values by groups, and we set the grouping to be each clinic having its own group.
Group by and aggregation functions together do a great job of helping you characterize your data and notice groupwise differences, whether that's by location, sex, race, disease status, or some other category!
:::
## Recap {.smaller}
::: {.columns .centered width="80%"}
::: {.column width="25%"}
![](media/select_mini.png){.two-thirds fig-align="center"}
:::
::: {.column width="75%"}
`select()` subsets columns by name
:::
:::
::: {.columns .centered width="80%"}
::: {.column width="25%"}
![](media/filter_mini.png){.two-thirds fig-align="center"}
:::
::: {.column width="75%"}
`filter()` subsets rows by a logical condition
:::
:::
::: {.columns .centered width="80%"}
::: {.column width="25%"}
![](media/mutate_mini.png){fig-align="center"}
:::
::: {.column width="75%"}
`mutate()` creates new calculated columns or changes existing columns
:::
:::
::: {.columns .centered width="80%"}
::: {.column width="25%"}
![](media/pipe_mini_1.png){fig-align="center"}
:::
::: {.column width="75%"}
Use the pipe operator `%>%` to combine dplyr functions into a pipeline
:::
:::
::: {.columns .centered width="80%"}
::: {.column width="25%"}
![](media/group_by_summarise_mini.png){.two-thirds fig-align="center"}
:::
::: {.column width="75%"}
`group_by()` with `summarize()` gives per-group statistics
:::
:::
::: notes
To recap, `dplyr` is a package you can load in R that provides a grammar for transforming data frames. Some of the key `dplyr` functions are:
`select()`, which subsets columns by name `filter()`, which subsets rows by a logical condition, and `mutate()`, which creates new calculated columns or changes existing columns
Additionally, `dplyr` and other `tidyverse` packages make use of the pipe operator, which can be used to string together `dplyr` functions into a pipeline that performs several transformations.
Finally, `group_by()` and `summarize()` work together to allow you to calculate per-group summary statistics.
:::
## What Else? {.section-break background-image="media/cork-board.png" background-size="400px" background-repeat="repeat"}
## Cheatsheet (more dplyr functions!)
![](media/dplyr_cheatsheet_snapshot.png){.two-thirds .bordered fig-align="center"}
::: notes
RStudio creates and distributes a number of cheatsheets for various purposes. You can find them by clicking in the **Help menu** in RStudio -- try that now! Here's an image of the `dplyr` cheatsheet. As you can see, there are lots of other functions that dplyr offers.
Other `dplyr` functions include `arrange()`, `distinct()`, `group_by()` (which is especially helpful when combined with `summarize()`), and many more!
:::
------------------------------------------------------------------------
![](media/tidyverse_logos.png){.two-thirds fig-align="center"}
::: notes
Beyond dplyr, there are a number of other [`tidyverse`](https://www.tidyverse.org/) packages that provide powerful tools for data transformation:
- `tidyr` provides functions that allow you to convert messy data frames into tidy ones
- `lubridate` provides functions to manipulate times and dates
- `stringr` provides tools for manipulating text strings
- `purrr` offers advanced functionality to automate complex data transformations
- `dbplyr` allows you to interact with a table inside a database as if it were a data frame
:::
## Next Up: Final Notes
If you want to look at Dashboards, a section we have decided to cut for time, you can find that here: [Dashboards](https://joy-payton-chop.quarto.pub/intro-to-r-for-clinical-data-2023/dashboards).
But we'll be moving on to [Final Notes](https://joy-payton-chop.quarto.pub/intro-to-r-for-clinical-data-2023/final).
::: notes
We've given this workshop a few times, and we think that it makes sense to have a relaxed pace that teaches you slowly. We've decided not to teach the fourth set of slides, the one on Dashboards, this time around. But the material is there for you if you want to look at it and practice with it! You can even look at the speaker notes from when we've taught this previously.
Instead, we're going to jump right to our final notes. Joy will close things out for us today with some next steps for growing in R!
:::