# Reproducible research #1
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week5.pdf) a pdf of the lecture slides covering this topic.
```{r echo = FALSE}
library(faraway)
data(nepali)
```
## What is reproducible research?
A data analysis is **reproducible** if all the information (data, files, etc.) required is available for someone else to re-do your entire analysis. This includes:
- Data available
- All code for cleaning raw data
- All code and software (specific versions, packages) for analysis
Some advantages of making your research reproducible are:
- You can (easily) figure out what you did six months from now.
- You can (easily) make adjustments to code or data, even early in the process, and re-run all analysis.
- When you're ready to publish, you can (easily) do a last double-check of your full analysis, from cleaning the raw data through generating figures and tables for the paper.
- You can pass along or share a project with others.
- You can give useful code examples to people who want to extend your research.
Here is a famous research example of the dangers of writing code that is hard to double-check or confirm:
- [The Economist](http://www.economist.com/node/21528593)
- [The New York Times](http://www.nytimes.com/2011/07/08/health/research/08genes.html?_r=0)
- [Simply Statistics](http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/)
Some of the steps required to make research reproducible are:
- All your raw data should be saved in the project directory. You should have clear documentation on the source of all this data.
- Scripts should be included with all the code used to clean this data into the data set(s) used for final analyses and to create any figures and tables.
- You should include details on the versions of any software used in analysis (for R, this includes the version of R as well as versions of all packages used).
- If possible, there should be no "by hand" steps used in the analysis; instead, all steps should be done using code saved in scripts. For example, you should use a script to clean data, rather than cleaning it by hand in Excel. If any "non-scriptable" steps are unavoidable, you should very clearly document those steps.
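As one way to record the software versions mentioned in the list above, base R can report the R version and the versions of loaded packages. This is a minimal sketch; the `stats` package is used only as a stand-in for whichever packages your own analysis uses.

```{r eval = FALSE}
# Record the version of R being used
r_version <- R.version.string
r_version

# Record the version of one specific package
# ("stats" is just a placeholder for a package you use)
stats_version <- packageVersion("stats")
stats_version

# Record the R version, operating system, and versions of loaded packages
session_details <- sessionInfo()
session_details
```

Including a call like `sessionInfo()` at the end of a report is one simple way to keep this information alongside the results.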
There are several software tools that can help you improve the reproducibility of your research:
- **`knitr`**: Create files that include both your code and text. These can be rendered to create final reports and papers. They keep code within the final file for the report.
- **`knitr` complements**: Create fancier tables and figures within RMarkdown documents. Packages include `tikzDevice`, `animation`, `xtable`, and `pander`.
- **`packrat`**: Save versions of each package used for the analysis, then load those package versions when code is run again in the future.
In this section, I will focus on using `knitr` and RMarkdown files.
## Markdown
R Markdown files are mostly written using Markdown. To write R Markdown files, you need to understand what markup languages like Markdown are and how they work.
In Word and other word processing programs you have used, you can add formatting using buttons and keyboard shortcuts (e.g., "Ctrl-B" for bold). The file saves the words you type. It also saves the formatting, but you see the final output, rather than the formatting markup, when you edit the file (WYSIWYG -- what you see is what you get).
In markup languages, on the other hand, you mark up the document directly to show what formatting the final version should have (e.g., you type `**bold**` in the file to end up with a document with **bold**).
Examples of markup languages include:
- HTML (HyperText Markup Language)
- LaTeX
- Markdown (a "lightweight" markup language)
For example, Figure \@ref(fig:htmlexample) shows some marked-up HTML code from CSU's website, while Figure \@ref(fig:renderedexample) shows how that file looks when it's rendered by a web browser.
```{r htmlexample, echo = FALSE, out.width = "\\textwidth", fig.align = "center", fig.cap = "Example of the source of an HTML file."}
knitr::include_graphics("figures/example_html.png")
```
```{r renderedexample, echo = FALSE, out.width = "\\textwidth", fig.align = "center", fig.cap = "Example of a rendered HTML file."}
knitr::include_graphics("figures/example_output.png")
```
To write a file in Markdown, you'll need to learn the conventions for creating formatting. This table shows what you would need to write in a flat file for some common formatting choices:
```{r echo = FALSE}
markdown_format <- data.frame(Code = c("`**text**`",
"`*text*`",
"`[text](www.google.com)`",
"`# text`",
"`## text`"),
Rendering = c("**text**",
"*text*",
"[text](www.google.com)",
"",
""),
Explanation = c("boldface",
"italicized",
"hyperlink",
"first-level header",
"second-level header"))
knitr::kable(markdown_format)
```
Some other simple things you can do in Markdown include:
- Lists (ordered or bulleted)
- Equations
- Tables
- Figures from file
- Block quotes
- Superscripts
For more Markdown conventions, see [RStudio's R Markdown Reference Guide](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf) (link also available through "Help" in RStudio).
## Literate programming in R
**Literate programming**, an idea developed by Donald Knuth, mixes code that can be executed with regular text. The files you create can then be rendered, to run any embedded code. The final output will have results from your code and the regular text.
The `knitr` package can be used for literate programming in R. In essence, `knitr` allows you to write an R Markdown file that can be rendered into a pdf, Word, or HTML document.
Here are the basics of opening and rendering an R Markdown file in RStudio:
- To open a new R Markdown file, go to "File" -> "New File" -> "R Markdown..." -> for now, choose a "Document" in "HTML" format.
- This will open a new R Markdown file in RStudio. The file extension for R Markdown files is ".Rmd".
- The new file comes with some example code and text. You can run the file as-is to try out the example. You will ultimately delete this example code and text and replace it with your own.
- Once you "knit" the R Markdown file, R will render an HTML file with the output. This is automatically saved in the same directory where you saved your .Rmd file.
- Write everything besides R code using Markdown syntax.
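The "Knit" button is the point-and-click route to rendering. As a sketch of doing the same thing from code, the `render` function in the `rmarkdown` package can render an .Rmd file from the R console or a script. The file name `example_report.Rmd` and its contents are hypothetical, and the sketch assumes that the `rmarkdown` package and pandoc are available on your computer.

```{r eval = FALSE}
# Write a small, hypothetical R Markdown file to a temporary directory
rmd_file <- file.path(tempdir(), "example_report.Rmd")
writeLines(c("---",
             "title: \"Example report\"",
             "output: html_document",
             "---",
             "",
             "Some text in **Markdown**."),
           con = rmd_file)

# Render the file to HTML (only if rmarkdown and pandoc are available)
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, output_format = "html_document",
                                 quiet = TRUE)
  file.exists(html_file)
}
```

Rendering from code can be useful when a report is one step in a longer script, since it removes a "by hand" step.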
To include R code in an RMarkdown document, you need to separate off the code chunk using the following syntax:
`r ''````{r}
my_vec <- 1:10
```
This syntax tells R how to find the start and end of pieces of R code when the file is rendered. R will walk through, find each piece of R code, run it and create output (printed output or figures, for example), and then pass the file along to another program to complete rendering (e.g., TeX for pdf files).
You can specify a name for each chunk, if you'd like, by including it after "r" when you begin your chunk. For example, to give the name `load_nepali` to a code chunk that loads the `nepali` dataset, specify that name in the start of the code chunk: \bigskip
`r ''````{r load_nepali}
library(faraway)
data(nepali)
```
Here are a couple of tips for naming code chunks:
- Chunk names must be unique across a document.
- Any chunks you don't name are given numbers by `knitr`.
You do not have to name each chunk. However, there are some advantages:
- It will be easier to find any errors.
- You can use chunk labels to cross-reference the figures those chunks create.
- You can reference chunks later by name.
You can add options when you start a chunk. Many of these options can be set as TRUE / FALSE and include:
```{r echo = FALSE}
chunk_opts <- data.frame(Option = c("`echo`",
"`eval`",
"`message`",
"`warning`",
"`include`"),
Action = c("Print out the R code?",
"Run the R code?",
"Print out messages?",
"Print out warnings?",
"If FALSE, run code, but don't print code or results"))
knitr::kable(chunk_opts)
```
Other chunk options take values other than TRUE / FALSE. Some you might want to include are:
```{r echo = FALSE}
chunk_opts2 <- data.frame(Option = c("`results`",
"`fig.width`",
"`fig.height`"),
Action = c("How to print results (e.g., `hide` runs the code, but doesn't print the results)",
"Width to print your figure, in inches (e.g., `fig.width = 4`)",
"Height to print your figure"))
pander::pander(chunk_opts2, split.cells = c(10, 50),
justify = c("center", "left"))
```
Add these options in the opening brackets and separate multiple ones with commas:
`r ''````{r message = FALSE, echo = FALSE}
nepali[1, 1:3]
```
I will cover other chunk options later, once you've gotten the chance to try writing R Markdown files.
You can set "global" options at the beginning of the document. This will create new defaults for all of the chunks in the document. For example, if you want `echo`, `warning`, and `message` to be `FALSE` by default in all code chunks, you can run:
`r ''````{r global_options}
knitr::opts_chunk$set(echo = FALSE, message = FALSE,
warning = FALSE)
```
If you set both global and local chunk options, the options that you set specifically for a chunk will take precedence over the global options. For example, running a document with:
`r ''````{r global_options}
knitr::opts_chunk$set(echo = FALSE, message = FALSE,
warning = FALSE)
```
`r ''````{r check_nepali, echo = TRUE}
head(nepali, 1)
```
**would** print the code for the `check_nepali` chunk, because the option specified for that specific chunk (`echo = TRUE`) would override the global option (`echo = FALSE`).
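If you are unsure which global defaults are currently in effect, you can look them up with `knitr::opts_chunk$get()`. The following is a sketch, assuming `knitr` is installed; it resets the defaults at the end so that it does not change how later chunks behave.

```{r eval = FALSE}
# Set new global defaults, as in the `global_options` chunk above
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

# Look up the current global default for a single option
global_echo <- knitr::opts_chunk$get("echo")
global_echo

# Return all chunk options to knitr's built-in defaults
knitr::opts_chunk$restore()
default_echo <- knitr::opts_chunk$get("echo")
default_echo
```

This only reports the global defaults. Any option written in an individual chunk's header still overrides them for that chunk.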
You can also include R output directly in your text ("inline") using backticks: \bigskip
"There are `` `r '\x60r nrow(nepali)\x60'` `` observations in the `nepali` data set. The average age is `` `r '\x60r mean(nepali$age, na.rm = TRUE)\x60'` `` months."
\bigskip
Once the file is rendered, this gives: \bigskip
"There are `r nrow(nepali)` observations in the `nepali` data set. The average age is `r mean(nepali$age, na.rm = TRUE)` months."
\bigskip
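An inline value such as a mean may print with more decimal places than you want in a sentence. One option, sketched here, is to round the value in a code chunk first and then refer to the rounded object in your inline code. The vector `ages` is a hypothetical stand-in, not the `nepali` data.

```{r eval = FALSE}
# Hypothetical ages in months, including a missing value
ages <- c(12, 31, 45, NA)

# Round the mean to one decimal place before using it inline
mean_age <- round(mean(ages, na.rm = TRUE), 1)
mean_age
```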
Here are a few tips that will help you diagnose some problems rendering R Markdown files:
- Be sure to save your R Markdown file before you run it.
- All the code in the file will run "from scratch"-- as if you just opened a new R session.
- The code will run using, as a working directory, the directory where you saved the R Markdown file.
You'll want to try out pieces of your code as you write an R Markdown document. There are a few ways you can do that:
- You can run code in chunks just like you can run code from a script (Ctrl-Return or the "Run" button).
- You can run all the code in a chunk (or all the code in all chunks) using the different options under the "Run" button in RStudio.
- All the "Run" options have keyboard shortcuts, so you can use those.
You can render R Markdown documents to other formats:
- Word
- PDF (requires that you've installed "TeX" on your computer)
- Slides (ioslides)
Click the button to the right of "Knit" to see different options for rendering on your computer.
You can freely post your RMarkdown documents at [RPubs](http://rpubs.com). If you want to post to RPubs, you need to create an account. Once you do, you can click the "Publish" button on the window that pops up with your rendered file. RPubs can also be a great place to look for interesting example code, although it sometimes can be pretty overwhelmed with MOOC homework.
If you'd like to find out more, here are two good how-to books on reproducible research in R (the CSU library has both in hard copy):
- *Reproducible Research with R and RStudio*, Christopher Gandrud
- *Dynamic Documents with R and knitr*, Yihui Xie
## Style guidelines
**R style guidelines** provide rules for how to format code in an R script. Some people develop their own style as they learn to code. However, it is easy to get in the habit of following style guidelines, and they offer some important advantages:
- Clean code is easier to read and interpret later.
- It's easier to catch and fix mistakes when code is clear.
- Others can more easily follow and adapt your code if it's clean.
- Some style guidelines will help prevent possible problems (e.g., avoiding `.` in function names).
For this course, we will use R style guidelines from two sources:
- [Google's R style guidelines](https://google.github.io/styleguide/Rguide.xml)
- [Hadley Wickham's R style guidelines](http://adv-r.had.co.nz/Style.html)
These two sets of style guidelines are very similar.
Here are a few guidelines we've already covered in class:
- Use `<-`, not `=`, for assignment.
- Guidelines for naming objects:
    + All lowercase letters or numbers
    + Use underscore (`_`) to separate words, not camelCase or a dot (`.`) (this differs for Google and Wickham style guides)
    + Have some consistent names to use for "throw-away" objects (e.g., `df`, `ex`, `a`, `b`)
- Make names meaningful
    + Descriptive names for R scripts ("random_group_assignment.R")
    + Nouns for objects (`todays_groups` for an object with group assignments)
    + Verbs for functions (`make_groups` for the function to assign groups)
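As a brief illustration of the naming guidelines above, here is a sketch using the example names from the list. The function body and the "Don't" names are made up for this illustration.

```{r eval = FALSE}
# Do
make_groups <- function(n_students) {
  sample(c("a", "b"), size = n_students, replace = TRUE)
}
todays_groups <- make_groups(6)

# Don't
MakeGroups <- function(n_students) {
  sample(c("a", "b"), size = n_students, replace = TRUE)
}
x.2 <- MakeGroups(6)
```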
### Line length
Google: **Keep lines to 80 characters or less** \medskip
To set your script pane to be limited to 80 characters, go to "RStudio" -> "Preferences" -> "Code" -> "Display", and set "Margin Column" to 80.
```{r eval = FALSE}
# Do
my_df <- data.frame(n = 1:3,
                    letter = c("a", "b", "c"),
                    cap_letter = c("A", "B", "C"))
# Don't
my_df <- data.frame(n = 1:3, letter = c("a", "b", "c"), cap_letter = c("A", "B", "C"))
```
This guideline helps ensure that your code is formatted in a way that you can see all of the code without scrolling horizontally (left and right).
### Spacing
- Binary operators (e.g., `<-`, `+`, `-`) should have a space on either side
- A comma should have a space after it, but not before.
- Colons should not have a space on either side.
- Put spaces before and after `=` when assigning parameter arguments
```{r eval = FALSE}
# Do
shots_per_min <- worldcup$Shots / worldcup$Time
# Don't
shots_per_min<-worldcup$Shots/worldcup$Time
# Do
ave_time <- mean(worldcup[1:10, "Time"])
# Don't
ave_time<-mean(worldcup[1 : 10 ,"Time"])
```
### Semicolons
Although you can use a semicolon to put two lines of code on the same line, you should avoid it.
```{r eval = FALSE}
# Do
a <- 1:10
b <- 3
# Don't
a <- 1:10; b <- 3
```
### Commenting
- For a comment on its own line, use `#`. Follow with a space, then the comment.
- You can put a short comment at the end of a line of R code. In this case, put two spaces after the end of the code, one `#`, and one more space before the comment.
- If it helps make it easier to read your code, separate sections using a comment character followed by many hyphens (e.g., `#------------`). Anything after the comment character is "muted".
```{r eval = FALSE}
# Read in health data ---------------------------
# Clean exposure data ---------------------------
```
### Indentation
Google:
- Within function calls, line up new lines with first letter after opening parenthesis for parameters to function calls:
Example:
```{r eval = FALSE}
# Relabel sex variable
nepali$sex <- factor(nepali$sex,
                     levels = c(1, 2),
                     labels = c("Male", "Female"))
```
### Code grouping
- Group related pieces of code together.
- Separate blocks of code with blank lines.
```{r eval = TRUE}
# Load data
library(faraway)
data(nepali)

# Relabel sex variable
nepali$sex <- factor(nepali$sex,
                     levels = c(1, 2),
                     labels = c("Male", "Female"))
```
Note that this grouping often happens naturally when using tidyverse functions, since they encourage piping (`%>%` and `+`).
### Broader guidelines
- Omit needless code.
- Don't repeat yourself.
We'll learn more about satisfying these guidelines when we talk about writing your own functions in the next part of the class.
## More with knitr
### Equations in `knitr`
You can write equations in RMarkdown documents by setting them apart with dollar signs (`$`). For an equation on a line by itself (**display equation**), put two `$`s before and after the equation, on separate lines, then use LaTeX syntax to write the equation.
To help with this, you may want to use this [LaTeX math cheat sheet](http://reu.dimacs.rutgers.edu/Symbols.pdf). You may also find an online LaTeX equation editor like [Codecogs.com](https://www.codecogs.com/latex/eqneditor.php) helpful.
Note: Equations denoted this way will always compile for pdf documents, but won't always come through on Markdown files (for example, GitHub won't compile math equations).
For example, writing this in your R Markdown file:
$$
E(Y_{t}) \sim \beta_{0} + \beta_{1}X_{1}
$$
will result in this rendered equation:
$$
E(Y_{t}) \sim \beta_{0} + \beta_{1}X_{1}
$$
To put math within a sentence (**inline equation**), just use one `$` on either side of the math. For example, you could write this in an R Markdown file:
"We are trying to model $E(Y_{t})$."
In the rendered document, this will show up as: \medskip
"We are trying to model $E(Y_{t})$."
### Figures from file
You can include not only figures that you create with R, but also figures that you have saved on your computer.
The best way to do that is with the `include_graphics` function in `knitr`:
```{r out.width = "0.3\\textwidth", fig.align = "center"}
library(knitr)
include_graphics("figures/CSU_ram.png")
```
This example would include a figure with the filename "CSU_ram.png" that is saved in the "figures" sub-directory of the directory where your .Rmd file is saved. Don't forget that you will need to give either an absolute path or the relative path **from the directory where the .Rmd file is saved**.
### Saving graphics files
You can save figures that you create in R. Typically, you won't need to save figures for an R Markdown file, since you can include figure code directly. However, you will sometimes want to save a figure from a script. You have two options:
- Use the "Export" choice in RStudio
- Write code to export the figure in your R script
To make your research more reproducible, use the second choice.
To use code to export a figure you created in R, take three steps:
1. Open a graphics device (e.g., `pdf("MyFile.pdf")`).
2. Write the code to print your plot.
3. Close the graphics device using `dev.off()`.
For example, the following code would save a scatterplot of time versus passes as a pdf named "MyFigure" in the "figures" subdirectory of the current working directory:
```{r eval = FALSE}
pdf("figures/MyFigure.pdf", width = 8, height = 6)
ggplot(worldcup, aes(x = Time, y = Passes)) +
  geom_point(aes(color = Position)) +
  theme_bw()
dev.off()
```
If you create multiple plots before you close the device, they'll all save to different pages of the same pdf file.
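For instance, the following sketch opens one pdf device, draws two plots, and then closes the device, which should give a two-page file. It uses base R graphics and a temporary file, both chosen only so the example can run without extra packages or data.

```{r eval = FALSE}
# Open a pdf graphics device that writes to a temporary file
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file, width = 8, height = 6)

# Each new plot starts a new page in the same file
plot(1:10, (1:10)^2, main = "First plot")
hist(c(1, 2, 2, 3, 3, 3), main = "Second plot")

# Close the device to finish writing the file
dev.off()
```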
You can open a number of different graphics devices. Here are some of the functions you can use to open graphics devices:
- `pdf`
- `png`
- `bmp`
- `jpeg`
- `tiff`
- `svg`
You will use a device-specific function to open a graphics device (e.g., `pdf`). However, you will always close these devices with `dev.off`.
Most of the functions to open graphics devices include parameters like `height` and `width`. These can be used to specify the size of the output figure. The units for these depend on the device (e.g., inches for `pdf`, pixels by default for `png`). Use the helpfile for the function to determine these details.
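As a sketch of how these units differ, the example below saves the same plot with `pdf` (sized in inches) and with `png` (sized in pixels by default, or in inches if you set the `units` and `res` arguments). The temporary file paths are for illustration only. Because `png` is not available on every system, the example checks `capabilities("png")` first.

```{r eval = FALSE}
# pdf: width and height are given in inches
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file, width = 8, height = 6)
plot(1:10)
dev.off()

# png: width and height are in pixels by default; to size the figure
# in inches, set `units = "in"` and a resolution (`res`, pixels per inch)
png_file <- tempfile(fileext = ".png")
if (capabilities("png")) {
  png(png_file, width = 8, height = 6, units = "in", res = 300)
  plot(1:10)
  dev.off()
}
```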
### Tables in R Markdown
If you want to create a nice, formatted table from an R dataframe, you can do that using `kable` from the `knitr` package.
```{r}
my_df <- data.frame(letters = c("a", "b", "c"),
                    numbers = 1:3)
kable(my_df)
```
There are a few options for the `kable` function:
```{r echo = FALSE, results = 'asis'}
kable_opts <- data.frame(arg = c("`col.names`",
"`align`",
"`caption`",
"`digits`"),
expl = c("Column names (default: column name in the dataframe)",
"A vector giving the alignment for each column ('l', 'c', 'r')",
"Table caption",
"Number of digits to round to. If you want to round columns different amounts, use a vector with one element for each column."))
pander::pandoc.table(kable_opts, split.cells = c(10, 50),
justify = "rl", style = "multiline")
```
```{r}
my_df <- data.frame(letters = c("a", "b", "c"),
                    numbers = rnorm(3))
kable(my_df, digits = 2, align = c("r", "c"),
      caption = "My new table",
      col.names = c("First 3 letters",
                    "First 3 numbers"))
```
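The `digits` argument can also take a vector, with one value per column, as noted in the table of options above. Here is a minimal sketch with a made-up data frame, assuming `knitr` is installed:

```{r eval = FALSE}
library(knitr)

# Hypothetical data frame with two numeric columns
my_measures <- data.frame(height = c(1.2345, 2.3456),
                          weight = c(10.5678, 20.6789))

# Round the first column to 1 digit and the second to 3 digits
rounded_table <- kable(my_measures, digits = c(1, 3))
rounded_table
```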
From Yihui:
>"**Want more features?** No, that is all I have. You should turn to other packages for help. I'm not
going to reinvent their wheels."
If you want to do fancier tables, you may want to explore the `xtable` and `pander` packages. As a note, these might both be more effective when compiling to pdf, rather than html.
## In-course exercise
For all of today's tasks, you'll use the code from last week's in-course exercise to do the exercises. This week we are not focusing on writing new code, but rather on how to take R code and put it in an R Markdown file, so we can create reports from files that include the original code.
### Creating a Markdown document
First, you'll create a Markdown document, without any R code in it yet.
In RStudio, go to "File" -> "New File" -> "R Markdown". From the window that opens, choose "Document" in the left-hand column and "HTML" as the output format. A new file will open in the script pane of your RStudio session. Save this file (you may pick the name and directory). The file extension should be ".Rmd".
First, before you try to write your own Markdown, try rendering the example that the script includes by default. (This code is always included, as a template, when you first open a new RMarkdown file using the RStudio "New file" interface we used in this example.) Try rendering this default R Markdown example by clicking the "Knit" button at the top of the script file.
Some of you may not yet have everything you need on your computer to get this to work. If so, let me know. RStudio usually includes all the necessary tools when you install it, but there may be some exceptions.
If you could get the document to knit, do the following tasks:
- Look through the HTML document that was created. Compare it to the R Markdown script that created it, and see if you can understand, at least broadly, what's going on.
- Look in the directory where you saved the R Markdown file. You should now also see a new, .html file in that folder. Try opening it with a web browser like Safari.
- Go back to the R Markdown file. Delete everything after the initial header information (everything after the 6th line). In the header information, make sure the title, author, and date are things you're happy with. If not, change them.
- Using Markdown syntax, write up a description of the data (`worldcup`) we used last week to create the fancier figure. Try to include the following elements:
    + Bold and italic text
    + Hyperlinks
    + A list, either ordered or bulleted
    + Headers
### Adding in R code
Now incorporate the R code from previous weeks' exercises into your document. Once you get the document to render with some basic pieces of code in it, try the following:
- Try some different chunk options. For example, try setting `echo = FALSE` in some of your code chunks. Similarly, try using the options `results = "hide"` and `include = FALSE`.
- You should have at least one code chunk that generates figures. Try experimenting with the `fig.width` and `fig.height` options for the chunk to change the size of the figure.
- Try using the global options. See if you can switch the `echo` default value for this document from TRUE (the usual default) to FALSE.
### Working with R Markdown documents
Finally, try the following tasks to get some experience working with R Markdown files in RStudio:
- Go to one of your code chunks. Locate the small gray arrow just to the left of the line where you initiate the code chunk. Click on it and see what happens. Then click on it again.
- Put your cursor inside one of your code chunks. Try using the "Run" button (or Ctrl-Return) to run code in that chunk at your R console. Did it work?
- Pick a code chunk in your document. Put your cursor somewhere in the code in that chunk. Click on the "Run" button and choose "Run All Chunks Above". What did that do? If it did not work, what do you think might be going on? (Hint: Check `getwd()` and think about which directory you've used to save your R Markdown file.)
- Pick another chunk of code. Put the cursor somewhere in the code for that chunk. Click on the "Run" button and choose "Run Current Chunk". Then try "Run Next Chunk". Try to figure out all the options the "Run" button gives you and when each might be useful.
- Click on the small gray arrow to the right of the "Knit HTML" button. If the option is offered, select "Knit Word" and try it. What does this do?
### R style guidelines
Go through all the R code in your R Markdown file. Are there any places where your code is not following style conventions for R? Clean up your code to correct any of these issues.