---
title: "R Work Flows and Data Visualization"
author: "Martin Morgan <[email protected]> & Sean Davis <[email protected]>"
vignette: >
% \VignetteIndexEntry{A.4 -- R Workflows and Data Visualization}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style-A4, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
# Using _R_ in real life
## Organizing work
Usually, work is organized into a directory with:
- A folder containing _R_ scripts (`scripts/BRFSS-visualize.R`)
- 'External' data like the csv files that we've been working with,
usually in a separate folder (`extdata/BRFSS-subset.csv`)
- (sometimes) _R_ objects written to disk using `saveRDS()` (`.rds`
files) that represent final results or intermediate 'checkpoints'
(`extdata/ALL-cleaned.rds`). Read the data into an _R_ session using
`readRDS()`.
- Use `setwd()` to navigate to the folder containing the `scripts/` and
`extdata/` folders.
- Source an entire script with `source("scripts/BRFSS-visualize.R")`.
_R_ can also save the state of the current session (it prompts when
you choose to `quit()` _R_), and can view and save the `history()` of
the current session; I do not find these to be helpful in my own
workflows.
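
The layout described above can be tried out without touching a real project. The following is a minimal sketch, assuming a throw-away directory under `tempdir()`; the names `demo-project`, `demo-cleaned.rds`, `demo-summary.R`, and `demo_mean` are made up for illustration and are not part of the course material.

```{r}
## Hypothetical project layout, created in a temporary directory
proj <- file.path(tempdir(), "demo-project")
dir.create(file.path(proj, "scripts"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(proj, "extdata"), showWarnings = FALSE)

## Save an intermediate 'checkpoint' and read it back
cleaned <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))  # made-up data
rds_path <- file.path(proj, "extdata", "demo-cleaned.rds")
saveRDS(cleaned, rds_path)
restored <- readRDS(rds_path)
all.equal(cleaned, restored)

## Write a one-line script, then run the whole script with source()
script_path <- file.path(proj, "scripts", "demo-summary.R")
writeLines("demo_mean <- mean(restored$value)", script_path)
source(script_path, local = TRUE)  # evaluate in the calling environment
demo_mean
```

Building paths with `file.path()` rather than relying on `setwd()` is a design choice made here so that the sketch runs from any working directory.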
## _R_ Packages
All the functionality we have been using comes from _packages_
that are automatically _loaded_ when _R_ starts. Loaded packages are on
the `search()` path.
```{r search}
search()
```
Additional packages may be _installed_ in _R_'s libraries. Use
`installed.packages()` or the _RStudio_ interface to see installed
packages. To use these packages, it is necessary to attach them to the
search path, e.g., for survival analysis
```{r}
library("survival")
```
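
One way to see what attaching does is to compare the search path before and after a call to `library()`. This sketch uses the `splines` package only because it ships with _R_ and is usually not attached at start-up; the variable name `search_before` is illustrative.

```{r}
search_before <- search()          # search path before attaching
library("splines")                 # ships with R; normally not attached by default
setdiff(search(), search_before)   # entries added by library(), if any
```

If the package was already attached, `setdiff()` returns an empty character vector.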
There are many thousands of _R_ packages, and not all of them are
installed in a single installation. Important repositories are
- CRAN: https://cran.r-project.org/
- Bioconductor: https://bioconductor.org/packages
Packages can be discovered in various ways, including
[CRAN Task Views][] and the [_Bioconductor_ web][] and
[_Bioconductor_ support][] sites.
To install a package, use `install.packages()` or, for _Bioconductor_
packages, instructions on the package landing page, e.g., for
[GenomicRanges][]. Here we install the [ggplot2][] package.
```{r eval=FALSE}
install.packages("ggplot2", repos="https://cran.r-project.org")
```
A package needs to be installed once, and then can be used in any _R_
session.
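
Because installation is a one-time step, scripts sometimes install a package only when it is missing. The following is a minimal sketch of that idiom; `install_if_missing()` is a hypothetical helper written for this example, not a function provided by _R_ or any package. It is called here on `stats`, which is always available, so nothing is actually installed.

```{r}
## Hypothetical helper: install 'pkg' only if it cannot be loaded
install_if_missing <- function(pkg, repos = "https://cran.r-project.org") {
    ## requireNamespace() returns TRUE when the package is available
    if (requireNamespace(pkg, quietly = TRUE))
        return(invisible(FALSE))       # already installed; nothing to do
    install.packages(pkg, repos = repos)
    invisible(TRUE)                    # an installation was attempted
}

installed_now <- install_if_missing("stats")
installed_now
```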
[CRAN Task Views]: https://cran.r-project.org/web/views/
[_Bioconductor_ web]: https://bioconductor.org
[_Bioconductor_ support]: https://support.bioconductor.org
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[ggplot2]: https://cran.r-project.org/package=ggplot2
# Graphics and Visualization
Load the `BRFSS-subset.csv` data.
<!--
```{r}
path <- "BRFSS-subset.csv"
brfss <- read.csv(path)
```
-->
```{r, eval=FALSE}
path <- "BRFSS-subset.csv" # or file.choose()
brfss <- read.csv(path)
```
Clean it by coercing `Year` to a factor, so that it is treated as a
category rather than a number.
```{r}
brfss$Year <- factor(brfss$Year)
```
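
To see what the coercion changes, here is a small sketch on a made-up vector (the values and the names `years` and `year_f` are illustrative, not taken from the BRFSS data).

```{r}
years <- c(1990, 2010, 2010, 1990, 2010)  # made-up values
summary(years)                            # numeric: quantiles and mean

year_f <- factor(years)
levels(year_f)                            # the distinct categories
table(year_f)                             # counts per category
year_f == 2010                            # comparison is against the level labels
```

A factor also lets plotting functions treat the variable as a grouping rather than as a continuous scale.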
## Base _R_ Graphics
Useful for quick exploration during a normal work flow.
- Main functions: `plot()`, `hist()`, `boxplot()`, ...
- Graphical parameters -- see `?par`, but often provided as arguments
to `plot()`, etc.
- Construct complicated plots by layering information, e.g., points,
regression line, annotation.
```{r}
brfss2010Male <- subset(brfss, (Year == 2010) & (Sex == "Male"))
fit <- lm(Weight ~ Height, brfss2010Male)
plot(Weight ~ Height, brfss2010Male, main="2010, Males")
abline(fit, lwd=2, col="blue")
points(180, 90, pch=20, cex=3, col="red")
```
- Approach to complicated graphics: create a grid of panels (e.g.,
`par(mfrow=c(1, 2))`), populate with plots, restore original layout.
```{r}
brfssFemale <- subset(brfss, Sex=="Female")
opar <- par(mfrow=c(2, 1)) # layout: 2 'rows' and 1 'column'
hist( # first panel -- 1990
brfssFemale[ brfssFemale$Year == 1990, "Weight" ],
main = "Female, 1990")
hist( # second panel -- 2010
brfssFemale[ brfssFemale$Year == 2010, "Weight" ],
main = "Female, 2010")
par(opar) # restore original layout
```
## What makes for a good graphical display?
- Common scales for comparison
- Efficient use of space
- Careful color choice -- qualitative, gradient, divergent schemes;
color blind aware; ...
- Emphasis on data rather than labels
- Convey statistical uncertainty
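
As one illustration of the 'common scales' point, the sketch below draws two stacked histograms that share the same bins and the same y-axis limits, so that the panels can be compared directly. The data are simulated with `rnorm()` (they are not BRFSS values), and all variable names are made up for this example.

```{r}
set.seed(123)
sim1990 <- rnorm(200, mean = 65, sd = 12)  # simulated weights
sim2010 <- rnorm(200, mean = 72, sd = 15)  # simulated weights

## Shared bins covering both groups
common_breaks <- pretty(range(c(sim1990, sim2010)), n = 20)

## Compute the histograms first, to find a shared y-axis limit
h1990 <- hist(sim1990, breaks = common_breaks, plot = FALSE)
h2010 <- hist(sim2010, breaks = common_breaks, plot = FALSE)
common_ylim <- c(0, max(h1990$counts, h2010$counts))

opar_sim <- par(mfrow = c(2, 1))           # 2 'rows' and 1 'column'
plot(h1990, ylim = common_ylim, main = "Simulated, 1990", xlab = "Weight")
plot(h2010, ylim = common_ylim, main = "Simulated, 2010", xlab = "Weight")
par(opar_sim)                              # restore original layout
```

Without shared breaks and limits, each panel chooses its own axes, and a shift between the groups is easy to miss.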
## Grammar of Graphics: ggplot2
```{r}
library(ggplot2)
```
- http://docs.ggplot2.org
The 'grammar of graphics' approach:
- Specify data and 'aesthetics' (`aes()`) to be plotted
- Add layers (`geom_*()`) of information
```{r, warning=FALSE}
ggplot(brfss2010Male, aes(x=Height, y=Weight)) +
geom_point() +
geom_smooth(method="lm")
```
- Capture a plot and augment it
```{r, warning=FALSE}
plt <- ggplot(brfss2010Male, aes(x=Height, y=Weight)) +
geom_point() +
geom_smooth(method="lm")
plt + labs(title = "2010 Male")
```
- Use `facet_*()` for layouts
```{r, warning=FALSE}
ggplot(brfssFemale, aes(x=Height, y=Weight)) +
geom_point() + geom_smooth(method="lm") +
facet_grid(. ~ Year)
```
- Choose display to emphasize relevant aspects of data
```{r, warning=FALSE}
ggplot(brfssFemale, aes(Weight, fill=Year)) +
geom_density(alpha=.2)
```