generated from rstudio/bookdown-demo
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-graphs.Rmd
205 lines (169 loc) · 10.9 KB
/
02-graphs.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
---
output:
pdf_document: default
html_document: default
---
# Effective use of graphs
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, comment=NA, fig.align='center', fig.width=5.5,fig.height=5.5)
```
## General principles
- Focus the eye on features that are important
- Avoid distracting features
- Lines that are intended to attract attention can be thickened
- Where points should be the focus, make them large & dark
- It often makes sense to de-emphasize the axes.
- If points are numerous and there is substantial overlap, use open symbols, and/or use symbols that have some degree of transparency.
- Different choices of color palettes are effective for different purposes.
- Bear in mind that the eye has difficulty in focusing simultaneously on widely separated colors that are close together on the same graph.
## Varying time intervals --- show rates, not counts
A graph that was essentially the solid segmented solid line in Figure \@ref(fig:Nobel) appeared in @national1975science "Science Indicators, 1974". The segmented line gives a highly misleading impression for the four years 1971-1974, as opposed to earlier points, where numbers are totals over decades. It joins up a final point that is a different measure from earlier points.
The gray dots, and the axis on the right, show rates per year, thus comparing like with like.
```{r cap1, echo=FALSE}
cap1 <- "The black line shows numbers of US Nobel prizes, for given time intervals. The gray dots. with the right axis label, show average per year."
```
```{r Nobel, fig.width=6.5, fig.asp=0.575, fig.show="hold", out.width="80%", fig.pos='H', fig.cap=cap1}
par(mar=c(4.1,3.1,2.1,3.1),mgp=c(2.25,0.5,0))
n2 <- c(1,1,4,12,15,29,27,11)
n1 <- c(1,1,4,12,15,29,27,39)
plot(1:8, n2, xaxt="n", xlab="", ylim=range(n1), lwd=2,
ylab="Number of Nobel prizes", type="l", fg='gray')
axis(1, at=1:8, labels=paste0(seq(from=1901, to=1971, by=10),"-"),
cex.axis=0.9, gap.axis=0.15, col="gray")
axis(1, at=1:8, labels=paste0(c(seq(from=1910, to=1970, by=10),1974)),
line=0.8, lty=0, cex.axis=0.9, col="gray")
axis(1, at=1:7, labels=FALSE, col="gray")
mtext(side=3, line=0.5, at=0.5, "Total prizes (black line)", adj=0, cex=1.25)
mtext(side=3, line=0.5, at=8.5, "Nobel prizes per year (gray dots)", col='gray60', adj=1, cex=1.25)
points(1:8, n1, lwd=8, col="gray", pch=1)
lines(7:8, tail(n1,2), lty=2,lwd=2, col="gray")
axis(4, at=(0:4)*10, labels=paste(0:4),
col.ticks="gray", col='gray')
mtext(side=4, line=1.8, "Nobel prizes per year", col='gray60')
```
The same principle applies for intervals of measures other than time --- for example of length or volume.
## Banking --- the importance of aspect ratio
```{r cap2, echo=FALSE}
cap2 <- "The same data are used for both graphs. The pattern that is not
obvious in Panel A is very obvious in Panel B"
```
```{r Banking, fig.width=8, fig.height=3.5, out.width="100%", fig.show="hold", fig.pos="H", fig.cap=cap2}
suppressPackageStartupMessages(library(latticeExtra))
trellis.par.set(theme)
gph1 <- xyplot(sin((1:30)*0.92) ~ I((1:30)*0.92), xlab="", ylab="",
## main=list("A: Pattern is hidden to view",
## x=0.1, y=-0.5, just='left', font=1),
scales=list(y=list(at=c(-1,0,1))))
gph2 <- xyplot(sin((1:30)*0.92) ~ I((1:30)*0.92), aspect='xy',
xlab="", ylab="", ylim=c(-9,9),
## main=list("B: Pattern is now obvious",
## x=0.1, y=-0.5, just='left', font=1)
scales=list(y=list(at=seq(from=-8,to=8,by=2)))
)
update(c("A: Pattern is hidden to view"=gph1,
"B: Pattern is now obvious"=gph2, y.same=FALSE), between=list(x=0.5,y=0))
```
Patterns of change on the horizontal scale that it is important to identify should bank at an angle of roughly 45^o^ above or below the horizontal. Angles in the approximate range 20^o^ to 70^o^ may be satisfactory, and the aspect ratio should be chosen accordingly.
## Scales that show changes by equal multipliers
Figure \@ref(fig:animals) shows two plots of the same data. Panel A plots brain weight (grams) against body weight (kilograms), for 28 "animals". Panel B plots the same data, but now equal distances on each scale show changes by the same factor (i.e., change in relative weight).
```{r cap3, echo=FALSE}
cap3 <- "Panel A plots brain weight (grams) against body weight
(kilograms), for 28 'animals'. Panel B plots the same data,
with relative weight scales, i.e., equal distances on each scale
show changes by the same multiplier."
```
```{r animals, fig.width=8, fig.height=3.5, out.width="100%", fig.show="hold", fig.pos="H", fig.cap=cap3}
Animals <- MASS::Animals
gph1 <- xyplot(brain ~ body, data=Animals,
xlab="Body weight (k)", ylab="Brain weight (g)",
main=list("A: Scales show weights", x=0.1, y=-0.5, just='left', font=1))
print(gph1, position=c(0,0,0.525,1), more=TRUE)
labs <- c('Asian elephant','African elephant','Human','Dipliodocus',
'Triceratops','Brachiosaurus','Mouse')
atx <- c(.1,1,10,100,1000,10000,100000)
aty <- c(1,10,100,1000,10000)
Animlabs <- subset(Animals, rownames(Animals)%in%labs)
gph2 <- xyplot(brain ~ body, data=Animals,
scales=list(log='e', x=list(at=atx,labels=paste(atx)),
y=list(at=aty,labels=paste(aty))),
xlab="Body weight (kg)", ylab=" ",
main=list("B: Relative weight scales",
x=0.1, y=-0.5, just='left', font=1)) +
latticeExtra::layer(data=Animlabs,
panel.text(log(body),log(brain)+c(-0.1,-0.12,0,0.2,0.15,0,0),
cex=0.75, labels=rownames(Animlabs), pos=c(c(rep(2,5),4,2))))
print(gph2, position=c(0.475,0,1,1))
```
Often, when measurement data span a large range (e.g., a change from smallest to largest by a factor of 100 to 1 or more), it is a relative amount scale that is appropriate.[^02-graphs-1]
[^02-graphs-1]: Technically, such scales are termed logarithmic, as opposed to straight line or linear. A logarithmic transformation is used to obtain such relative distance scales.
## Different graphs serve different purposes {#track}
The line in Figure \@ref(fig:resid)A shows the broad overall pattern, while Figure \@ref(fig:resid)B shows how that pattern needs to be tweaked to more closely reflect the data.
```{r cap4, echo=FALSE}
cap4 <- "Panel A plots world record Time (as of 2006, in minutes)
against Distance (in kilometers), for field races. On both the
`x` and `y` axes, the scale is one on which equal distances on
the scale correspond to equal relative changes. Panel B plots
deviations from the fitted line in the `y` direction, otherwise
known as residuals, against Distance. The deviations are
approximate relative differences from the line. Thus a 0.05
difference is a difference that amounts to 5\\% of the time
predicted by the line."
```
```{r resid, fig.width=8, fig.height=3.25, out.width="100%", fig.show="hold", fig.pos="H", fig.cap=cap4}
atx <- c(.1,.3,1,3,10,30,100)
aty=c(.1,.3,1,3,10,30,100,300)
track <- subset(DAAG::worldRecords, roadORtrack=='track')
gph1 <- xyplot(Time ~ Distance, type=c('p','r'), data=track,
xlab="Distance (km)", ylab="Time (min)",
scales=list(log='e', x=list(at=atx,labels=paste(atx)),
y=list(at=aty,labels=paste(aty))),
main=list("A: Time versus Distance", x=0.1, y=-0.5, just='left', font=1)) +
latticeExtra::layer(panel.axis(side='top', at=log(c(.2,2,20)),
labels=c('200m',"2km",'20km'), text.col="gray50", line.col='gray50',line.lty=2,
half=FALSE, rot=0))
print(gph1, position=c(0,0,0.5,1), more=TRUE)
track.res <-residuals(lm(log(Time)~log(Distance), data=track))
gph2 <- xyplot(track.res ~ Distance, type=c('p','r'), data=track,
scales=list(x=list(log='e', at=atx,labels=paste(atx))),
xlab="Distance (km)", ylab="Deviations from fitted line",
main=list("B: Residuals versus Distance", x=0.1, y=-0.5, just='left', font=1)) +
latticeExtra::layer(panel.axis(side='top', at=log(c(.2,2,20)),
labels=c('200m',"2km",'20km'), text.col="gray50", line.col='gray50',line.lty=2,
half=FALSE, rot=0))
print(gph2, position=c(0.5,0,1,1))
```
Notice, in Panel A, the use of scales for which which equal distances on the scale correspond to equal relative changes. This is achieved by specifying logarithmic scales, on both axes. There is a loglinear, i.e., straight line on logarithmic scales, relationship.
In Figure \@ref(fig:resid), the line looks to be a good fit. The range of times is however large, from just under 10 seconds to close to 11.5 hours. All except the largest difference from the fitted line are a less than 7% change, and are not at all obvious in Panel A. There is a very clear pattern of systematic differences in Panel B that reflects differences in human physiology, very likely between the athletes who excel at the different distances.
The line can be interpreting as implying a 13% increase in the time per unit distance for every unit increase in the distance. The units may for example be units of 100 meters, or kilometers. Panel B indicates that the pattern of increase moves down to a local minimum at around 200 meters, up to a local maximum at around 1 kilometer, down again to a local minimum at around 20 kilometers, and then steadily up again.
### Relative distance scales {.unnumbered}
Figure \@ref(fig:lnscales) shows different "equal physical distance along the scale" labels that might be used for the relative `Distance` ("logarithmic") scale in Figure \@ref(fig:resid) in Subsection \@ref(track).
```{r cap5, echo=FALSE}
cap5 <- "Different labelings, all with tick marks at the same
relative distance apart, are shown for the `Distance` scale.
The multipliers for the `Distance` values that are plotted are,
starting at the bottom, 2, 4, 5, and 10."
```
```{r lnscales, fig.width=5.7, fig.height=3.75, out.width="60%", fig.show="hold", fig.pos="H", fig.cap=cap5, echo=FALSE, fig.align='center'}
atx <- c(.1,.3,1,3,10,30,100)
track <- subset(DAAG::worldRecords, roadORtrack=='track')
rx <- range(track$Distance)
atx10 <- 10^seq(from = floor(log10(rx[1])), to=ceiling(log10(rx[2])))
atx2 <- 2^seq(from = floor(log(rx[1],base=2)), to=ceiling(log(rx[2],base=2)))
lab2 <- gsub('0.','.',paste(atx2), fixed=T)
atx4 <- 4^seq(from = floor(log(rx[1],base=4)), to=ceiling(log(rx[2],base=4)))
atx5 <- 5^seq(from = floor(log(rx[1],base=5)), to=ceiling(log(rx[2],base=5)))
plot(log(rx), c(0,9), type='n', mgp=c(0,0,0), axes=FALSE, xlab='', ylab='')
axis(1, at=log(atx10), labels=paste(atx10), pos=9)
axis(1, at=log(atx5), labels=paste(atx5), pos=6)
axis(1, at=log(atx4), labels=paste(atx4), pos=3)
axis(1, at=log(atx2), labels=lab2, pos=0, gap.axis=0.1)
axis(2, at=c(0,3,6,9), lwd=0, lwd.ticks=0,
labels=expression('' %*% 2,'' %*% 4,'' %*% 5,'' %*% 10), las=1)
```
## Helpful web links are:
- Good & bad graphs (Ihaka, lecture notes)[^02-graphs-2]
- Misleading graphs[^02-graphs-3]
- Color Brewer[^02-graphs-4]
[^02-graphs-2]: https://www.stat.auckland.ac.nz/\~ihaka/120/Lectures/lecture03.pdf
[^02-graphs-3]: https://www.statisticshowto.com/misleading-graphs/
[^02-graphs-4]: https://colorbrewer2.org/