---
title: 'Graphics'
---
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE, message=F)
library(tidyverse)
library(pander)
library(knitr)
```
# Graphics {#graphics}
```{r, echo=F}
set.seed(1234)
tibble(pictures = abs(rnorm(100, 30, 10)), words = pictures*1000+rnorm(100, 0, 2000)+runif(100, -2000, 2000)) %>%
ggplot(aes(pictures, words)) + geom_point() + geom_smooth(se=F)
```
<!-- TODO ADD @matejka2017sameand dino plots -->
Plotting and data visualisation is probably the best thing about R.
The base system alone provides lots of useful plotting functions, but the
`ggplot2` package is exceptional in the consistent and powerful approach it
takes to visualising data. This chapter focusses mostly on `ggplot`, but does
include some pointers to other useful plotting functions.
It's also worth pointing out here that the O'Reilly
[R Graphics cookbook is available as a pdf download](https://ase.tufts.edu/bugs/guide/assets/R%20Graphics%20Cookbook.pdf)
and is a much more comprehensive source than this book.
The examples below are more selective and show plots likely to be of particular
use in reporting your studies.
The emphasis is on showing you how to make _good_ plots that help you explore data
and communicate your findings, rather than simply reproduce the output of SPSS
or Excel.
## Benefits of visualising data {- #graphics-benefits}
Scientists attempt to understand natural processes, predict events in the
observable world, and communicate this understanding to others. In each case,
graphics are a powerful tool in their armoury.
<!-- The doctrine of null hypothesis testing which has infected psychology and many other fields has made some researchers nervous about the power of graphics to explore and find patterns in data: perhaps concerned with the accusation of undertaking a 'fishing trip' or capitalising on chance in ther analyses, too many scientists have eschewed graphical statistics and focussed instead on point estimates and *p* values from complex statistical models.
But as the complexity of theories group, the role of graphics becomes ever more important because the _right_ graphical presentations of our data are in many cases the only reliable way of checking the appropriateness of our models, and of communicating our findings efficiently. -->
David McCandless recently spoke at a TedX event about his work on data
journalism, and makes a persuasive case for paying more attention to
visualisations:
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/5Zg-C8AAIGg?rel=0" frameborder="0" allowfullscreen></iframe>
<!--
For a fascinating history and exploration of the good, the bad and the ugly in data visualisations Edward Tufte's book [@edward2001visual]. -->
## Which tool to use? {- #graphics-approaches}
When plotting data it pays to decide if you need:
1. A quick way to visualise something specific -- for example to check some
feature of your data before you continue your analysis.
2. A plot that is specifically designed to communicate your data effectively,
and where you care about the details of the final output.
For the first case --- for example to visualise a distribution of a single
variable, or to check diagnostics from a linear model --- there are lots of
useful built-in functions in base-R and other packages.
For the second case --- for example where you want to visualise the main
outcomes of your study, or draw attention to specific aspects of your data ---
there is `ggplot2`.
We cover case 2 first, because using ggplot highlights many important aspects of
plotting in general.
## Layered graphics with `ggplot` {- #layered-graphics}
If you've never given much thought to data visualisation before, you might be
surprised at the sheer variety of graph types available.
One way to cut through the multitude of options is to determine what the purpose
of your plot is. Although not a complete list, it's likely your plot will show
at least one of:
- Relationships
- Distributions
- Comparison
- Composition
The `ggplot` library makes it easy to produce high quality graphics which serve
these ends, and to layer them to produce information-dense plots which are
really effective forms of communication.
```{r, fig.width=8, echo=F, fig.cap="Examples of charts showing comparisons, relationships, distribution and composition. The comparison, distribution and composition plots show 2 variables, but the relationship plot includes 3, increasing the density of the information displayed. "}
comparison <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot() + ggtitle("Comparison")
relationships <- ggplot(mtcars, aes(wt, mpg, color=factor(gear))) + geom_point() + ggtitle("Relationship")
distributions <- ggplot(mtcars, aes(mpg, color=factor(gear))) + geom_density() + ggtitle("Distribution")
composition <- ggplot(mtcars, aes(factor(cyl), fill = factor(gear))) + geom_bar() + ggtitle("Composition")
mm <- theme(plot.margin=unit(rep(1.5,4), "line"))
gridExtra::grid.arrange(relationships+mm, distributions+mm, comparison+mm, composition+mm, ncol=2)
```
### A thought on 'chart chooser' guides {- .explainer}
There are various simple chart selection guides available online, of which these
are quite nice examples:
- [Chart selection guide (pdf)](http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html)
- ['Show me the numbers' chart guide (pdf)](https://www.perceptualedge.com/articles/misc/Graph_Selection_Matrix.pdf)
![](media/choosing_a_good_chart.jpg)
_However_, guides which attempt to be comprehensive and show you a full range of
plot types are perhaps not as useful as those which reflect our knowledge of
which plots are the most effective forms of communication.
For example, almost all guides to plotting, and especially R textbooks, will
show you how to plot a simple bar graph. But bar graphs have numerous
disadvantages over other plots which can show the same information.
Specifically, they:
- are low in information density (and so inefficient in use of space)
- make comparisons between multiple data series very difficult (for example in
[interaction plots](#understanding-interactions)), and
- perhaps most importantly, even when they include error bars, readers
consistently misinterpret the quantitative information in bar graphs
(specifically, when bar graphs are used to display estimates which contain
error, readers assume points above the bar are less likely than points
within the bar, even though this is typically not the case).
You should be guided in choosing plots not by mechanical rules based on the
number or type of variables you want to display. Instead, you should be guided
by the evidence from basic studies of human perception, and applied data on how
different types of information displays are really used by readers.
**This guide is restricted to examples likely to be useful and effective for
experimental and applied psychologists.**
### Thinking like `ggplot` {-}
When using `ggplot` it helps to think of five separate steps to making a plot (2
are optional, but commonly used):
1. Choose the data you want to plot.
2. Map variables to axes or other features of the plot (e.g. sizes or colours).
3. (Optionally) use `ggplot` functions to summarise your data before the plot is
drawn (e.g. to calculate means and standard errors for point-range plots).
4. Add visual display layers.
5. (Optionally) Split the plot up across multiple panels using groupings in the
data.
You can then customise the plot labels and title, and tweak other presentation
parameters, although this often isn't necessary unless sending a graphic for
publication. You can also export graphics in multiple high quality formats.
The simplest way to demonstrate these steps is with an example, and we begin
with a plot showing the relationship between variables:
### 'Relationships' {-}
Problem to be solved: _You want to check/show whether variables are related in a
linear fashion, e.g. before running linear regression_
##### Step 1: Select data to plot {-}
Step 1 is to select our data. As is typical in R, `ggplot` works best with
long-format data. In the examples below we will use the `mtcars` dataset for
convenience, so our first line of code uses the `%>%` pipe operator (from
`magrittr`, re-exported by `dplyr`) to send the `mtcars` dataset to the next
line of code:
```{r, eval=F}
mtcars %>%
...
```
##### Step 2: Map variables to axes, colours, and other features {-}
Step 2 is to map the variables we want to axes or other features of the plot
(e.g. the colours of points, or the linetypes used in line plots).
To specify these mappings we use the `aes()` function, which is slightly
cryptic, but short for 'aesthetic mappings'. Depending on the plot type you will
specify different aesthetics, and they can also have different effects depending
on the plot type, but you will commonly specify:
- `x`: the variable to use as the x axis
- `y`: the variable to use as the y axis
- `colour`: the variable to use to colour points or lines
Here we tell `ggplot` to use `disp` (engine size) on the x axis, and `mpg` on
the y axis. We also tell it to colour the points differently depending on the
value of `hp` (engine horsepower).
At this point `ggplot` will create and label the axes and plot area, but doesn't
yet display any of our data. For this we need to add visual display layers (in
the next step).
```{r}
mtcars %>%
ggplot(aes(x = disp, y = mpg, colour=hp))
```
###### Other aesthetics {- #aesthetics .explainer}
There are many other aesthetics which can be specified. Some of the most useful
are:
- `ymin` and `ymax`: for upper and lower bounds, e.g. on error bars
- `group`: tells ggplot to group observations by some variable (and, for
example, plot a different line per group)
- `fill`: like `colour` but for areas/shapes
- `alpha` and `size`: control the opacity and size of visual features (useful
for de-emphasising some features of a plot to make others stand out)
See the ggplot documentation for more details:
<http://ggplot2.tidyverse.org/reference/#section-aesthetics>
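As a rough sketch of how `group` and `alpha` behave, the chunk below draws one faint line per animal using the built-in `ChickWeight` dataset (chosen here only for illustration; it is not used elsewhere in this chapter). Note that `group` is mapped to a variable inside `aes()`, whereas `alpha` and `size` are set to fixed values inside the geoms; the name `p.group` is arbitrary.

```{r}
library(ggplot2)

# ChickWeight ships with R: repeated weight measurements for each chick
p.group <- ggplot(ChickWeight, aes(x = Time, y = weight, group = Chick)) +
  geom_line(alpha = .3) +            # one semi-transparent line per chick
  geom_point(alpha = .3, size = .5)  # small, faint points for the raw data
p.group
```

Because the lines are de-emphasised with `alpha`, the overall trend is visible even though many lines overlap.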
##### Step 3 {-}
We skip step 3 for this example (asking ggplot to automatically make summaries
of our data before plotting), but will cover it below - see the `stat_summary()`
function.
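For orientation, a minimal sketch of what step 3 can look like is shown here. It assumes we want the mean and standard error of `mpg` for each number of cylinders; `mean_se` is a summary function supplied by `ggplot2`, and the name `p.summary` is arbitrary.

```{r}
library(ggplot2)

# Summarise mpg within each level of cyl before drawing:
# mean_se returns the mean plus and minus one standard error
p.summary <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  stat_summary(fun.data = mean_se, geom = "pointrange")
p.summary
```

The raw data are never plotted here: `ggplot` computes the summary and draws only the resulting point and range for each group.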
##### Step 4: Display data {-}
To display data, we have to add a visual layer to the plot. For example, let's
say we want to make a scatter plot, and so draw points for each row of data:
```{r}
mtcars %>%
ggplot(aes(x = disp, y = mpg, colour=hp)) +
geom_point()
```
And we have a pretty slick graph: `ggplot` has now added points for each pair of
`disp` and `mpg` values, and coloured them according to the value of `hp` (see
[choosing colours below](#picking-colours)).
[Use the `airquality` dataset and create your own scatterplot and try to colour
the points using the `Month` variable. Should `Month` be used as a factor or a
numeric variable when colouring the points?]{.exercise}
```{r, include=F, echo=F}
airquality %>%
ggplot(aes(Ozone, Temp, color=factor(Month))) + geom_point()
```
What's even neater about `ggplot` though is how easy it is to _layer_ different
visualisations of the same data. These visual layers are called `geom`s and the
functions which add them are all prefixed with `geom_`, so `geom_point()` for
scatter plots, or `geom_line()` for line plots, or `geom_smooth()` for a
smoothed line plot. We can add this to the scatter plot like so:
```{r}
mtcars %>%
ggplot(aes(x = disp, y = mpg, colour=hp)) +
geom_point(size=2) +
geom_smooth(se=F, colour="grey")
```
In the example above, I have also customised the smoothed line, making it grey
to avoid over-intrusion into our perception of the points. Often less is more
when plotting graphs: not everything can be emphasised at once, and _it's
important to make decisions about what should be given visual priority_.
##### Step 5: 'Splitting up' or repeating the plot. {-}
Very often, you will have drawn a plot and think things like _I wonder what that
would look like if I drew it for men and women separately?_. In `ggplot` this is
called facetting, and is easy to achieve, provided your data are in a long
format.
Using the same `mtcars` example, let's say we wanted separate panels for cars
with automatic vs. manual transmissions (information held in the `am` variable,
where 0 is automatic and 1 is manual). We simply add `facet_wrap()`, and specify
the `"am"` variable:
```{r}
mtcars %>%
  ggplot(aes(x = disp, y = mpg, colour=hp)) +
  geom_point(size=2) +
  geom_smooth(se=F, colour="grey") +
  facet_wrap("am")
```
One trick is to make sure factors are labelled nicely, because these labels
appear on the final plot. Here the `mutate()` call
[relabels the factor](real-data.html#factors-and-numerics) which makes the plot
easier to read:
```{r}
mtcars %>%
  mutate(transmission = factor(am, labels=c("Automatic", "Manual"))) %>%
  ggplot(aes(x = disp, y = mpg, colour=hp)) +
  geom_point(size=2) +
  geom_smooth(se=F, colour="grey") +
  facet_wrap("transmission")
```
[See the ggplot documentation on facetting for more details](http://ggplot2.tidyverse.org/reference/#section-facetting).
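The labelling and export options mentioned earlier can be sketched as follows. This is only one way to do it: the label text, the file name and the dimensions are illustrative choices, and the plot is written to a temporary directory so nothing in your project is overwritten.

```{r}
library(ggplot2)

# Build the scatter plot and give it readable labels
p.labelled <- ggplot(mtcars, aes(x = disp, y = mpg, colour = hp)) +
  geom_point(size = 2) +
  labs(
    title  = "Fuel economy and engine size",
    x      = "Displacement (cubic inches)",
    y      = "Miles per gallon",
    colour = "Horsepower"
  )

# Save as a PDF; ggsave() picks the graphics device from the file extension
out.file <- file.path(tempdir(), "mpg-by-disp.pdf")
ggsave(out.file, plot = p.labelled, width = 6, height = 4)
```

Changing the extension (for example to `.png`) changes the output format, and `width` and `height` are in inches by default.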
### 'Distributions' {-}
```{r}
lme4::sleepstudy %>%
ggplot(aes(Reaction)) + geom_density()
```
Imagine we wanted to compare distributions for individuals. Simply overlaying
the lines is confusing:
```{r}
lme4::sleepstudy %>%
ggplot(aes(Reaction, group=Subject)) + geom_density()
```
Facetting produces a nicer result:
```{r}
lme4::sleepstudy %>%
ggplot(aes(Reaction)) + geom_density() + facet_wrap("Subject")
```
But we could present the same information more compactly, and make it easier to
compare between subjects, if we use a violin plot:
```{r}
lme4::sleepstudy %>%
ggplot(aes(Subject, Reaction)) +
geom_violin()
```
We might want to plot our Subjects in order of their mean RT:
```{r}
mean.ranked.sleep <- lme4::sleepstudy %>%
group_by(Subject) %>%
# calculate mean RT
mutate(RTm = mean(Reaction)) %>%
# sort by mean RT
arrange(RTm, Days) %>%
ungroup() %>%
# create a rank score but convert to factor right away
mutate(SubjectRank = factor(dense_rank(RTm)))
mean.ranked.sleep %>%
ggplot(aes(SubjectRank, Reaction)) +
geom_violin() +
theme(aspect.ratio = .33) # change the aspect ratio to make long and wide
```
Or we might want to compare individuals against the combined distribution:
```{r}
# duplicate all the data, assigning one-replication to a single subject, "All"
sleep.repeat <- bind_rows(lme4::sleepstudy,
lme4::sleepstudy %>% mutate(Subject="All"))
sleep.repeat %>%
mutate(all = Subject=="All") %>%
ggplot(aes(Subject, Reaction, color=all)) +
geom_violin() +
guides(colour="none") + # turn off the legend because we don't really need it
theme(aspect.ratio = .25) # change the aspect ratio to make long and wide
```
Boxplots can also work well to show distributions, and have the advantage of
showing the median explicitly:
```{r}
mean.ranked.sleep %>%
ggplot(aes(SubjectRank, Reaction)) +
geom_boxplot()
```
If we plot the same data by-day, we can clearly see the effect of sleep
deprivation, and the increase in variability between subjects as time goes on:
the lack of sleep seems to be affecting some subjects more than others.
```{r}
lme4::sleepstudy %>%
ggplot(aes(factor(Days), Reaction)) +
geom_boxplot()
```
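The growing spread can also be checked numerically. The sketch below, which assumes `dplyr` and `lme4` are installed, summarises reaction times for each day; the column names `median.RT` and `sd.RT` are illustrative choices rather than anything used elsewhere in this chapter.

```{r}
library(dplyr)

# One row per day: the median and standard deviation of reaction times
spread.by.day <- lme4::sleepstudy %>%
  group_by(Days) %>%
  summarise(median.RT = median(Reaction),
            sd.RT = sd(Reaction))
spread.by.day
```

Comparing the `sd.RT` column across days gives a rough numerical counterpart to the widening boxes in the plot.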
### 'Comparisons' {-}
Imagine we have rainfall and temperature data for various regions, across
years:
```{r}
DAAG::bomregions %>%
psych::describe(fast=T) %>%
pander
```
##### {- #extract-to-split-column-names}
Because these data are in wide format (multiple columns contain the same
type of data) we need to first convert to long format. This process has become
known as [tidying](http://tidyr.tidyverse.org).
The code below selects the columns related to rainfall and temperature and then
'melts' the data to long format: one row per observation. Melting creates a
`variable` column holding the original column names, and we then use `extract()`
with a regular expression to split that column into the region and the type of
measurement:
```{r}
weather.data.long <- DAAG::bomregions %>%
select(Year, ends_with('Rain'), ends_with('AVt')) %>%
reshape2::melt(id.var="Year") %>%
extract(variable,
into=c("Region", "variable"),
regex="(east|north|se|south|west|sw|au)(\\w+)") %>%
filter(!is.na(variable))
weather.data.long %>% head
```
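The same reshaping can also be sketched with `tidyr::pivot_longer()`, which melts the data and splits the column names in a single step. The `toy` data frame below is invented purely for illustration (the values are not taken from `bomregions`), and the regular expression is shortened to match only its two regions.

```{r}
library(tidyr)

# Toy wide-format data, invented for illustration
toy <- data.frame(
  Year     = c(2001, 2002),
  eastRain = c(500, 450),
  westRain = c(300, 320),
  eastAVt  = c(20.1, 20.4),
  westAVt  = c(22.3, 22.0)
)

# Each capture group in names_pattern fills one of the names_to columns
toy.long <- pivot_longer(
  toy,
  cols          = -Year,
  names_to      = c("Region", "variable"),
  names_pattern = "(east|west)(\\w+)",
  values_to     = "value"
)
toy.long
```

The result has one row per year, region and measurement type, which is the same shape as `weather.data.long`.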
##### {.exercise}
Try stepping through the code above line by line and see what is produced at
each step.
With our data in long form we can use `ggplot` to plot our data over time. In
the plot below, we use `filter()` to select only the rainfall measurements:
```{r}
rain <-
weather.data.long %>%
filter(variable=='Rain')
rain %>%
ggplot(aes(Year, value)) +
geom_point() +
ylab('Rainfall (mm)')
```
There are many ways to extract structure from these data, and make comparisons
over time. One is to use colour:
```{r}
rain %>%
ggplot(aes(Year, value, color=Region)) +
geom_point() +
ylab('Rainfall (mm)')
```
It's now easy to see the crude differences between the regions (west appears the
driest, sw the wettest), but it's still hard to compare between regions, or over
time.
For this we can use various forms of summaries, for example a smoothed line plot
(the shaded area is the 95% confidence interval around the smoothed line), which
clearly shows the relative changes in rainfall in each region over time:
```{r}
rain %>%
ggplot(aes(Year, value, color=Region)) +
geom_smooth() +
ylab('Rainfall (mm)')
```
It can sometimes be desirable to include points to preserve the relationship
between the summary and the raw data:
```{r}
rain %>%
filter(Region %in% c('west', 'se')) %>%
ggplot(aes(Year, value, color=Region)) +
geom_point(alpha=.5, size=.5) +
geom_smooth(se=F) +
ylab('Rainfall (mm)')
```
If we weren't interested in the time series and just wanted to focus on the most
recent year, we might take a different approach. Here a bar plot is used to
compare between regions, with the national average (au) highlighted in blue:
```{r}
rain %>%
filter(Year == max(Year)) %>%
ggplot(aes(Region, value, fill=Region=="au")) +
stat_summary(geom="bar") +
ylab('Rainfall (mm)') +
guides(fill="none")
```
Finally, we shouldn't forget that this dataset included both rainfall and
temperature data, and we can display these in different ways, depending on what
our research question was.
In this case we can use facetting to display temperature and rainfall data side
by side. Because AVt and Rain are on such different scales we need to allow the
y axis to vary between variables:
```{r}
weather.data.long %>%
ggplot(aes(Year, value, color=Region)) +
geom_smooth() +
facet_wrap(~variable, scales="free")
```
<!-- ```{r} -->
<!-- dataurl <- function(year, country){ -->
<!-- sprintf("https://www.census.gov/population/international/data/idb/region.php?N=Results&T=10&A=separate&RT=0&Y=%.0f&R=-1&C=%s", 2017, "UK") -->
<!-- } -->
<!-- df <- read.csv(dataurl(2017, 'UK')) -->
<!-- ``` -->
### 'Composition' {-}
#### Waffle plots or 'pictograms' {-}
Waffle plots are a neat way of showing the relative frequency of different
categories.
In applied settings it's well known that when considering the risks or benefits
of interventions clinicians, patients and researchers benefit from statements
made using 'natural frequencies' [@gigerenzer2003simple], and pictographs or
'waffle plots' have been shown to provide patients and their families with a
better understanding of the risks of treatments [@tait2010presenting].
Waffle plots can be implemented in R via the `waffle::` package:
```{r}
outcomes <- c("gained weight"=13,
"no change"=27,
"lost 2kg or more" = 44,
"lost 5kg or more" = 16)
waffle::waffle(outcomes)
```
[Calculating these summary figures is left as an exercise for the reader, but
see the section on [summarising data with dplyr](#split-apply-combine).]{.tip}
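As a starting point for that exercise, here is one possible sketch using `dplyr`. Everything in it is hypothetical: the `trial` data frame, the `weight.change` column and the cut-offs used to define each category (including treating a gain of 1kg or more as 'gained weight') are invented for illustration.

```{r}
library(dplyr)

# Hypothetical data: change in weight (kg) for ten participants
trial <- data.frame(
  weight.change = c(1.2, 0.3, -2.5, -6.1, -0.4, -3.0, 2.0, -5.5, -2.2, 0.1)
)

# Classify each participant, then count how many fall in each category.
# case_when() uses the first condition that matches, so order matters.
outcome.counts <- trial %>%
  mutate(outcome = case_when(
    weight.change >= 1  ~ "gained weight",
    weight.change <= -5 ~ "lost 5kg or more",
    weight.change <= -2 ~ "lost 2kg or more",
    TRUE                ~ "no change"
  )) %>%
  count(outcome)

# Convert to a named vector of counts, the same form as `outcomes` above
outcomes.vec <- setNames(outcome.counts$n, outcome.counts$outcome)
outcomes.vec
```

A named vector like `outcomes.vec` can then be passed to `waffle::waffle()` in the same way as the hand-typed `outcomes` vector.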
This is one of those occasions where the default ggplot colours could probably
be improved, and we can do this by passing the
[names of the colours](#named-colours) we would like to use:
```{r}
weight.loss.colours <- c('firebrick2', 'darkgoldenrod1', 'palegreen3', 'palegreen4')
waffle::waffle(outcomes, colors = weight.loss.colours)
```
Selecting colours by hand isn't always the best way though: the
[`RColorBrewer` package provides some nice shortcuts](#color-brewer) for using
palettes from the excellent [ColorBrewer website](http://colorbrewer2.org).
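For ordinary `ggplot` graphics, a minimal sketch of using one of these palettes is shown below. It relies on `scale_colour_brewer()`, which is part of `ggplot2`; the choice of the `"Dark2"` palette and of the `mtcars` variables is purely illustrative.

```{r}
library(ggplot2)

# Colour points by number of cylinders using the ColorBrewer "Dark2" palette
p.brewer <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  scale_colour_brewer(palette = "Dark2")
p.brewer
```

Qualitative palettes such as this one suit unordered categories; ColorBrewer also provides sequential and diverging palettes for ordered data.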
#### Stacked bars {- #stacked-bars}
The `reshape2` package includes data on tipping habits in restaurants for male
and female bill-payers, and where the party was of various `size`s:
```{r}
reshape2::tips %>%
head %>%
pander
```
To begin, we might like to plot tips by time of day:
```{r}
reshape2::tips %>%
ggplot(aes(time, tip)) +
stat_summary(geom="bar")
```
And to facilitate the comparison between men and women we could colour portions
of the bars using `position_stack()`:
```{r}
reshape2::tips %>%
ggplot(aes(time, tip, fill=sex)) +
stat_summary(geom="bar", position=position_stack()) +
xlab("") + ylab("Tip ($)")
```
Or to reverse the comparisons:
```{r}
reshape2::tips %>%
ggplot(aes(sex, tip, fill=time)) +
stat_summary(geom="bar", position=position_stack()) +
xlab("") + ylab("Tip ($)")
```
## 'Quick and dirty' (utility) plots {- #utility-plotting-functions}
When exploring a dataset, it is often useful to use built-in functions or helpers from
other libraries. These help you quickly visualise relationships, but aren't
always _exactly_ what you need and can be hard to customise.
### Distributions
```{r}
hist(mtcars$mpg)
plot(density(mtcars$mpg))
boxplot(mpg~cyl, data=mtcars)
Hmisc::hist.data.frame(mtcars)
```
Even for simple plots, ggplot has some useful helper functions (although note that `qplot()` is deprecated as of ggplot2 3.4.0):
```{r}
qplot(mpg, data=mtcars, geom="density") + xlab("Miles per gallon")
qplot(x=factor(cyl), y=mpg, data=mtcars, geom="boxplot")
```
### Relationships
```{r}
with(mtcars, plot(mpg, wt))
pairs(select(mtcars, wt, disp, mpg))
```
Again, for quick plots ggplot also has useful shortcut functions:
```{r}
qplot(mpg, wt, color=factor(cyl), data = mtcars)
```
### Quantities
I don't think the base R plots are that convenient here. `ggplot2` and its
`stat_summary()` function make life much simpler:
```{r}
ggplot(mtcars, aes(factor(cyl), mpg)) +
stat_summary(geom="bar")
```
And if you are plotting quantities, as discussed above, showing a range is
sensible (a boxplot would also serve both purposes):
```{r}
ggplot(mtcars, aes(factor(cyl), mpg)) +
stat_summary(geom="pointrange")
```
<!-- ## Tufte in R -->
<!-- tufte graphics in r http://motioninsocial.com/tufte/ -->