# Chart: Cleveland Dot Plot {#cleveland}
![](images/banners/banner_cleveland.png)
*This page is a work in progress. We appreciate any input you may have. If you would like to help improve this page, consider [contributing to our repo](contribute.html).*
## Overview
This section covers how to make Cleveland dot plots. Cleveland dot plots are a great alternative to a simple bar chart, particularly if you have more than a few items. It doesn’t take much for a bar chart to look cluttered. A dot plot can show many more values in the same amount of space, and it’s easier to read as well. R has a built-in function, `dotchart()`, but the graph is simple enough that drawing it “from scratch” in *ggplot2* or base graphics gives you more room for customization.
```{r ggdot, fig.height = 6, fig.width = 5, echo = FALSE, warning = FALSE, message = FALSE}
library(tidyverse)

# create a theme for dot plots, which can be reused
theme_dotplot <- theme_bw(14) +
  theme(axis.text.y = element_text(size = rel(.75)),
        axis.ticks.y = element_blank(),
        axis.title.x = element_text(size = rel(.75)),
        panel.grid.major.x = element_blank(),
        # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
        panel.grid.major.y = element_line(linewidth = 0.5),
        panel.grid.minor.x = element_blank())

# move row names to a dataframe column
df <- swiss %>% tibble::rownames_to_column("Province")

# create the plot, ordering provinces by fertility
ggplot(df, aes(x = Fertility, y = reorder(Province, Fertility))) +
  geom_point(color = "blue") +
  scale_x_continuous(limits = c(35, 95),
                     breaks = seq(40, 90, 10)) +
  theme_dotplot +
  xlab("\nannual live births per 1,000 women aged 15-44") +
  ylab("French-speaking provinces\n") +
  ggtitle("Standardized Fertility Measure\nSwitzerland, 1888")
```
The code:
```{r ref.label='ggdot', eval=FALSE}
```
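For comparison, the built-in `dotchart()` mentioned in the overview can draw a similar plot in a few lines. This is a minimal sketch using the same `swiss` data; the axis label, title, and point settings are our own choices rather than part of the example above:

```{r}
# named vector of fertility values, sorted so the dots rise from bottom to top
fert <- sort(setNames(swiss$Fertility, rownames(swiss)))

# dotchart() uses the names of the vector as row labels
dotchart(fert, pch = 19, cex = 0.7,
         xlab = "Standardized fertility measure",
         main = "Switzerland, 1888")
```

The result is quick to produce, but changing the grid lines, fonts, or colors is more awkward than adjusting the *ggplot2* theme above.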
## Multiple dots
For this example we'll use 2010 data on SAT mean scores for a sample of New York City public schools:
```{r}
# read the data, treating "s" entries as missing
df <- read_csv("data/SAT2010.csv", na = "s")

# sample 20 schools with reported scores, then reshape to one row per school and test
set.seed(5293)
tidydf <- df %>%
  filter(!is.na(`Critical Reading Mean`)) %>%
  sample_n(20) %>%
  rename(Reading = "Critical Reading Mean", Math = "Mathematics Mean",
         Writing = "Writing Mean") %>%
  gather(key = "Test", value = "Mean", "Reading", "Math", "Writing")

ggplot(tidydf, aes(Mean, `School Name`, color = Test)) +
  geom_point() +
  ggtitle("Schools are sorted alphabetically", subtitle = "not the best option") +
  ylab("") +
  theme_dotplot
```
Note that the schools are ordered by the factor levels of `School Name`, which are alphabetical by default. A better choice is to sort by the scores for one of the levels of `Test`. It's usually best to try sorting on different factor levels and observe the patterns that appear.
To perform the double sort, that is, sorting `School Name` by `Test` *and then* `Mean`, we use `forcats::fct_reorder2()`. This function reorders the levels of `.f` (a factor or character vector) using two sorting vectors, `.x` and `.y`. For this type of plot, `.x` is the variable represented by the colored dots and `.y` is the continuous variable mapped to the x-axis.
Suppose we wish to sort the schools by mean reading score. We can do this by limiting the `Test` variable to "Reading" when sorting on `Mean`:
```{r}
ggplot(tidydf,
       aes(Mean,
           fct_reorder2(`School Name`, Test == "Reading", Mean, .desc = FALSE),
           color = Test)) +
  geom_point() +
  ggtitle("Schools sorted by Reading mean") +
  ylab("") +
  theme_dotplot
```
(Many thanks to Zeyu Qiu for the tip on setting `.x` directly to the factor level, a much better approach than reordering factor levels to conform with `fct_reorder2()` defaults, as discussed below.)
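To check what the logical `.x` is doing, it can help to run the same call on a tiny data set where the answer is easy to see by eye. The data frame `toy` below and its values are hypothetical, made up only for this illustration:

```{r}
library(forcats)

# hypothetical toy data: three schools, three tests each
toy <- data.frame(
  school = rep(c("A", "B", "C"), each = 3),
  test   = rep(c("Math", "Reading", "Writing"), times = 3),
  mean   = c(500, 400, 450,   # school A
             420, 480, 430,   # school B
             460, 440, 470)   # school C
)

# TRUE sorts after FALSE, so the "last" value of mean within each school
# is the one on the Reading row
by_reading <- fct_reorder2(toy$school, toy$test == "Reading", toy$mean,
                           .desc = FALSE)
levels(by_reading)
# "A" "C" "B"  (Reading means 400, 440, 480)
```

With `.desc = FALSE` the lowest Reading mean becomes the first level, which *ggplot2* draws at the bottom of the y-axis.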
While this is the go-to method, in some cases it's easier to sort by the first or the last factor level of the first sorting variable (`Test`) without naming the level explicitly.
If a factor level is not specified, `fct_reorder2()` by default will sort on the *last* factor level of `.x`. In this case, "Writing" is the last factor level of `Test`:
```{r}
ggplot(tidydf,
       aes(Mean,
           fct_reorder2(`School Name`, Test, Mean, .desc = FALSE),
           color = Test)) +
  geom_point() +
  ggtitle("Schools sorted by Writing mean") +
  ylab("") +
  theme_dotplot
```
If you want to sort by the *first* factor level of `.x`, "Math" in this case, you'll need a version of **forcats** that provides `first2()`. It was added in forcats 0.4.0, so a current CRAN release is enough; if your installed version is older, update it with:
`install.packages("forcats")`
Then change the default sorting function, `last2()`, to `first2()`:
```{r}
ggplot(tidydf,
       aes(Mean,
           fct_reorder2(`School Name`, Test, Mean, .fun = first2, .desc = FALSE),
           color = Test)) +
  geom_point() +
  ggtitle("Schools sorted by Math mean") +
  ylab("") +
  theme_dotplot
```
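The difference between the two helper functions is easiest to see by calling them directly. This is a small sketch with made-up scores for a single hypothetical school; `x` and `y` are illustrative names, not objects from the examples above:

```{r}
library(forcats)

x <- c("Reading", "Writing", "Math")  # test names, deliberately unsorted
y <- c(400, 450, 500)                 # matching mean scores

# last2() returns the y value that goes with the last x in sorted order
last2(x, y)
# 450, the Writing score

# first2() returns the y value that goes with the first x in sorted order
first2(x, y)
# 500, the Math score
```

`fct_reorder2()` applies the chosen helper within each level of `.f` and then orders the levels by the resulting values, which is why switching `.fun` changes which test drives the sort.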