# (PART) Multivariate Continuous {-}
# Chart: Scatterplot {#scatter}
![](images/banners/banner_scatterplot.png)
## Overview
This section covers how to make scatterplots.
## tl;dr
Fancy Example NOW! Gimme Gimme GIMME!
<!-- Explanation: -->
Here's a look at the relationship between brain weight and body weight for 62 species of land mammals:
```{r tldr-show-plot, echo=FALSE, warning=FALSE, fig.height=6, fig.width=9}
library(MASS) # data
library(ggplot2) # plotting

# ratio for color choices
ratio <- mammals$brain / (mammals$body * 1000)

ggplot(mammals, aes(x = body, y = brain)) +
  # plot points, group by color
  geom_point(aes(fill = ifelse(ratio >= 0.02, "#0000ff",
                        ifelse(ratio >= 0.01 & ratio < 0.02, "#00ff00",
                        ifelse(ratio >= 0.005 & ratio < 0.01, "#00ffff",
                        ifelse(ratio >= 0.001 & ratio < 0.005, "#ffff00", "#ffffff"))))),
             col = "#656565", alpha = 0.5, size = 4, shape = 21) +
  # add chosen text annotations
  geom_text(aes(label = ifelse(row.names(mammals) %in% c("Mouse", "Human", "Asian elephant", "Chimpanzee", "Owl monkey", "Ground squirrel"),
                               paste(as.character(row.names(mammals)), "→", sep = " "), '')),
            hjust = 1.12, vjust = 0.3, col = "grey35") +
  geom_text(aes(label = ifelse(row.names(mammals) %in% c("Golden hamster", "Kangaroo", "Water opossum", "Cow"),
                               paste("←", as.character(row.names(mammals)), sep = " "), '')),
            hjust = -0.12, vjust = 0.35, col = "grey35") +
  # customize legend/color palette
  scale_fill_manual(name = "Brain Weight, as the\n% of Body Weight",
                    # values = c('#e66101','#fdb863','#b2abd2','#5e3c99'),
                    values = c('#d7191c','#fdae61','#ffffbf','#abd9e9','#2c7bb6'),
                    breaks = c("#0000ff", "#00ff00", "#00ffff", "#ffff00", "#ffffff"),
                    labels = c("Greater than 2%", "Between 1%-2%", "Between 0.5%-1%", "Between 0.1%-0.5%", "Less than 0.1%")) +
  # formatting
  scale_x_log10(name = "Body Weight", breaks = c(0.01, 1, 100, 10000),
                labels = c("10 g", "1 kg", "100 kg", "10K kg")) +
  scale_y_log10(name = "Brain Weight", breaks = c(1, 10, 100, 1000),
                labels = c("1 g", "10 g", "100 g", "1 kg")) +
  ggtitle("An Elephant Never Forgets...How Big A Brain It Has",
          subtitle = "Brain and Body Weights of Sixty-Two Species of Land Mammals") +
  labs(caption = "Source: MASS::mammals") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68")) +
  theme(legend.position = c(0.832, 0.21))
```
And here's the code:
```{r tldr-code, eval=FALSE}
library(MASS) # data
library(ggplot2) # plotting

# ratio for color choices
ratio <- mammals$brain / (mammals$body * 1000)

ggplot(mammals, aes(x = body, y = brain)) +
  # plot points, group by color
  geom_point(aes(fill = ifelse(ratio >= 0.02, "#0000ff",
                        ifelse(ratio >= 0.01 & ratio < 0.02, "#00ff00",
                        ifelse(ratio >= 0.005 & ratio < 0.01, "#00ffff",
                        ifelse(ratio >= 0.001 & ratio < 0.005, "#ffff00", "#ffffff"))))),
             col = "#656565", alpha = 0.5, size = 4, shape = 21) +
  # add chosen text annotations
  geom_text(aes(label = ifelse(row.names(mammals) %in% c("Mouse", "Human", "Asian elephant", "Chimpanzee", "Owl monkey", "Ground squirrel"),
                               paste(as.character(row.names(mammals)), "→", sep = " "), '')),
            hjust = 1.12, vjust = 0.3, col = "grey35") +
  geom_text(aes(label = ifelse(row.names(mammals) %in% c("Golden hamster", "Kangaroo", "Water opossum", "Cow"),
                               paste("←", as.character(row.names(mammals)), sep = " "), '')),
            hjust = -0.12, vjust = 0.35, col = "grey35") +
  # customize legend/color palette
  scale_fill_manual(name = "Brain Weight, as the\n% of Body Weight",
                    values = c('#d7191c','#fdae61','#ffffbf','#abd9e9','#2c7bb6'),
                    breaks = c("#0000ff", "#00ff00", "#00ffff", "#ffff00", "#ffffff"),
                    labels = c("Greater than 2%", "Between 1%-2%", "Between 0.5%-1%", "Between 0.1%-0.5%", "Less than 0.1%")) +
  # formatting
  scale_x_log10(name = "Body Weight", breaks = c(0.01, 1, 100, 10000),
                labels = c("10 g", "1 kg", "100 kg", "10K kg")) +
  scale_y_log10(name = "Brain Weight", breaks = c(1, 10, 100, 1000),
                labels = c("1 g", "10 g", "100 g", "1 kg")) +
  ggtitle("An Elephant Never Forgets...How Big A Brain It Has",
          subtitle = "Brain and Body Weights of Sixty-Two Species of Land Mammals") +
  labs(caption = "Source: MASS::mammals") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68")) +
  theme(legend.position = c(0.832, 0.21))
```
For more info on this dataset, type `?MASS::mammals` into the console.
And if you are going crazy not knowing what species is in the top right corner, it's another elephant. Specifically, it's the African elephant. It also never forgets how big a brain it has. <i class="far fa-smile-beam"></i>
## Simple examples
<!-- Simplify Note -->
That was *too* fancy! Much simpler please!
<!-- Simple Explanation of Data: -->
Let's use the `SpeedSki` dataset from `GDAdata` to look at how the speed achieved by the participants relates to their birth year:
```{r simple-example-data}
library(GDAdata)
head(SpeedSki, n = 7)
```
### Scatterplot using base R
```{r base-r}
x <- SpeedSki$Year
y <- SpeedSki$Speed
# plot data
plot(x, y, main = "Scatterplot of Speed vs. Birth Year")
```
<!-- Base R Plot Explanation -->
Base R scatterplots are easy to make. All you need are the two variables you want to plot. Although scatterplots can be made with categorical data, the variables you are plotting will usually be continuous.
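A few extra arguments go a long way in base R. The chunk below is a rough sketch, not the only way to do it: it uses the formula interface of `plot()` on the `mammals` data from the tl;dr (swapped in for `SpeedSki` only so that the log axes have something to show), and the labels and point styling are illustrative choices.

```{r}
# mammals: body weight in kg, brain weight in g (see ?MASS::mammals)
mammals_df <- MASS::mammals

# formula interface: y ~ x, with both axes on a log scale
plot(brain ~ body, data = mammals_df,
     log = "xy",                                  # log-scale both axes
     pch = 16,                                    # solid circles
     col = adjustcolor("black", alpha.f = 0.5),   # semi-transparent points
     xlab = "Body Weight (kg)", ylab = "Brain Weight (g)",
     main = "Brain vs. Body Weight of Land Mammals")
```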
### Scatterplot using ggplot2
```{r ggplot}
library(GDAdata) # data
library(ggplot2) # plotting
# main plot
scatter <- ggplot(SpeedSki, aes(Year, Speed)) + geom_point()
# show with trimmings
scatter +
  labs(x = "Birth Year", y = "Speed Achieved (km/hr)") +
  ggtitle("Ninety-One Skiers by Birth Year and Speed Achieved")
```
<!-- ggplot2 explanation -->
`ggplot2` makes it very easy to create scatterplots. With `geom_point()`, you map one variable to the `x` aesthetic and another to `y`, and each observation is drawn as a point. All that is really necessary is the data, the aesthetics, and the geom; extra formatting, such as axis labels and a title, is simple to add on to make your plots look nice.
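Here is a minimal sketch of that layering idea, again using `MASS::mammals` rather than `SpeedSki`. The `brain_pct` column and the `mammals_df` and `p` names are made up for this illustration: the first call has only the essentials (plus a third aesthetic, `color`), and everything after it is optional.

```{r}
library(ggplot2)

mammals_df <- MASS::mammals   # body in kg, brain in g
# brain weight as a percentage of body weight (body converted to grams)
mammals_df$brain_pct <- 100 * mammals_df$brain / (mammals_df$body * 1000)

# the essentials: data, aesthetics, geom
p <- ggplot(mammals_df, aes(x = body, y = brain, color = brain_pct)) +
  geom_point(size = 3)

# optional layers, added with `+`
p +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Body Weight (kg)", y = "Brain Weight (g)",
       color = "Brain as %\nof body")
```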
## Theory
Scatterplots are very useful for understanding the correlation (or lack thereof) between variables. For example, in [section 13.2](scatter.html#tldr-7), notice the positive relationship between brain and body weight in species of land mammals. A scatterplot gives a good idea of whether a relationship is positive or negative and how strong it is. However, don't mistake correlation in a scatterplot for causation!
Below we show variations on the scatterplot which can be used to enhance interpretability.
<!-- *Link to textbook -->
* For more info about adding lines/contours, comparing groups, and plotting continuous variables, check out [Chapter 5](http://www.gradaanwr.net/content/ch05/){target="_blank"} of the textbook.
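To put a number on the brain/body relationship mentioned above, one option is to compute a correlation coefficient and overlay a fitted line. This is only a sketch: taking logs first and using a straight-line fit are assumptions made for illustration, and `r_log` is a name invented here.

```{r}
library(ggplot2)

mammals_df <- MASS::mammals   # body in kg, brain in g

# Pearson correlation on the log10 scale, where the point cloud is roughly linear
r_log <- cor(log10(mammals_df$body), log10(mammals_df$brain))
r_log

# same data with a least-squares line; the fit is computed on the
# log-transformed values because the scales are applied before the stat
ggplot(mammals_df, aes(body, brain)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10() +
  scale_y_log10()
```

A coefficient close to 1 agrees with what the plot shows, but it says nothing about why the two weights move together.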
## When to use
<!-- Quick Note on When to use this plot -->
Scatterplots are great for exploring relationships between variables. Basically, if you are interested in how variables relate to each other, the scatterplot is a great place to start.
## Considerations
<!-- * List of things to pay attention to with examples -->
### Overlapping data
Data with similar values will overlap in a scatterplot, which can hide how many observations sit at the same spot. Consider exploring [alpha blending](iris.html#aside-example-where-alpha-blending-works) or [jittering](iris.html#second-jittering) as remedies (links from the [Overlapping Data](iris.html#overlapping-data) section of the [Iris Walkthrough](iris.html)).
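As a quick sketch of both remedies, the chunk below simulates rounded data so that many points land on exactly the same coordinates. The data, the seed, and the `alpha` and jitter amounts are all assumptions picked for illustration.

```{r}
library(ggplot2)

set.seed(1)   # fixed seed so the simulated data are reproducible
rounded <- data.frame(x = round(rnorm(500, mean = 5, sd = 1)),
                      y = round(rnorm(500, mean = 5, sd = 1)))

# 500 rows, but far fewer distinct (x, y) positions
n_distinct_points <- nrow(unique(rounded))
n_distinct_points

# alpha blending: darker spots mean more points stacked on top of each other
ggplot(rounded, aes(x, y)) +
  geom_point(alpha = 0.1, size = 3)

# jittering: add a little random noise so stacked points spread out
ggplot(rounded, aes(x, y)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.5)
```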
### Scaling
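Consider how axis scaling changes the way your data will be perceived. Here the same data, two clusters centered near x = 10 and x = 100, are plotted first on a linear x-axis and then on a log-10 x-axis: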
```{r scaling-fix}
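library(ggplot2)

set.seed(1) # fixed seed so the simulated data are reproducible
num_points <- 100

# two clusters on the x-axis: one centered at 100, one at 10
wide_x <- c(rnorm(n = num_points / 2, mean = 100, sd = 2),
            rnorm(n = num_points / 2, mean = 10, sd = 2))
wide_y <- rnorm(n = num_points, mean = 5, sd = 2)
df <- data.frame(wide_x, wide_y)

# linear x-axis
ggplot(df, aes(wide_x, wide_y)) +
  geom_point() +
  ggtitle("Linear X-Axis")

# log-10 x-axis
ggplot(df, aes(wide_x, wide_y)) +
  geom_point() +
  ggtitle("Log-10 X-Axis") +
  scale_x_log10()
```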
```
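The arithmetic behind the two pictures can be sketched directly from the cluster means and standard deviation used above. The variable names here are hypothetical, and "mean plus or minus two standard deviations" is just a convenient stand-in for the width of a cluster.

```{r}
centers <- c(low = 10, high = 100)   # cluster means used above
spread  <- 2                         # standard deviation used above

# distance between the two cluster centers on each scale
linear_gap <- unname(diff(centers))          # in original units
log_gap    <- unname(diff(log10(centers)))   # in log10 units

# width of each cluster (mean +/- 2 sd) on the log10 scale
log_width <- log10(centers + 2 * spread) - log10(centers - 2 * spread)
log_width
```

On the log axis the cluster near 10 takes up far more room than the cluster near 100, even though both were generated with the same standard deviation.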
## Modifications
### Contour lines
<!-- blurb -->
Contour lines give a sense of the density of the data at a glance.
For these contour maps, we will use the `SpeedSki` dataset.
Contour lines can be added to the plot call using `geom_density_2d()`:
```{r}
ggplot(SpeedSki, aes(Year, Speed)) +
  geom_density_2d()
```
Contour lines work best when combined with other layers:
```{r}
ggplot(SpeedSki, aes(Year, Speed)) +
  geom_point() +
  geom_density_2d(bins = 5)
```
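If contour lines alone feel too sparse, filled contour bands are another option. The sketch below assumes a version of `ggplot2` that provides `geom_density_2d_filled()` (3.3.0 or later, as far as we know) and uses the built-in `faithful` data instead of `SpeedSki` so that it runs without any extra data packages.

```{r}
library(ggplot2)

# filled density bands underneath, raw points on top
density_plot <- ggplot(faithful, aes(eruptions, waiting)) +
  geom_density_2d_filled(alpha = 0.6) +
  geom_point(size = 0.8)

density_plot
```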
### Scatterplot matrices
If you want to compare multiple variables to each other, consider using a scatterplot matrix. This will allow you to show many pairwise comparisons in a compact and efficient manner.
For these scatterplot matrices, we will use the `movies` dataset from the `ggplot2movies` package.
By default, the base R `plot()` function creates a scatterplot matrix when given a data frame with more than two columns:
```{r message=FALSE, fig.width=7, fig.height=7}
library(ggplot2movies) # data
library(dplyr) # manipulation
set.seed(1) # fixed seed so the same 500 movies are sampled each time
index <- sample(nrow(movies), 500) # sample row numbers
moviedf <- movies[index, ] # data frame
splomvar <- moviedf %>%
  dplyr::select(length, budget, votes, rating, year)
plot(splomvar)
```
While this is quite useful for personal exploration of a dataset, it is **not** recommended for presentation purposes. The [Hermann grid illusion](https://en.wikipedia.org/wiki/Grid_illusion){target="_blank"}, an optical illusion triggered by the grid of panels, makes this plot very difficult to examine.
To avoid this problem, consider using the `splom()` function from the `lattice` package:
```{r, fig.width=7, fig.height=7}
library(lattice) #sploms
splom(splomvar)
```
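`splom()` also takes a handful of arguments for decluttering the panels. The chunk below is a sketch on the built-in `iris` measurements rather than the sampled movies, and the particular values are illustrative, not recommendations.

```{r}
library(lattice)

# four numeric columns from the built-in iris data
sp <- splom(iris[, 1:4],
            pch = 16, cex = 0.4,   # small solid points
            pscales = 0,           # drop axis ticks and labels in each panel
            varname.cex = 0.8)     # smaller variable names on the diagonal
sp
```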
## External resources
<!-- - []](){target="_blank"}: Links to resources with quick blurb -->
- [Quick-R article](https://www.statmethods.net/graphs/scatterplot.html){target="_blank"} about scatterplots using Base R. Goes from the simple into the very fancy, with Matrices, High Density, and 3D versions.
- [STHDA Base R](http://www.sthda.com/english/wiki/scatter-plots-r-base-graphs){target="_blank"}: article on scatterplots in Base R. More examples of how to enhance the humble graph.
- [STHDA ggplot2](http://www.sthda.com/english/wiki/ggplot2-scatterplot-easy-scatter-plot-using-ggplot2-and-r-statistical-software){target="_blank"}: article on scatterplots in `ggplot2`. Heavy on the formatting options available and facet wraps.
- [Stack Overflow](https://stackoverflow.com/questions/15624656/label-points-in-geom-point){target="_blank"} on adding labels to points from `geom_point()`
- [ggplot2 cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf){target="_blank"}: Always good to have close by.