---
title: "Statistics Module - Day 1"
output:
  html_document:
    df_print: paged
---
# CMM262: Statistics, Day 1 - Summary Statistics and Visualizing Data
**Credits**: Written by Graham McVicker. Some parts are based on a prior notebook developed by TAs Michelle Franc Ragsac ([email protected]) and Clarence Mah ([email protected]).
```{r}
# install required packages
install.packages("palmerpenguins", repos="https://cloud.r-project.org")
install.packages("ggridges", repos="https://cloud.r-project.org")
```
### Exercise 1 - Summary Statistics
> The Palmer Penguins dataset is based on data points collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).
>
> Source: https://allisonhorst.github.io/palmerpenguins/
Let's load the palmerpenguins package and the penguins dataset:
```{r}
library(palmerpenguins)
data(penguins, package='palmerpenguins')
```
The penguins dataset is stored as a [tibble](https://www.rstudio.com/blog/tibble-1-0-0/), which is a more user-friendly version of a data frame. We can view the first 20 rows of the table:
```{r}
head(penguins, 20)
```
We can get a quick summary of the whole dataset using 'summary':
```{r}
summary(penguins)
```
How many male and female chinstrap penguins are in the dataset?
```{r}
# first create vectors that can be used to pull out male or female chinstrap penguins
male.chinstraps = (penguins$species == "Chinstrap") & (penguins$sex == "male")
female.chinstraps = (penguins$species == "Chinstrap") & (penguins$sex == "female")
```
```{r}
n.male.chinstraps = sum(male.chinstraps)
cat("There are", n.male.chinstraps, "male chinstrap penguins\n")
```
```{r}
n.female.chinstraps = sum(female.chinstraps)
cat("There are", n.female.chinstraps, "female chinstrap penguins\n")
```
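The two counts above can also be read off a single cross-tabulation. The following is a minimal sketch using base R's table function; the variable name `sex.by.species` is introduced here for illustration only, and `useNA = "ifany"` is included on the assumption that some penguins have no recorded sex (the summary output above should confirm this).

```{r}
library(palmerpenguins)

# cross-tabulate species (rows) against sex (columns);
# useNA = "ifany" adds a column for penguins whose sex is missing
sex.by.species <- table(penguins$species, penguins$sex, useNA = "ifany")
sex.by.species
```

The Chinstrap row should match the counts printed by the two chunks above.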
What are the sample mean bill lengths for the male and female chinstrap penguins?
Males:
```{r}
# can compute manually
sum(penguins$bill_length_mm[male.chinstraps])/n.male.chinstraps
# or using the mean function
mean(penguins$bill_length_mm[male.chinstraps])
```
and Females:
```{r}
mean(penguins$bill_length_mm[female.chinstraps])
```
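The chinstrap subsets used above happen to contain no missing measurements, but that is not guaranteed for other subsets. As a small sketch of what to expect, assuming the full bill length column contains at least one NA (as the summary output suggests), mean returns NA unless missing values are dropped explicitly:

```{r}
library(palmerpenguins)

# mean() returns NA if any value in the vector is missing
mean(penguins$bill_length_mm)

# na.rm = TRUE removes missing values before computing the mean
mean(penguins$bill_length_mm, na.rm = TRUE)
```

The same `na.rm` argument is accepted by sd and median.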
Now compute the sample standard deviation of bill length for the female chinstrap penguins:
```{r}
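# bill lengths of the female chinstrap penguins
female.bills <- penguins$bill_length_mm[female.chinstraps]
# sample variance: sum of squared deviations from the mean, divided by n-1
sample.var <- sum((female.bills - mean(female.bills))^2) / (n.female.chinstraps - 1)
sqrt(sample.var)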
```
The n-1 in the denominator is known as "Bessel's correction". It makes the sample variance an unbiased estimate of the population variance. The sample standard deviation, being the square root of the sample variance, is still slightly biased, but the correction reduces that bias.
We can also use the sd function to get the same result:
```{r}
sd(penguins$bill_length_mm[female.chinstraps])
```
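Written out in symbols, the manual calculation and the sd function both compute the quantity below. The notation is introduced here for illustration: $x_1, \dots, x_n$ are the $n$ bill lengths in the sample, $\bar{x}$ is their sample mean, and $s$ is the sample standard deviation.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$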
Now compute the sample standard deviation for male chinstraps:
```{r}
sd(penguins$bill_length_mm[male.chinstraps])
```
Another common summary statistic is the median, which is the middle value of the sorted sample (or the mean of the two middle values when the sample size is even).
```{r}
median(penguins$bill_length_mm[male.chinstraps])
```
```{r}
median(penguins$bill_length_mm[female.chinstraps])
```
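Repeating these calculations by hand for every species and sex quickly becomes tedious. One possible shortcut, sketched here with base R's aggregate function, computes all three statistics for every group at once. The name `bill.summary` and the anonymous helper function are illustrative choices, not part of the original exercise.

```{r}
library(palmerpenguins)

# mean, standard deviation and median of bill length for every species/sex group;
# the formula interface drops rows with missing values by default
bill.summary <- aggregate(
  bill_length_mm ~ species + sex,
  data = penguins,
  FUN = function(x) c(mean = mean(x), sd = sd(x), median = median(x))
)
bill.summary
```

The Chinstrap rows should agree with the values computed step by step above.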
### Exercise 2 - Visualizing Data
Summary statistics are really useful, but they often obscure characteristics of the data, such as the shape of the distribution or the presence of outliers. Compute summary statistics, but also visualize your data!
A histogram provides perhaps the simplest way to visualize data. Let's start by looking at bill length and bill depth.
<!-- TODO: add image showing bill length and bill depth? -->
The simplest way to make a histogram is with the hist function.
```{r}
hist(penguins$bill_length_mm)
```
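We can use the breaks parameter to control the binning. A single number is treated as a suggestion for the number of bins, so the actual number of bins may differ slightly: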
```{r}
hist(penguins$bill_length_mm, breaks=30)
```
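When exact bin edges matter, breaks can instead be given as a vector of break points, which hist uses as supplied. The sketch below assumes a bin width of 2 mm (an arbitrary choice for illustration) and builds edges that span the observed range; `bill.length`, `bill.range`, `bin.edges` and `bill.hist` are names introduced here.

```{r}
library(palmerpenguins)

bill.length <- penguins$bill_length_mm
# build 2 mm wide bins that span the observed range (missing values ignored)
bill.range <- range(bill.length, na.rm = TRUE)
bin.edges <- seq(floor(bill.range[1]), ceiling(bill.range[2]) + 2, by = 2)
# a vector of breaks is used exactly as given, unlike a single number
bill.hist <- hist(bill.length, breaks = bin.edges)
```

The break points must cover every observed value, which is why the upper end of the sequence is padded by one bin width.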
[ggplot2](https://ggplot2.tidyverse.org/index.html) provides more powerful methods for plotting. ggplot2's syntax can take some getting used to, so I recommend keeping a [ggplot2 cheatsheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf) handy for reference.
```{r}
library(ggplot2)
ggplot(data=penguins, aes(bill_length_mm)) + geom_histogram()
```
Notice that the distribution looks bimodal. Remember that our dataset contains 3 different species of penguin, so let's see whether bill length differs by species by drawing overlapping histograms, with a different color for each species.
```{r}
ggplot(penguins, aes(x=bill_length_mm, fill=species)) + geom_histogram(position="identity", alpha=0.6, binwidth=2)
```
We can also visualize the distributions of the species as separate plots, using one of ggplot's faceting functions.
```{r}
ggplot(penguins, aes(x=bill_length_mm, fill=species)) + geom_histogram(binwidth=2) + facet_wrap(.~species)
```
If we want a more compact display, we can use boxplots or violin plots:
```{r}
ggplot(penguins, aes(x=bill_length_mm, fill=species)) + geom_boxplot()
```
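A violin plot of the same data can be drawn with ggplot2's geom_violin. This is a minimal sketch: placing species on the x axis and bill length on the y axis is a choice made here, and `violin.plot` is an illustrative variable name.

```{r}
library(palmerpenguins)
library(ggplot2)

# one violin per species; the width of each violin shows the estimated density
violin.plot <- ggplot(penguins, aes(x=species, y=bill_length_mm, fill=species)) +
  geom_violin()
violin.plot
```

Compared with a boxplot, the violin shows the shape of each distribution rather than only its quartiles.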
ggridges is a cool package that lets you plot overlapping density plots (smoothed histograms).
```{r}
library(ggridges)
ggplot(penguins, aes(x=bill_length_mm, y=species, fill=species)) + geom_density_ridges()
```
What if we want to view two variables at once, for example bill length and bill depth? We can use a scatter plot to view their joint distribution, with color and plotting shape indicating species and sex.
```{r}
ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, color=species, shape=sex)) + geom_point()
```
Let's plot the distributions of bill lengths of male and female penguins separately for each species.
```{r}
ggplot(penguins, aes(x=bill_length_mm, fill=sex)) + geom_histogram(position="identity", alpha=0.6, binwidth=2) + facet_wrap(.~species)
```
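Penguins with no recorded sex appear in the plot above as an extra NA group. If that group is not of interest, one option is to filter those rows out before plotting. This sketch uses base R's subset function; `penguins.known.sex` is a name introduced here for illustration.

```{r}
library(palmerpenguins)
library(ggplot2)

# keep only penguins whose sex was recorded
penguins.known.sex <- subset(penguins, !is.na(sex))

ggplot(penguins.known.sex, aes(x=bill_length_mm, fill=sex)) +
  geom_histogram(position="identity", alpha=0.6, binwidth=2) +
  facet_wrap(.~species)
```

Whether dropping these rows is appropriate depends on the question being asked, so it is worth noting how many penguins are removed.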