---
title: "Data exploration and univariate statistics"
---
<!-- Thanks to Martin Morgan for much of this material! -->
```{r style-A3, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
## Behavioral Risk Factor Surveillance System
We will explore a subset of data collected by the CDC through its
extensive Behavioral Risk Factor Surveillance System ([BRFSS][])
telephone survey. Follow the link for more information about the survey.

First, we need to get the data. Either download the file
from [THIS LINK](BRFSS-subset.csv) or have R download it directly
(preferred):
```{r}
download.file('https://raw.githubusercontent.com/seandavi/ITR/master/BRFSS-subset.csv',
destfile = 'BRFSS-subset.csv')
```
<!--
```{r echo=FALSE}
path <- "BRFSS-subset.csv"
```
-->
```{r ALL-choose-A3, eval=FALSE}
path <- file.choose() # look for BRFSS-subset.csv
```
```{r ALL-input-A3}
stopifnot(file.exists(path))
brfss <- read.csv(path)
```
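
Right after reading a file, it can help to check that each column came in with the type you expect. The sketch below is self-contained: it writes a tiny made-up CSV (the values are invented for illustration and are not BRFSS data) to a temporary file and reads it back with `read.csv()`. The column names are assumed to match the ones used later in this document.

```{r}
# Made-up rows for illustration only; these are NOT BRFSS values.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c("Weight,Sex,Height,Year",
             "48.9,Female,157.5,1990",
             "81.6,Male,178.0,2010",
             ",Female,165.1,2010"),   # Weight left empty on purpose
           toy_csv)

toy <- read.csv(toy_csv)
str(toy)              # one line per column: type and first values
sapply(toy, class)    # Year is integer; Sex is character in R >= 4.0
                      # (a factor in older versions)
colSums(is.na(toy))   # empty numeric fields become NA; count them per column
```

The same three calls can be run on `brfss` itself.
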
## Learn about the data
Use the data exploration techniques you have seen to explore the `brfss` dataset.

- `summary()`
- `dim()`
- `colnames()`
- `head()`
- `tail()`
- `class()`
- `View()`

You may want to investigate individual numeric columns visually with plots such as `hist()`. For categorical
data, consider tabulating the values with `table()`.
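
As a minimal sketch of how these functions behave, the chunk below uses a small made-up data frame (invented values, not BRFSS data) so that it runs on its own. The same calls apply to `brfss`.

```{r}
# Invented values for illustration; substitute brfss for toy_df in practice.
toy_df <- data.frame(
  Weight = c(48.9, 81.6, 63.5, 90.7, NA),
  Sex    = c("Female", "Male", "Female", "Male", "Female"),
  Year   = c(1990, 2010, 2010, 1990, 2010)
)

dim(toy_df)                      # number of rows, then columns
summary(toy_df$Weight)           # quartiles, mean, and the number of NAs
table(toy_df$Sex)                # counts per category
table(toy_df$Sex, toy_df$Year)   # cross-tabulate two categorical columns
hist(toy_df$Weight)              # distribution of a numeric column
```
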
## Clean data
_R_ read `Year` as an integer, but it really identifies a survey year (a category), so convert it to a `factor`.
```{r}
brfss$Year <- factor(brfss$Year)
```
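
To see what the conversion changes, the sketch below applies `factor()` to a short made-up vector of years (a stand-in for the real column, not the data itself).

```{r}
# A stand-in for brfss$Year; values invented for illustration.
year_int <- c(1990L, 2010L, 2010L, 1990L, 2010L)
year_fac <- factor(year_int)

class(year_int)    # "integer"
class(year_fac)    # "factor"
levels(year_fac)   # "1990" "2010"
table(year_fac)    # counts per level

# To recover the numeric years, go through the labels;
# as.integer(year_fac) alone would return the level codes 1 and 2.
as.integer(as.character(year_fac))
```

With a factor on the right-hand side of a formula, `plot()` draws one box plot per level instead of a scatter plot.
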
## Weight in 1990 vs. 2010 Females
- Create a subset of the data
```{r}
brfssFemale <- brfss[brfss$Sex == "Female",]
summary(brfssFemale)
```
- Visualize
```{r}
plot(Weight ~ Year, brfssFemale)
```
- Statistical test
```{r}
t.test(Weight ~ Year, brfssFemale)
```
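
The value returned by `t.test()` can be stored and inspected piece by piece. The sketch below does this on simulated data; the group means and standard deviation are arbitrary assumptions chosen for illustration, not BRFSS estimates. Note that `t.test()` does not assume equal variances by default, so it reports Welch's test.

```{r}
set.seed(1)  # make the simulation reproducible
# Simulated weights for two groups; the parameters are arbitrary assumptions.
sim <- data.frame(
  Weight = c(rnorm(50, mean = 65, sd = 12), rnorm(50, mean = 72, sd = 12)),
  Year   = factor(rep(c("1990", "2010"), each = 50))
)

tt <- t.test(Weight ~ Year, data = sim)
tt$estimate    # mean Weight in each group
tt$conf.int    # 95% confidence interval for the difference in means
tt$p.value     # p-value of the two-sided test
tt$method      # name of the test that was run
```
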
## Weight and height in 2010 Males
- Create a subset of the data
```{r}
brfss2010Male <- subset(brfss, Year == 2010 & Sex == "Male")
summary(brfss2010Male)
```
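
This document uses two subsetting idioms: `[` with a logical index and `subset()`. They differ when the condition evaluates to `NA`: `[` returns a row of `NA`s for it, while `subset()` drops it. A small sketch on made-up data (invented values, with one missing `Sex` entry added on purpose):

```{r}
# Invented values; one Sex entry is missing on purpose.
toy_sub <- data.frame(
  Sex    = c("Female", "Male", NA, "Female"),
  Weight = c(55, 80, 70, 62)
)

by_bracket <- toy_sub[toy_sub$Sex == "Female", ]
by_subset  <- subset(toy_sub, Sex == "Female")

nrow(by_bracket)   # 3: the NA comparison yields a row of NAs
nrow(by_subset)    # 2: subset() treats NA as FALSE

# which() drops NAs, so this matches the subset() result
nrow(toy_sub[which(toy_sub$Sex == "Female"), ])
```
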
- Visualize the relationship
```{r}
hist(brfss2010Male$Weight)
hist(brfss2010Male$Height)
plot(Weight ~ Height, brfss2010Male)
```
- Fit a linear model (regression)
```{r}
fit <- lm(Weight ~ Height, brfss2010Male)
fit
```
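
Printing a fitted model shows only the call and the coefficients. The sketch below, on simulated data, shows a few accessor functions that pull more out of an `lm` object; the intercept, slope, and noise level used to simulate are arbitrary assumptions, not estimates from BRFSS.

```{r}
set.seed(2)
# Simulated heights (cm) and weights (kg); the relationship is an assumption.
sim_h <- data.frame(Height = rnorm(100, mean = 178, sd = 8))
sim_h$Weight <- -80 + 0.95 * sim_h$Height + rnorm(100, sd = 10)

sim_fit <- lm(Weight ~ Height, data = sim_h)
coef(sim_fit)                # intercept and slope (kg per cm)
confint(sim_fit)             # 95% confidence intervals for both
summary(sim_fit)$r.squared   # proportion of variance explained
head(residuals(sim_fit))     # observed minus fitted Weight
```
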
- Summarize as an ANOVA table
```{r}
anova(fit)
```
- Plot the points and superimpose the fitted regression line; where am I?
```{r}
plot(Weight ~ Height, brfss2010Male)
abline(fit, col="blue", lwd=2)
# Substitute your own height (here 73 in, converted to cm)
# and weight (here 178 lb, converted to kg)...
points(73 * 2.54, 178 / 2.2, col="red", cex=4, pch=20)
```
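
Besides locating a point on the plot by eye, `predict()` can report the weight the model expects at a given height. The sketch below uses simulated data (slope, intercept, and noise are arbitrary assumptions) and a hypothetical person with the same example measurements, 73 in and 178 lb.

```{r}
set.seed(3)
# Simulated data; slope, intercept, and noise level are arbitrary assumptions.
sim_p <- data.frame(Height = runif(80, min = 160, max = 200))
sim_p$Weight <- -70 + 0.9 * sim_p$Height + rnorm(80, sd = 9)
sim_pfit <- lm(Weight ~ Height, data = sim_p)

# A hypothetical person: 73 in tall, 178 lb, converted to cm and kg.
me <- data.frame(Height = 73 * 2.54, Weight = 178 / 2.2)

expected <- predict(sim_pfit, newdata = me)   # fitted Weight at that Height
me$Weight - expected                          # > 0 means above the line
pred_int <- predict(sim_pfit, newdata = me, interval = "prediction")
pred_int                                      # columns: fit, lwr, upr
```
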
- Class and available 'methods'
```{r, eval=FALSE}
class(fit) # 'noun'
methods(class=class(fit)) # 'verb'
```
- Diagnostics
```{r, eval=FALSE}
plot(fit)
# Note that the call above is plot(), not plot.lm().
# However, R will use plot.lm(). Why?
?plot.lm
```
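
One way to approach the question above: `plot()` is a generic function that looks at the class of its first argument and calls a matching method. The sketch below shows the mechanism with a made-up generic and class; `describe` and `toy_model` are hypothetical names introduced here, not part of R.

```{r}
# A generic dispatches on class(x) via UseMethod().
describe <- function(x, ...) UseMethod("describe")
describe.default   <- function(x, ...) "no specific method; using default"
describe.toy_model <- function(x, ...) "describe.toy_model was called"

obj <- structure(list(), class = "toy_model")
describe(obj)   # dispatches to describe.toy_model
describe(42)    # falls back to describe.default

# plot() works the same way: a method is registered for class "lm"
is.function(getS3method("plot", "lm"))
```
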