-
Notifications
You must be signed in to change notification settings - Fork 1
/
day1b.Rmd
455 lines (335 loc) · 19.7 KB
/
day1b.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
---
title: |
| Introduction to R for Disease Surveillance
| and Outbreak Forecasting: Day 1
subtitle: First steps toward using R - basic data objects and functions
author: |
| Michael Wimberly, Dawn Nekorchuk and Andrea Hess,
| Department of Geography and Environmental Sustainability, University of Oklahoma
date: October 22 2018, Bahir Dar, Ethiopia
output:
pdf_document:
toc: true
toc_depth: 2
number_sections: true
html_document: default
---
\pagenumbering{gobble}
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This tutorial will provide an introduction to R, an open-source computer software system for designed for statistical computation and graphics. R includes an interpreted language combined with a run-time environment that provides scripting, graphics, and debugging capabilities. By learning how to use the R language and environment, you will gain access to an large collection of functions for data processing, data visualization, and statistical modeling that are freely available through add-on packages.
Today, we will start with the basics. You will first learn about the R *objects* that store various types of data. Then you will learn how to manipulate these objects using *functions* that typically take one or more objects as inputs, and then modify these inputs to produce a new object as the output. These concepts will be demonstrated by showing how R can be used to perform practical tasks, including making simple calculations, applying these calculations over larger datasets, querying subsets of data that meet one or more conditions, and carrying out standard statistical tests.
# Working with R objects
## Using R as a calculator
At the most basic level, R can be used as a simple calculator. You can type in just about any mathematical expression on the command line in the console, hit the Enter key, and R will give the answer. Alternately, you can type the mathematical expressions as lines in a script file and then Run one or more lines, which is the way that we will be presenting this demonstration. The output is printed to the console by default.
```{r expressions}
33
23 + 46
(160 + 13) * 2.1
237.81 / (3.7^2 + sqrt(165.3)) - 26.23
```
In addition to just printing the results, we can save the outputs of mathematical expressions as variables using the left assignment operator `<-`. Creating variables allows us to store and reuse the information at a later time. R stores each variable as an *object*. It is helpful to choose object names that make it easy to remember the type of information that is stored.
```{r variables}
x <- 15
x
y <- 23 + 15 / 2
y
z <- x + y
z
case <- 238
population <- 34226
incidence <- 10000 * case/population
incidence
```
When we enter an object name on the command line or specify an object name on the line of a script, R will invoke the `print` method by default and output its value to the console. Note that the following two lines produce the same output.
```{r variables 2}
x
print(x)
```
## Vectors
When we work with real data, we typically need to keep track of multiple measurements of our variables. For example, we may have data on malaria cases at multiple health facilities and data collected over multiple weeks, months, and years. Therefore, we usually work with vectors, which are objects that can contain multiple values. For example, consider that we have malaria case data for twelve months of the year. We can use the `c()` (combine) function to assign these data to a vector object.
```{r vectors}
cases <- c(43, 56, 23, 67, 81, 150, 110, 122, 161, 238, 138, 74)
cases
```
In R, there are several handy functions that can be used to create regular sequences of numbers or repeated values. In this example, we would like to have a a vector containing month numbers from 1 through 12, and another vector that repeats the same annual population value for each month. We will accomplish this task by calling some R functions and providing arguments that specify how to create the vectors.
```{r vectors 2}
month <- seq(from=1, to=12, by=1)
month
month2 <- 1:12
month2
pop <- rep(x=population, times=12)
pop
```
One of the best features of R is that it allows us to to use vector arithmetic to carry out element-wise mathematical operations without having to write code for looping. Note that if we specify a a single value, it will automatically be repeated for the entire length of the vector. So the following two statements produce the same output.
```{r vectors 3}
cases * 2
cases2 <- c(22, 5, 13, 16, 22, 34, 24, 56, 431, 88, 67, 45)
totcases <- cases + cases2
totcases
inc <- 10000 * cases/pop
inc
```
Vectors can also be used in more complex mathematical expressions and supplied as arguments to functions.
```{r vectors 4}
mean(inc)
sum(cases)
length(inc)
sum(inc) / length(inc)
var(cases)
sum((cases - mean(cases))^2) / (length(cases)-1)
```
Subsets of data can be selected by specifying index numbers within brackets. Positive numbers are used to include subsets, and negative numbers are used to exclude subsets.
```{r vectors 5}
cases[5]
cases[c(1, 3, 8, 11)]
cases[9:12]
cases[-2]
```
Logical vectors can be used to select subsets that meet a given criterion or to assign a new value to a subset
```{r vectors 6}
inc > 10
incgt10 <- inc > 10
class(incgt10)
inc[incgt10]
inc2 <- inc[inc > 10]
inc2
```
In addition to numeric and logical vectors, we can also have vectors of character data such as month names and we can be subset our data using these names.
```{r vectors 7}
mname <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
class(mname)
names(inc) <- mname
inc[c("Jun", "Jul", "Aug")]
```
There is also a special type of vector called a factor that is used for categorical data analysis. If we wanted to use the vector of woreda names in an analysis comparing statistics across the different woredas, we need to convert it into a factor.
```{r vectors 9}
worname <- c("Mecha", "Dembecha", "Libokemkem", "Fogera", "South Achefer")
worfact <- factor(worname)
worname
class(worfact)
worfact
```
In some cases, we may also want to convert numerical vectors to factors. Consider an experiment in which patients have either been given an experimental drug treatment or placed in a control group. The treatment category is represented in the data by integer codes, where 1 = treatment and 2 = control. The following code converts the numerical data into a factor.
```{r}
treat_num <- c(2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1)
summary(treat_num)
treat_fac <- factor(treat_num, labels = c("treatment", "control"))
summary(treat_fac)
```
## Matrices and lists
Multiple vectors of the same length can be combined to create a matrix, which is a two-dimensional object with columns and rows. All of the values in a matrix must be the same data type (for example, numeric, text, or logical). We can use the `cbind()` function to combine the vectors as columns and the 'rbind()' function to combine the vectors as rows.
```{r matrices 1}
mat1 <- cbind(month, cases, pop, inc)
mat1
class(mat1)
rownames(mat1)
colnames(mat1)
```
Matrix elements can be extracted via subscripting in [row, column] format. Leaving a subscript blank will return all columns or rows.
```{r matrices 2}
dim(mat1)
mat1[1,1]
mat1[,4]
mat1[2,]
```
Lists are ordered collections of objects. They are created using the list function and can be subscripted by component number or component name. Lists differ from vectors and matrices in that they can contain a mixture of different data types.
```{r lists}
my_list <- list(mycases = cases, mypop = population)
my_list
my_list[[1]]
my_list[[2]]
my_list$mycases
my_list[["mypop"]]
```
## Data frames
Data frames are like matrices in that they have a rectangular format consisting of columns and rows, and are like lists in that they can contain columns with multiple data types such as numeric, logical, character, and factor.
```{r dataframes 1}
epi_data <- data.frame(mname, month, cases, pop, inc)
attributes(epi_data)
summary(epi_data)
```
The columns of a data frame can be accessed as list elements.
```{r dataframes 2}
epi_data$mname
epi_data$inc
```
Data frames can also be accessed via matrix style subscripting of rows and columns.
```{r dataframes 3}
epi_data[3, 4]
epi_data[1:3,]
epi_data[,3]
epi_data[,"inc"]
```
Conditional statements can also be used to extract data records meeting certain conditions. Alternately, the `subset()` function can be used to select rows from the data frame using a conditional statement.
```{r dataframes 4}
epi_data[epi_data$inc > 10,]
epi_data[epi_data$mname == "Jun",]
new_data <- epi_data[epi_data$month >= 7,]
new_data
new_data2 <- subset(epi_data, month >= 7)
new_data2
```
# Tibbles
Throughout the rest of this workshop we will be working with "tibbles" instead of R's traditional "data frame". Tibbles actually _are_ data frames, but they change the way data frames work to reduce some of the problems caused by the limitations of the older `data.frame` class.
If you're wondering where the name came from, tibbles originally had the class `tbl_df`, which stood for "table" and "data frame". People soon began pronouncing this new class "tibble diff", and ultimately just "tibble".
Tibbles are defined in the tibble package, which can be loaded with the library function.
```{r}
library(tibble)
```
If you run this code and get the error message “there is no package called ‘tibble’”, you’ll need to first install it, then run `library()` once again.
```{r, eval=FALSE}
install.packages("tibble")
library(tibble)
```
You only need to install a package once, but you need to reload it every time you start a new session.
## Creating tibbles
Most of the functions in the tidyverse create tibbles, and all of them can accept tibbles as arguments. However many packages in R use traditional data frames. in which case you might want to coerce a data frame to a tibble using `as_tibble()`. Here is an example using the built in `iris` dataset, a data frame.
```{r}
as_tibble(iris)
```
Here you can see the built-in iris dataset has 150 rows and 5 columns.
You can create a new tibble from individual vectors with `tibble()`. `tibble()` will automatically recycle inputs of length 1, and allows you to refer to variables that you just created, as shown below.
```{r}
df <- tibble(
woreda = "Mecha",
iso_week = 1:5,
iso_year = 2017,
cases = c(10, 12, 7, 9, 5),
population = 5129,
incidence = cases / population * 1000
)
df
```
## Printing tibbles
Like other objects, a tibble can be printed in two ways.
```{r}
print(df)
```
or simply
```{r}
df
```
There are three main differences between how a tibble is printed in your console and how a data frame is printed. We use a large dataset containing epidemiological data to illustrate some differences. First we load the data.
```{r}
library(readr)
data <- read_csv(file = "data_day1-2.csv")
```
We won't print it here because it is too long, but to see what the data looks like in your console when printed as a `data.frame`, run this code:
```{r, results="hide"}
as.data.frame(data)
```
Unless you have a very wide screen, you probably only see the last few columns and some of the rows, along with the message `reached getOption("max.print")`. The column names and first few rows are not visible unless you scroll up in your console window quite a bit. Compare that to how `data_day1-2` looks when printed as a tibble:
```{r}
data
```
First, only the first 10 rows of the data frame are printed. Second, only columns that fit on the screen are printed. The remainder are listed at the bottom. Third, the class of each column is given beneath its name.
As a whole, these improvements make printing tibbles much easier on the eyes than printing data frames.
## Using tibbles with older functions
Some older functions don’t work with tibbles. If you find yourself unable to run an older function with a tibble, use `as.data.frame()` to turn it into a traditional data frame.
```{r, eval=FALSE}
as.data.frame(data)
```
# Additional topics
Before moving on, it is necessary to mention a few more features of R that will come up in our demonstrations and exercises throughout the week.
## Coercing data to differnet classes
As we will see in the next section, we accomplish various tasks in R by calling functions and supplying various objects as arguments to these functions. In some cases, we may need to quickly convert data to a different object type before providing it as a function argument. We can do this by coercing the data into a new class using functions like `as.factor()`, `as.character()`, and `as.numeric()`.
```{r}
myvector2 <- c(2012, 2012, 2012, 2013, 2013, 2013)
summary(myvector2)
levels(myvector2)
summary(as.factor(myvector2))
levels(as.factor(myvector2))
as.character(myvector2)
```
## Dates in R
In disease surveillance and environmental monitoring, we often need to keep track of the time when a particular data record was collected. In R, there is a special object class for storing date information.
```{r}
today <- Sys.Date()
today
class(today)
```
There are variety of ways to create date objects. One simple way is to use the `as.Date()` function to coerce a text object into a date object. Date objects can also coerced back to text or numeric objects. Later in the workshop, we will use some more advance functions that can convert epidemiological weeks into dates and calculate the day of the year based on a date object.
```{r}
datetxt <- c("2017-12-30", "2018-1-2", "2018-1-3", "2018-1-4")
summary(datetxt)
class(datetxt)
dateobj <- as.Date(datetxt)
summary(dateobj)
class(dateobj)
as.character(dateobj)
as.numeric(dateobj)
```
For working with epidemiological data, we might also want to find out what week or year a date falls in, following the ISO conventions. The package `lubridate` helps us with these conversion from date objects to ISO weeks and ISO years.
```{r}
library(lubridate)
isoweek(dateobj)
isoyear(dateobj)
```
## Missing data - the `NA` symbol
The `NA` symbol is a special value used in R to indicate missing data. When processing and managing data, it is critical that missing data be appropriately flagged as `NA` rather treated as zero or some other arbitrary value. Most R functions have build-in methods for handing missing data, and as we will see in later examples it is often necessary to choose the most appropriate method for a particular analysis. Below are some quick examples.
```{r}
myvector <- c(2, NA, 9, 2, 1, NA)
is.na(myvector)
sum(is.na(myvector))
mean(myvector)
mean(myvector, na.rm=T)
```
# Basic statistical analysis
Functions are also used to carry various types of statistical tests in R. For example, to compute confidence intervals for a particular variable in our data frame, we can call the `t.test()` function. Argument `x` is the data. The default confidence level is 0.95, but we can specify a value for the `conf.level` argument if we want something different.
```{r stats 1}
demo_data <- read_csv(file = "data_day1-2.csv")
t.test(x=demo_data$mal_case)
t.test(x=demo_data$mal_case, conf.level=0.9)
t.test(x=demo_data$mal_case, conf.level=0.99)
```
How does the `t.test()` function know what confidence level to use if we do not specify the `conf.level` argument? In many cases, arguments have a default value. If the argument is no specified in the function call, then the default value is used. When using statistical functions in R, it is particularly important to study the documentation to learn the default values and decide if they are sufficient for the analysis.
```{r stats 1b, eval=FALSE}
help(t.test)
```
To conduct a two-sample t-test, we can call the same function and specify a formula where the variable to be compared is specified to the left of the tilde (`~`) symbol and a variable indicating the group to which each observation belongs is specified on the right.
```{r stats 2}
t.test(mal_case ~ woreda_name, data=demo_data)
```
The following code carries out a two-way analysis of variance of outpatient malaria cases by woreda and year. The `aov()` function take two arguments, a model formula and the data object and generates an object of class aov that contains the results of the analysis. Note that we use the `as.factor()` function to coerce iso_year into a factor. We the supply the aov object as an argument to the `anova()` function to produce a standard analysis of variance table. We can also supply it as an argument to the `TukeyHSD()` function to generate a multiple comparisons object. This object can then be supplied as an argument to the `print()` and `plot()` functions to view the results.
```{r stats 3}
demo_aov <- aov(mal_case ~ woreda_name + as.factor(iso_year), data=demo_data)
class(demo_aov)
anova(demo_aov)
demo_mc <- TukeyHSD(demo_aov)
print(demo_mc)
plot(demo_mc)
```
The following code carries out a linear regression analysis of positive tests for *Plasmodium vivax* only (dependent variable) as a function of positive tests for *Plasmodium falciparum*/mixed (independent variable). We are using a square root transformation to help reduce right skewness of the distribution shape of the case counts. As with the `aov()` function, the `lm()` function takes a formula and a data argument and returns an object belonging to class `lm`. We can then use various other functions to summarize the lm object in various ways and to extract coefficients, fitted values, and residuals.
```{r stats 4, fig.height=3}
demo_lm <- lm(sqrt(test_pv_only) ~ sqrt(test_pf_tot), data=demo_data)
class(demo_lm)
attributes(demo_lm)
summary(demo_lm)
anova(demo_lm)
coef(demo_lm)
library(ggplot2)
qplot(x=fitted(demo_lm), y=resid(demo_lm), geom=c("point", "smooth"))
```
# Day 1 exercises
Run through the demonstration yourself by typing all of the commands above into a script on your computer, running each line of the script, and then examining the output on your computer. Whenever a new function is introduced, use the `help()` function to look it up and examine the various arguments that can be specified. Experiment with changing some of the arguments to see how they affect the output of the function.
Then, to practice what you have learned, try out the following exercises:
The following set of 24 numbers represents *Plasmodium falciparum* cases collected from January 2015 through December 2016 at a health center. Create a vector called `pfcases` containing these numbers.
81 55 79 107 135 210 166 175 186 355 228 126 81 65 53 62 54 88 107 93 115 157 94 54
The following set of 24 numbers represents *Plasmodium vivax* cases collected from January 2015 through December 2016 at the same health center. Create a vector called `pvcases` containing these numbers.
51 43 66 75 89 146 120 99 115 186 150 86 61 50 33 22 45 76 79 70 80 105 74 33
Using these data, write R code to carry out the following tasks:
1. Calculate the sum of all *Plasmodium vivax* cases over the two years.
2. Calculate the mean number of monthly *Plasmodium falciparum* over the two years.
3. Generate a logical vector that indicates which elements of `pfcases` have a value greater than or equal to 100.
4. Generate a vector that contains the number of the month (integer values 1 through 12) corresponding to each element of `pfcases` and `pvcases`.
5. Generate a vector that contains the number of the year (2015 or 2016) corresponding to each element of pfcases and pvcases.
6. Generate a new vector that contains the population corresponding to each element of `pfcases` and `pvcases`, assuming a population of 23,000 in 2015 and 23,500 in 2016.
7. Generate a new vector that contains the total number of malaria cases in each month.
8. Generate a new vector that contains the total malaria incidence (per 1,000 population) in each month.
9. Combine all of the vectors that you have created to make a data frame.
10. Use the `subset()` function to create a new data frame that contains only data for 2016.
11. Use the `lm()` function to carry out a linear regression with pfcases as the dependent variable and pvcases as the independent variable.