day3.Rmd

---
title: |
  | Introduction to R for Disease Surveillance 
  | and Outbreak Forecasting: Day 3
subtitle: Designing data tables
author: |
  | Michael Wimberly, Dawn Nekorchuk and Andrea Hess,
  | Department of Geography and Environmental Sustainability, University of Oklahoma
date: October 24 2018, Bahir Dar, Ethiopia
output:
  pdf_document:
    toc: true
    toc_depth: 3
    number_sections: true
  html_document: default
---

\pagenumbering{gobble}

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Introduction

Today you will get an introduction to creating and manipulating data. We begin to more fully explore two more of  packages of the tidyverse: tibble, dplyr. The tibble package makes creating and viewing data frames easier, while the dplyr packages provides powerful functions for manipulating data frames. 
You will master those concepts to the point where you can work effectively with almost any dataset. You will  learn some useful new tools for working with single tables, including summarizing data by groups, organizing data so that it is *tidy*, and dealing with both explicit and implicit missing values. Next, you will learn to combine datasets using functions that work no two or more tables. 


# Data transformations

You now have a solid background in plotting data with ggplot. But what happens when your data is not already in the correct format for plotting? Or what if you want to plot only a subset of your data?

In this part of the demonstration, you will learn the basic commands for manipulating data frames (and tibbles). Such manipulations can be accomplished in a variety of ways, but the most intuitive and efficient way is by using the functions provided in the **dplyr** package.

dplyr is a member of the tidyverse which focuses exclusively on manipulating data frames.

Start by loading dplyr.

```{r}
library(dplyr)
library(readr)
```

The dplyr package has some excellent vignettes, including an Introduction to dplyr which can be accessed by running `vignette("dplyr")`

## Single table verbs

Each data transformation has a corresponding dplyr function that accomplishes it. These functions take the form of verbs that describe the action. Some dplyr verbs operate on a single data frame or tibble, while others operate on two or more tables. Here is a list of the main dplyr verbs for single tables:

* `filter()` selects observations (rows) based on their values.
* `arrange()` reorders  observations.
* `select()` and `rename()` select variables (columns) based on their names.
* `mutate()` and `transmute()` add new variables that are functions of existing variables.

### Filter

```{r}
mecha <- read_csv("data_mecha.csv")
mecha
filter(mecha, iso_week == 1)
filter(mecha, iso_year %in% 2015:2016)
filter(mecha, mal_case > 200, iso_year != 2016)
```

### Arrange

```{r}
arrange(mecha, iso_week)
```

Use `desc()` to order a column in descending rather than ascending order.

```{r}
arrange(mecha, desc(mal_case))
```

### Select and Rename

Select columns to keep by name. Other columns are removed from the data frame.

```{r}
select(mecha, woreda_name, iso_year, iso_week, mal_case)
```

Use the `:` operator to select a continuous series of columns. The helper functions `starts_with()`, `ends_with()`, and `contains()` can be used to find multiple columns by matching part of the column name.

```{r}
select(mecha, 
       woreda_name:mal_case)
```

```{r}
select(mecha, 
       starts_with("test"))
```

Remove columns by prefixing their names with a `-`. Other columns will be kept.

```{r}
select(mecha, 
       -test_pf_tot, 
       -test_pv_only)
```

Rename columns using the `=` operator, placing the new name first and the old name second.

```{r}
rename(mecha, district = woreda_name, 
       malaria_case = mal_case)
```

### Mutate and Transmute

Add new variables using `mutate()`. You can create multiple new variables at a time, and you can refer to ones you've just created. 

```{r}
mutate(mecha, 
       inc = mal_case / pop_at_risk,
       inc_per_1000 = inc * 1000)
```

If you only want to keep the new variables, use `transmute()`.

```{r}
transmute(mecha, 
       inc = mal_case / pop_at_risk,
       inc_per_1000 = inc * 1000)

```
For the next few exercises, you will use the `data_day3` dataset. Let's begin by loading the dataset again using `read_csv()`.

```{r, message=FALSE}
library(readr)
data <- read_csv("data_day3_subset.csv")
```

# Grouped summaries with `summarize()`

The only main dplyr verb we did not cover yesterday was `summarize()`, which collapses a data frame into a single row:

```{r, message=FALSE}
library(dplyr)
```


```{r}
summarize(data, n_cases = sum(mal_case, na.rm = TRUE))
```

Don't worry about the `na.rm` argument for now. We will cover that below.

`summarize()` becomes particularly useful when we pair it with another dplyr function, `group_by()`, which groups rows of data together to produce a "grouped data frame". When a grouped data frame is summarized, the summaries are for individual groups rather than the entire dataset, and the result is a data frame with one row per group.

```{r}
by_woreda_year <- group_by(data, woreda_name, iso_year)
summarized_cases <- summarize(by_woreda_year, n_cases = sum(mal_case, na.rm = TRUE))
summarized_cases
```

In this example, we calculated the total number of cases observed in each woreda in each year. In this case, we grouped by two variables, but you can group any number of variables you want.

We can also visualize the summaries with a column plot:

```{r, fig.height=3}
library(ggplot2)
ggplot(summarized_cases) +
  geom_col(aes(x = woreda_name, y = n_cases)) +
  facet_wrap(~ iso_year)
```

Or with as a time series plot:

```{r, fig.height=3}
ggplot(summarized_cases) +
  geom_line(aes(x = iso_year, y = n_cases)) +
  facet_wrap(~ woreda_name)
```

Grouped summaries are a powerful tool for data exploration or for transforming data into group summaries for futher analysis or visualization.

## Missing data

In the previous section, you saw the argument `na.rm = TRUE`. That argument told R to remove the missing values, identified by `NA`s, from the data before summarizing.

If you set `na.rm = FALSE`, which is the default value, then `NA`s are not removed. Any vector that includes an `NA` value will yield an `NA` value when summarized. The logic is that if you are missing some elements in a set of values, you can't accurately summarize the set. Thus we see the following behaviors:

```{r}
sum(c(1, 2, 3))
sum(c(1, 2, 3, NA))
sum(c(1, 2, 3, NA), na.rm = TRUE)
```

## Counts

When summarizing, it is often useful to count the number of values used. This can be done with the dplyr function `n()`, which will include `NA` values, or `sum(!is.na(x))` which will exclude them. It's often a good idea to count when summarizing to make sure your summaries are not based on a very small sample size.

```{r}
summarize(by_woreda_year, 
          n_cases = sum(test_pv_only, na.rm = TRUE),
          n_weeks = n())
```

If the only summary you want to perform is counting, you can also use the dplyr function `count()`. This is basically a shortcut to perform a grouped summarize.

```{r}
count(by_woreda_year, woreda_name, iso_year)
```

## Useful summary functions

Up to this point, you've only seen two summary functions: `sum()` and `n()`. Other useful summary functions include:

* Measures of location: `mean(x)` and `median(x)`. The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
* Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`. The mean squared deviation, or standard deviation or sd for short, is the standard measure of spread. The interquartile range `IQR()` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
* Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`. Quantiles are a generalisation of the median. For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
* Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`. These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements).
* Counts: You’ve seen `n()`, which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use `sum(!is.na(x))`. To count the number of distinct (unique) values, use `n_distinct(x)`.

# Working with epidemiological weeks (WHO ISO 8601)

Up until this point we have been working with epidemiological datasets organized by epidemiological week number (or iso week) and epidemiological year (or iso year). We will use a trimmed down version of our full (without missing rows) dataset earlier in the day as an example. 

```{r}
day3 <- read_csv("data_day3.csv")
incid <- select(day3, woreda_name, iso_year, iso_week, pfm_inc, pv_inc)
incid
count(incid, woreda_name)
```

We can see that we have n weeks of data for each woreda. If we wanted to plot this data as a single time series of n weeks, we might do something like this:

```{r}
ggplot(incid) +
  geom_line(aes(x = iso_week, y = pfm_inc)) +
  facet_wrap(~ woreda_name, ncol = 1, scales = "free_y")
```

The problem is that ggplot does not know that we have data from different years, so for each week it plots all years on top of each other and tries to connect them with a line. Unfortunately, the aesthetic mappings in ggplot plot must be 1 to 1, meaning you can't map two columns in the data frame (iso_year and iso_week) as the x axis values.

In this situation we need to add a new column to our dataset that contains properly ordered values to use as the x-axis variable. But what values do we use? We could simple number the rows as 1 to n for each woreda, but then we lose the date-related information on the x axis when we plot it. The solution is to convert from epi weeks to dates, i.e. the Date class you learned about on day 1. Dates contain all the necessary information to be ordered correctly, and they fit in a single column.

Although the lubridate package in R contains many useful functions for working with dates in general, it does not yet include a function to convert from iso weeks and iso years to calendar dates. For that we will need to use a custom function, which we will load (or "source") from another R file. 

```{r}
library(lubridate)
source("date_functions.R")
```

If you look at the Environment tab in RStudio you will see that it now includes a function named `make_date_yw()`. This function takes vectors of iso years, week numbers (1--53), and weekdays (1--7) and returns a vector of Dates. For example:

```{r}
make_date_yw(year = 2017, week = 1, weekday = 1)   # first day of 2017 week 1
make_date_yw(2017, 1, 2)                           # second day of that week
make_date_yw(2017, 1)                              # weekday defaults to 1
make_date_yw(2015:2017, 1:3)                       # arguments can be vectors
make_date_yw(2017, 1:3)                            # arguments are "recycled"
```

Other very useful functions exist in the `lubridate` package. 

```{r}
isoweek(Sys.Date())  # today's ISO week number
isoyear(Sys.Date())  # today's ISO year
```

Because `make_date_yw()` is accepts vectors as arguments, we can use it with `mutate()`.

```{r}
incid_date <- mutate(incid, date = make_date_yw(iso_year, iso_week))
incid_date <- select(incid_date, woreda_name, iso_year, iso_week, 
                     date, pfm_inc, pv_inc) # put the new column after date
incid_date
```

Now when we plot, we can map the x axis value to the date column `date`:

```{r}
ggplot(incid_date) +
  geom_line(aes(x = date, y = pfm_inc)) +
  facet_wrap(~ woreda_name, ncol = 1, scales = "free_y")
```

R automatically adjusts the x-axis labels based on the range of dates. In this case, the major axis lines and labels are on January 1 of each year and the minor axis lines are on June 1. Notice how the labels change when a smaller set of data is used:

```{r, fig.height=3}
# plot a single year of Mecha incidence
ggplot(filter(incid_date, woreda_name == "Mecha", iso_year == 2016)) +
  geom_line(aes(x = date, y = pfm_inc))
# plot a few months of Mecha incidence
ggplot(filter(incid_date, woreda_name == "Mecha", iso_year == 2016, iso_week > 40)) +
  geom_line(aes(x = date, y = pfm_inc))
```

For epidemiological data, of course, it might be more useful to label the weeks by iso week number rather than date. In this case, we can use a new function `scale_x_date()` to modify how the x axis labels appear. `scale_x_date()` takes an argument named labels, which can be defined in several ways (see the help page), but importantly for our purposes it can be given the name of a function which converts Dates to character or numeric values to use as labels. We have just such a function in `isoweek()`!

```{r, fig.height=3}
# plot a single year with iso week labels
ggplot(filter(incid_date, woreda_name == "Mecha", iso_year == 2016)) +
  geom_line(aes(x = date, y = pfm_inc)) +
  scale_x_date(labels = isoweek) +
  labs(x = "ISO Week") # dont forget to change the axis label to match
```

# Tidying data with the tidyr package

The concept of **tidy** data has become increasing popular among R users over the past few years thanks to the tidyr package, and more recently the tidyverse family of packages. Tidy data refers to data tables in which each row is an **observation**, each column is a **variable**, and each cell is a **value**.

Tidy data has three main advantages: it is consistent and therefor easily replicated and easily taught, R's emphasis on vectors works well when variables are organized into columns, and an increasing number of R packages are designed to work on tidy data, e.g. those in the tidyverse, but also dozens of others and more every month.

Unfortunately data is not always stored in tidy format. For example, data is often stored in a format that facilitates data entry. The functions in the tidyr package help you tidy your data so that it may be used for further visualization and analysis.

## Spreading and gathering

Two common problems with un-tidy datasets are when one variable is spread across multiple columns or one observation is spread across multiple rows. In these cases, you can use the two tidyr functions `gather()` and `spread()`.

As an example, let us look at our `incid_data` data again.

```{r}
incid_date
```

Both pfm_inc and pv_inc are values of incidence, the first for *P. falciparum* and mixed infection malaria, the second for *P. vivax* malaria. In that sense, *falciparum* and *vivax* are not variables, but rather values identifying the malaria-causing agent. Both columns contain incidence values, and each row represents two observations, not one.

To tidy a dataset like this we need to **gather** those columns into two new variables, one that holds the name of the malaria agent and the other that holds the incidence.

The names of the columns will become values in a new variable called the **key**, which we will call "agent". The values of the two columns will become a single new *value* variable, which we will call "inc".

```{r}
library(tidyr)
incid_tidy <- gather(incid_date, key = agent, value = inc, 
                     pfm_inc, pv_inc) # these are the columns to gather
incid_tidy
```

Now that we have the dataset in tidy format, we can plot it and "map" the new agent variable to the color aesthetic in the plot.

```{r}
ggplot(incid_tidy) +
  geom_line(aes(date, inc, color = agent)) +
  facet_wrap(~ woreda_name, ncol = 1, scales = "free_y")
```

When observations are scattered across multiple rows, we need to **spread** the data into multiple columns. The `incid_tidy` dataset is already in tidy format, but we can use it as an example for how to use `spread()`:

```{r}
library(tidyr)
spread(incid_tidy, key = agent, value = inc)
```

## Separating and uniting

Sometimes you may find that a single column contains multiple values. In that case, you need to **separate** the two values into multiple columns.

Image we have a dataset where the iso week is provided in year-week format, e.g. "2017-W01". 

```{r}
yw_example <- tibble(yw = c("2016-W01", "2016-W02", "2016-W03"),
                   mal_case = c(2565, 2042, 1803))
yw_example
```


In this case, 2017 is the year and 1 is the week number. You can use `separate()' to put them into their own columns. Here, `col` is the column to separate, `into` is a character vector of the names of the new columns, `sep` is the character string separating the two values, and `convert = TRUE` causes the new columns to be converted from character to numeric format if possible.

```{r}
yw_sep <- separate(yw_example, col = yw, 
                   into = c("iso_year", "iso_week"),
                   sep = "-W", convert = TRUE)
yw_sep
```

The complement of `separate()` is `unite()`, which can be used to **unite** multiple columns into one. 

```{r}
unite(yw_sep, col = "yw", iso_year, iso_week, sep = "-W")
```

## Dealing with missing rows

An important distinction can be made with regards to missing values. Missing values can be missing in one of two ways: **explicitly**, as with `NA` values, or **implicitly**, as when they are simply not present in the data.

To illustrate this point, we will modify our `incid` data created above by removing some of the rows to create implicitly missing values.

To begin with, `incid` has n weeks of data for each woreda.

```{r}
count(incid, woreda_name)
```

Now let us exclude some of the rows using the dplyr function `sample_frac()`, which selects a random fraction of rows to keep and discards the rest. First we will `set.seed()` so that the R's random number generator will yield the same results for everybody.

```{r}
set.seed(1)
incid_incompl <- sample_frac(incid, 0.9)
count(incid_incompl, woreda_name)
```

As you can see, each woreda is missing some weeks of data. To convert these missing weeks from implicitly missing to explicitly missing, we can use the dplyr function `complete()`.

```{r}
complete(incid_incompl, woreda_name, iso_year, iso_week)
```

New rows have been added and `NA`s have been inserted for each variable not named in `complete`, in this case pfm_inc and pv_inc.

But wait! We can see that at least some of the weeks that were added were never in `incid` to begin with. The original dataset before being "completed" began on 2012 week 28. Now it begins on week 1. That's because `complete()` created new rows for every distinct combination of iso_year and iso_week.

What if we only wanted combinations of year and week that were already in the data, i.e. we want the completed dataset to start on 2012 week 28? The answer is to use the dplyr helper function `nesting()`, which causes `complete()` to only add existing combinations of two or more variables, in this case iso_year and iso_week.

```{r}
complete(incid_incompl, woreda_name, nesting(iso_year, iso_week))
```

Now we're back to starting on week 28 and having n rows, the same number we had before we randomly removed rows. And we now have a dataset "complete" with explicitly missing values.

## Combining datasets with dplyr

A common need when working with multiple data files is to combine the data into a single dataset. There are two conceptual ways to do this: **binding** and **joining**.

Binding two tables together matches rows or columns by position. The datasets to be bound need to have the same variables or the rows need to be aligned for binding to work.

Joining two tables together matches rows by value rather than position. The datasets need to have some variable in common between them for joining to work.

### Binding tables

If data have the same columns, you can use `bind_rows()` to "stack" them one on top of the other in a new data frame.

```{r, message=FALSE}
mecha <- read_csv("data_mecha.csv")
fogera <- read_csv("data_fogera.csv")
mf_bind <- bind_rows(mecha, fogera)
mecha
count(mf_bind, woreda_name, iso_year)
```

If data contain different variables that belong to the same observation, you can combine them by row position using `bind_cols()`. For this to work, the two data frames must have the same number of rows, and the same rows in each dataset must belong the the same observation.

```{r, results="hide"}
mecha_mal_case <- select(mecha, woreda_name, iso_year, iso_week, mal_case)
mecha_pf_pv <- select(mecha, c("test_pf_tot", "test_pv_only"))
bind_cols(mecha_mal_case, mecha_pf_pv)
```

## Joining tables

Table joins are a concept shared across many data science disciplines and are implemented in relational database management systems such as MySQL.

With a join, two tables are connected to each other conceptually through variables called **keys**, which are variables found in both tables. For the datasets you've seen so far, possible key columns include woreda, iso_year, or iso_week.

For the following exercises, we will use a new dataset found in `data_ndwi.csv`. This dataset contains satellite-derived values of NDWI, normalized difference water index, an index of vegetation water content (VWC). VWC is related to the abundance of suitable mosquito breeding habitat and is therefore useful when predicting future malaria outbreaks. This dataset is a subset containing years 2016, 2017, and 2018. 

The problem is that the NDWI data is daily, while our epidemiological data is weekly. If we want to connect one to the other, we will need to perform two main tasks: summarize the daily data by epi week, and join the new weekly environmental data to the weekly epidemiological data.


```{r, message=FALSE, fig.height=3}
ndwi <- read_csv("data_ndwi.csv")
glimpse(ndwi)  # like str() but but simpler output
ggplot(filter(ndwi, woreda_name=="Mecha")) +
  geom_line(aes(x = obs_date, y = obs_value))
```

The next step is to the daily data by iso week. Unfortunately, we do not have columns for iso_year and iso_week. 

```{r}

# convert date to epi year and epi week
ndwi <- mutate(ndwi, 
               iso_year = isoyear(obs_date), 
               iso_week = isoweek(obs_date))
ndwi
```

Note that the calendar year in our obs_date column is not always the same as iso_year:

```{r}

filter(ndwi, year(ndwi$obs_date) != iso_year, woreda_name == "Mecha")
```

This is exactly why we must calculate iso_year rather than using calendar year. If we used calendar year, some values would not be grouped correctly when we summarize.

Now we can summarize by epi week.

```{r}
ndwi_wk <- group_by(ndwi, woreda_name, iso_year, iso_week)
ndwi_wk <- summarize(ndwi_wk, 
                     ndwi = mean(obs_value, na.rm = TRUE),
                     n = sum(!is.na(obs_value)))
ndwi_wk
```

However, many of those mean values were estimated using fewer than half the days in the week, and may therefor be poor approximations of the true mean. To be safe, we will exclude summaries calcualted from fewer than 6 days of the week.

```{r}
ndwi_wk <- filter(ndwi_wk, n >= 6)
ndwi_wk
```

At this point, we are ready to join the NDWI data with the epidemiological data.

### Inner join

The first type of join we will perform is called an **inner join**. With this type of join, rows are included in the output only if there are matching key values in both tables.

We will start with our `mecha` dataset and join the `ndwi_wk` dataset using `inner_join()`. Before we do that, let's remove some columns from `mecha` so we can more easily see what is happening when we join.

```{r}
mecha_mal_case <- select(mecha, woreda_name, iso_year, iso_week, mal_case)
```

Because our datasets have three columns in common (woreda, iso_year, iso_week) and it takes all three columns to uniquely identify an observation in one datset or the other, all three columns will serve as our keys.

```{r}
inner_join(mecha_mal_case, ndwi_wk, by = c("woreda_name", "iso_year", "iso_week"))
```

Looking at the resulting table, we can see that the columns in `mecha` are all present in the output, in the same order, followed by the non-key columns from `ndwi_wk`. 

We can also see that the resulting dataset has only n rows. `mecha` had n rows, so some rows must not have had a key match between the two tables. Examining more closely and we can see that YYYY week W was not present in `ndwi_wk` and is also not present in the joined tables.

### Outer joins

It may be perfectly fine to have rows with non-matching keys removed by an inner join. It depends on your goal. However for our demonstration lets assume we want to keep all rows from `mecha`. This is possible using a **left join**. With this type of join, all rows from the table on the "left" side of the join are kept, and missing values are inserted where there is no matching row in the table on the "right" side.

We can perform a left join using `left_join()`, where the first argument is the left table and the second argument is the right table.

```{r}
lj <- left_join(mecha_mal_case, ndwi_wk, by = c("woreda_name", "iso_year", "iso_week"))
lj
```

You can see that the resulting table still has n rows. The first row, for YYYY week W, was missing after the left join. Here it is present, with missing values for the variables ndwi and n, the two that came from the table on the right.

Using `gather()` we can easily plot the joined dataset.

```{r}
ggplot(gather(lj, key, value, mal_case, ndwi)) +
  geom_line(aes(iso_week, value, color = factor(iso_year))) +
  facet_wrap(~ key, ncol = 1, scales = "free_y")
```

# Day 3 exercises

1. Read the `data_2016_2018.csv` dataset.
2. Summarize the data. Calculate the min, max, and mean values of mal_case for each woreda.
3. Now summarize the data by woreda *and* year.
4. Starting with the data_mecha.csv data, create a tidy data frame with columns for iso_year, iso_week, and malaria agent (P. falciparum vs. P. vivax). Unnecessary columns can be removed.
5. Plot this tidy data frame using colors for malaria agent and facets for years.
6. Create a data frame that contains both the daily *and* weekly NDWI data for a woreda and year (2016 - 2018) of your choice. Plot the values on the same plot, with Date on the x axis and NDWI on the y axis. Use either point or line geoms and choose aesthetics to help differentiate the daily and weekly data.
7. Create a data frame that contains both the daily and weekly NDWI data for a woreda and year of your
choice. Plot the values on the same plot, with Date on the x axis and NDWI on the y axis. Use either
point or line geoms and choose aesthetics to help differentiate the daily and weekly data.