wk04_tidyr.Rmd

---
title: 'Week 4: Tidying up Data'
author: "Ben Best"
date: '`r format(Sys.time(), "%Y-%m-%d %H:%M")`'
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval=FALSE)
```

## Precursors

### Final Project

The instructions for your class group's [Final Project](./#final-project) have been updated.

### Schedule

The class for week 9 on March 4th conflicts with the [Bren Group Project](http://www.bren.ucsb.edu/research/gp_submit.html) presentations, so we'll be extending the classes before and after by 1 hour.

### Invitations

- **organizations**: invite @bbest and @naomitague to your `github.com/<org>` so we can directly _push_ to your site (vs _fork_ & _pull request_)

- **auditors**: email bbest@gmail.com to ensure you get announcements via GauchoSpace

### `setwd()`

As mentioned in [wk03_dplyr](./wk03_dplyr.html#where-am-i?-getting-around-in-the-command-line), the working directory when knitting your Rmarkdown file is always the folder in which it is contained, eg for `env-info/students/bbest.Rmd` the working directory is `env-info/students`. This may be different from your R Console in RStudio which defaults to the working directory to the top level folder of your project, ie `env-info`. To get the two to be the same as you test code in the Console before knitting the Rmarkdown to HTML, I **REQUIRE** you to insert the following R chunk near the beginning (but after the Rmarkdown front matter surrounded by the rows `---`) of your `env-info/students/<user>.Rmd`:

```{r setwd students}
# set working directory if has students directory and at R Console (vs knitting)
if ('students' %in% list.files() & interactive()){
    setwd('students' )
}

# ensure working directory is students
if (basename(getwd()) != 'students'){
  stop(sprintf("WHOAH! Your working directory is not in 'students'!\n   getwd(): %s", getwd()))
}
```

This then ensures that "relative" paths will work the same in your R Console as when knitting your Rmarkdown to HTML. For instance:

```{r}
# absolute: /Users/bbest/github/env-info/students/data/bbest_ports-bc.csv
d = read.csv('./data/bbest_ports-bc.csv')     # ./data is child of students

# absolute: /Users/bbest/github/env-info/data/r-ecology/surveys.csv
d = read.csv('../data/r-ecology/surveys.csv') # ../data is sibling of students
```

The first path uses this folder `.` since that `data` folder is a "child" of the `students` folder, whereas the second path backs up a folder `..` before descending into the other `data` folder that is a "sibling" of the `students` folder.

### Assignment (Individual)

For the data wrangling portion of today, append the header `## 4. Tidying up Data` to your `env-info/students/<user>.Rmd` and include R chunks to run the demo below and give yourself the opportunity to try out possibilities with the code. Please set aside another header below this section `## 4. Answers and Tasks` where you answer questions and perform tasks on applying the functions to the CO<sub>2</sub> dataset as R chunks. Please include the question or task above the R chunk or answer.

You'll find it easiest to copy and paste the demo portion from [`env-info/wk04_tidyr.Rmd`](https://raw.githubusercontent.com/ucsb-bren/env-info/gh-pages/wk04_tidyr.Rmd) but will need to understand this material enough to apply to the questions and tasks.

You will want to synchronize with `ucsb-bren/env-info` (ie _pull request_ `ucsb-bren/env-info` to `<user>/env-info`, _merge_ the pull request in `<user>/env-info`, and _pull_ to update your local machine), in order to successfully knit your `env-info/students/<user>.Rmd`. The Rmarkdown below expects  `env-info/wk04_tidyr/img/data-wrangling-cheatsheet_tidyr.png` and `env-info/data/co2_europa.xls` which are in the updated `ucsb-bren/env-info`.

## data

The R chunks explaining the `dplyr` and `tidyr` functions below are pulled from the excellent [**wrangling-webinar.pdf**](wk03_dplyr/wrangling-webinar.pdf) presentation, which you should consult as you execute (see shortcuts in [rstudio-IDE-cheatsheet.pdf](refs/cheatsheets/rstudio-IDE-cheatsheet.pdf)). 

### EDAWR

```{r EDAWR}
# install.packages("devtools")
# devtools::install_github("rstudio/EDAWR")
library(EDAWR)
help(package='EDAWR')
?storms    # wind speed data for 6 hurricanes
?cases     # subset of WHO tuberculosis
?pollution # pollution data from WHO Ambient Air Pollution, 2014
?tb        # tuberculosis data
View(storms)
View(cases)
View(pollution)
```

### slicing

```{r traditional R slicing}
# storms
storms$storm
storms$wind
storms$pressure
storms$date

# cases
cases$country
names(cases)[-1]
unlist(cases[1:3, 2:4])

# pollution
pollution$city[c(1,3,5)]
pollution$amount[c(1,3,5)]
pollution$amount[c(2,4,6)]

# ratio
storms$pressure / storms$wind
```

```{r dplyr on storms}
# better yet
library(dplyr)

pollution %>%
  filter(city != 'New York') %>%
  mutate(
    ratio = pressure / wind)
```


## tidyr

Two main functions: gather() and spread() 

```{r tidyr}
# install.packages("tidyr")
library(tidyr)
?gather # gather to long
?spread # spread to wide
```

### `gather`

```{r gather}
cases
gather(cases, "year", "n", 2:4)
gather(cases, "year", "n", -1)
```

### `spread`

```{r spread}
pollution
spread(pollution, size, amount)
```

Other functions to extract and combine columns...

### `separate`

```{r separate}
storms
storms2 = separate(storms, date, c("year", "month", "day"), sep = "-")
```

### `unite`

```{r unite}
storms2
unite(storms2, "date", year, month, day, sep = "-")
```

**Recap: tidyr**:

- A package that reshapes the layout of data sets.

- Make observations from variables with `gather()` Make variables from observations with `spread()`

- Split and merge columns with `unite()` and `separate()`

From the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf):

  ![](wk04_tidyr/img/data-wrangling-cheatsheet_tidyr.png)

### tidy CO<sub>2</sub> emissions

_**Task**. Convert the following table [CO<sub>2</sub> emissions per country since 1970](http://edgar.jrc.ec.europa.eu/overview.php?v=CO2ts1990-2014&sort=des9) from wide to long format and output the first few rows into your Rmarkdown. I recommend consulting `?gather` and you should have 3 columns in your output._

```{r read co2}
library(dplyr)
library(readxl) # install.packages('readxl')

# xls downloaded from http://edgar.jrc.ec.europa.eu/news_docs/CO2_1970-2014_dataset_of_CO2_report_2015.xls
xls = '../data/co2_europa.xls'

print(getwd())
co2 = read_excel(xls, skip=12)
co2
```

_**Question**. Why use `skip=12` argument in `read_excel()`?_

## dplyr

A package that helps transform tabular data

```{r dplyr}
# install.packages("dplyr")
library(dplyr)
?select
?filter
?arrange
?mutate
?group_by
?summarise
```

See sections in the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf):

- Subset Variables (Columns), eg `select()`
- Subset Observations (Rows), eg `filter()`
- Reshaping Data - Change the layout of a data set, eg `arrange()`
- Make New Variables, eg `mutate()`
- Group Data, eg `group_by()` and `summarise()`

### `select`

```{r select}
storms
select(storms, storm, pressure)
storms %>% select(storm, pressure)
```

### `filter`

```{r filter}
storms
filter(storms, wind >= 50)
storms %>% filter(wind >= 50)

storms %>%
  filter(wind >= 50) %>%
  select(storm, pressure)
```

### `mutate`

```{r mutate}
storms %>%
  mutate(ratio = pressure / wind) %>%
  select(storm, ratio)
```

### `group_by`

```{r group_by}
pollution
pollution %>% group_by(city)
```

### `summarise`

```{r summarise}
# by city
pollution %>% 
  group_by(city) %>%
  summarise(
    mean = mean(amount), 
    sum = sum(amount), 
    n = n())

# by size
pollution %>% 
  group_by(size) %>%
  summarise(
    mean = mean(amount), 
    sum = sum(amount), 
    n = n())
```

note that `summarize` synonymously works

### `ungroup`

```{r ungroup}
pollution %>% 
  group_by(size)

pollution %>% 
  group_by(size) %>%
  ungroup()
```

### multiple groups

```{r multiple groups}
tb %>%
  group_by(country, year) %>%
  summarise(cases = sum(cases))
  summarise(cases = sum(cases))
```

**Recap: dplyr**:

- Extract columns with `select()` and rows with `filter()`

- Sort rows by column with `arrange()`

- Make new columns with `mutate()`

- Group rows by column with `group_by()` and `summarise()`

See sections in the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf):

- Subset Variables (Columns), eg `select()`

- Subset Observations (Rows), eg `filter()`

- Reshaping Data - Change the layout of a data set, eg `arrange()`

- Make New Variables, eg `mutate()`

- Group Data, eg `group_by()` and `summarise()`


### summarize CO<sub>2</sub> emissions

_**Task**. Report the top 5 emitting countries (not World or EU28) for **2014** using your long format table. (You may need to convert your year column from factor to numeric, eg `mutate(year = as.numeric(as.character(year)))`. As with most analyses, there are multiple ways to do this. I used the following additional  functions: `filter`, `arrange`, `desc`, `head`)_. 

_**Task**. Summarize the **total** emissions by country  (not World or EU28) across years from your long format table and return the top 5 emitting countries. (As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `group_by`, `summarize`, `sum`, `arrange`, `desc`, `head`)_. 


## joining data

Next week, we'll do a bit on joining data.

## References

### Data Wrangling in R

- [Data Wrangling (dplyr, tidyr) cheat sheet]({{ site.baseurl }}/refs/cheatsheets/data-wrangling-cheatsheet.pdf)
- [wrangling-webinar.pdf](wrangling-webinar.pdf)