students/jebone.Rmd

---
title: "jebone"
author: "Jennifer Bone"
date: "January 15, 2016"
output: 
  html_document:
    toc: true
    toc_depth: 2
---
## Content
        
*Can citizen monitoring improve municipal waste managment in Uganda?* 
Specifically, we are interested in developing a data set that automatically updates as new data becomes available.

 
![](images/jebone_muganda.png)

* **More about:** 
    + [citizen science](https://en.wikipedia.org/wiki/Municipal_solid_waste)

## Techniques
        
Data management and visualization
        
## Data
        
Survey data has been collected from various participants in Uganda.

```{r}
d = read.csv('data/jebone_ugandaSMS.csv')
      
summary(d)
```

[**Citizen Monitoring**](http://<organization>.github.io)

##Data Wrangling
```{r,eval=FALSE}
# present working directory
getwd()

# change working directory
setwd('.')

# list files
list.files()

# list files that end in '.jpg'
list.files(pattern=glob2rx('*.jpg'))

# file exists
file.exists('test.png')
```
##install packages
```{r,eval=FALSE}
##Install Packages
# Run this chunk only once in your Console
# Do not evaluate when knitting Rmarkdown

# list of packages
pkgs = c(
  'readr',        # read csv
  'readxl',       # read xls
  'dplyr',        # data frame manipulation
  'tidyr',        # data tidying
  'nycflights13', # test dataset of NYC flights for 2013
  'gapminder')    # test dataset of life expectancy and popultion

# install packages if not found
for (p in pkgs){
  if (!require(p, character.only=T)){
    install.packages(p)
  }
}
```
##utils::read.csv
```{r,eval=FALSE}
d = read.csv('../data/r-ecology/species.csv')
d
head(d)
summary(d)
```
##readr::read_csv
```{r,eval=FALSE}
library(readr)

d = read_csv('../data/r-ecology/species.csv')
d
head(d)
summary(d)
```
##dplry::tbl_df
```{r,eval=FALSE}
library(readr)
library(dplyr)

d = read_csv('../data/r-ecology/species.csv') %>%
  tbl_df()
d
head(d)
summary(d)
glimpse(d)
```
##dplry loosely
```{r,eval=FALSE}
library(readr)
library(dplyr)

read_csv('../data/r-ecology/surveys.csv') %>%
  select(species_id,year) %>%
  filter(species_id == "DM") %>%
  group_by(species_id,year) %>%
  summarize(count=n())

d
head(d)
summary(d)
glimpse(d)
```

##Wrangling Data
```{r}
#install.packages("devtools")
#devtools::install_github("rstudio/EDAWR")
#?storms
#?cases

#gather(cases,"year","n",2:4)
#spread(pollution,size,amount)
#storms2<-separate(storms,date,c("y","m","d"),sep="-")
#unite(storms2,"date", y, m, d, sep="-") 


#install.packages("dplyr")
#install.packages("nycflights123")

#filter(storms,wind>=50)
#mutate(storms,ratio=pressure/wind, #inverse=ratio^-1)
#arrange(storms,desc(wind))
```

## 4. Tidying up Data
```{r}
## data

### EDAWR

```{r EDAWR, eval=F}
install.packages("devtools")
# devtools::install_github("rstudio/EDAWR")
library(EDAWR)
help(package='EDAWR')
?storms    # wind speed data for 6 hurricanes
?cases     # subset of WHO tuberculosis
?pollution # pollution data from WHO Ambient Air Pollution, 2014
?tb        # tuberculosis data
View(storms)
View(cases)
View(pollution)
```

### slicing

```{r traditional R slicing, eval=F}
# storms
storms$storm
storms$wind
storms$pressure
storms$date

# cases
cases$country
names(cases)[-1] #removes the first name
unlist(cases[1:3, 2:4])

# pollution
pollution$city[c(1,3,5)]
pollution$amount[c(1,3,5)]
pollution$amount[c(2,4,6)]

# ratio
storms$pressure / storms$wind
```

## tidyr

Two main functions: gather() and spread() 

```{r tidyr, eval=F}
# install.packages("tidyr")
library(tidyr)
library(dplyr)
?gather # gather to long
?spread # spread to wide
```

### `gather`

```{r gather, eval=F}
cases
gather(cases, "year", "n", 2:4) # if use "-2" instead of 2:4 it will take out the second column

gather(cases, "year", "n", -country)

cases %>%
  gather("year", "n", -country)%>%
  filter(
    year %in% c(2011,2013),
    !country %in% c('FR','US')) 
```

### `spread`

```{r spread, eval=F}
pollution
spread(pollution, size, amount)
```

Other functions to extract and combine columns...

### `separate`

```{r separate, eval=F}
storms
storms2 = separate(storms, date, c("year", "month", "day"), sep = "-") #use dash as the seperator
storms2

storms2 = separate(storms, date, c("year", "month", "day"), sep = c(4,6)) #use c(4,6) to indicate where you want the column separation to occur
storms2

storms%>%
  mutate(date_str=as.character(date))
```

### `unite`

```{r unite, eval=F}
storms2
unite(storms2, "date", year, month, day, sep = "-")
```

**Recap: tidyr**:

- A package that reshapes the layout of data sets.

- Make observations from variables with `gather()` Make variables from observations with `spread()`

- Split and merge columns with `unite()` and `separate()`


## dplyr

A package that helps transform tabular data

```{r dplyr, eval=F}
# install.packages("dplyr")
library(dplyr)
?select
?filter
?arrange
?mutate
?group_by
?summarise
```

See sections in the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf):

- Subset Variables (Columns), eg `select()`
- Subset Observations (Rows), eg `filter()`
- Reshaping Data - Change the layout of a data set, eg `arrange()`
- Make New Variables, eg `mutate()`
- Group Data, eg `group_by()` and `summarise()`

### `select`

```{r select, eval=F}
storms
select(storms, storm, pressure)
storms %>% select(storm, pressure)
```

### `filter`

```{r filter, eval=F}
storms
filter(storms, wind >= 50)
storms %>% filter(wind >= 50)

storms %>%
  filter(wind >= 50) %>%
  select(storm, pressure)
```

### `mutate`

```{r mutate, eval=F}
storms %>%
  mutate(ratio = pressure / wind) %>%
  select(storm, ratio)
```

### `group_by`

```{r group_by, eval=F}
pollution
pollution %>% group_by(city)
```

### `summarise`

```{r summarise, eval=F}
# by city
pollution %>% 
  group_by(city) %>%
  summarise(
    mean = mean(amount), 
    sum = sum(amount), 
    n = n())

# by size
pollution %>% 
  group_by(size) %>%
  summarise(
    mean = mean(amount), 
    sum = sum(amount), 
    n = n())
```

note that `summarize` synonymously works

### `ungroup`

```{r ungroup, eval=F}
pollution %>% 
  group_by(size)

pollution %>% 
  group_by(size) %>%
  ungroup()
```

### multiple groups

```{r multiple groups, eval=F}
tb %>%
  group_by(country, year) %>%
  summarise(cases = sum(cases))
  summarise(cases = sum(cases))
```


## Answers and tasks
### tidy CO<sub>2</sub> emissions

_**Task**. Convert the following table [CO<sub>2</sub> emissions per country since 1970](http://edgar.jrc.ec.europa.eu/overview.php?v=CO2ts1990-2014&sort=des9) from wide to long format and output the first few rows into your Rmarkdown. I recommend consulting `?gather` and you should have 3 columns in your output._

```{r read co2}
# set working directory if has students directory and at R Console (vs knitting)
if ('students' %in% list.files() & interactive()){
    setwd('students' )
}

# ensure working directory is students
if (basename(getwd()) != 'students'){
  stop(sprintf("WHOAH! Your working directory is not in 'students'!\n   getwd(): %s", getwd()))
}

library(readxl)# install.packages('readxl')

url = 'http://edgar.jrc.ec.europa.eu/news_docs/CO2_1970-2014_dataset_of_CO2_report_2015.xls'
xls = '../data/co2_europa.xls'

print(getwd())
if (!file.exists(xls)){
  download.file(url, xls)
}
co2 = read_excel(xls, skip=12)
#View(co2)
names(co2)<-c('country', 1970:2014)
#View(co2)
```

```{r}
library(dplyr)
library(tidyr)
#table transformation
d=co2%>%
  gather('year','emissions', 2:46)%>%
  arrange(country, year)
#View(d)
DT::datatable(d)
```

_**Question**. Why use `skip=12` argument in `read_excel()`?_
It skips the first 12 rows in the excel spreadsheet. The first 12 rows are populated with information that we are not interested in using.

### summarize per capita CO<sub>2</sub> emissions 

You'll join the [gapminder](https://github.com/jennybc/gapminder) datasets to get world population per country.

_**Task**. Report the top 5 emitting countries (not World or EU28) for 2014 using your long format table. (You may need to convert your year column from factor to numeric, eg `mutate(year = as.numeric(as.character(year)))`. As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `arrange`, `desc`, `head`)_. 
```{r top 5 countries 2014}
mutate(d,year=as.numeric(as.character(year)))
twentyfourteen=d%>%
  filter(year %in% c('2014')) %>%
  filter(!country %in% c('World','EU28'))%>%
  arrange(desc(emissions))%>%
  head(n=5)
#View(twentyfourteen)
DT::datatable(twentyfourteen)
```

The top 5 country emitters in 2014 were : China, USA, India, Russia, Japan. 

_**Task**. Summarize the total emissions by country  (not World or EU28) per capita across years from your long format table and return the top 5 emitting countries. (As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `arrange`, `desc`, `head`)_. 
```{r top 5 countries all years}
mutate(d,year=as.numeric(as.character(year)))
f=d%>%
  filter(!country %in% c('World','EU28')) %>%
  group_by(country)%>%
  summarise(TotalEmissions=sum(emissions))%>%
  arrange(desc(TotalEmissions))%>%
  head(n=5)
#View(f)
DT::datatable(f)
```

The top 5 country emitters are : USA, China, Russia, Japan, Germany. 
```{r gapminder, eval=F}
library(gapminder) # install.packages('gapminder')
```
```