students/knboysen.Rmd

---
title: "Kristen Boysen (username:knboysen) Site"
author: "Kristen Boysen"
date: "January 15, 2016"
output:   
  html_document:
    toc: true
    toc_depth: 2
---

```{r echo=F}
# set working directory if has child directory
dir_child = 'students' 
if (dir_child %in% list.files()){
  if (interactive()) {  
    # R Console
    setwd(dir_child)
  } else {              
    # knitting
    knitr::opts_knit$set(root.dir=dir_child)  
  }
}
```

## Content
        
I am interested on climate impacts on a large landscape scale, how ecosystems and species will adapt, and how we as managers can help improve the resilience of these environments. 
        
## Techniques

There are several techniques I am looking forward to learning:

1. Dealing with large datasets
2. Using Git to collaborate with other researchers
3. Visualizing and sharing data analytics
        
## Data
        
I did make up some data today during class:
  
```{r echo= FALSE}
impacts<- c(1,1,3, 3,4,5,6, 6, 6, 7)
temp<- c(20,22, 24, 25.5,28, 28, 32, 31, 29, 32)

sea_data<-data.frame(impacts, temp)
colnames(sea_data)<- c("sea_rise", "ave_temp")

model<-lm(sea_rise~ ave_temp, data=sea_data)
plot(sea_data$ave_temp, sea_data$sea_rise, ylab = "Sea Level Rise (m)", xlab="Temp (C)", main="Invented Sea Level Rise with Rising Temperatures")
abline(model, col="red")
```

I do think United National Environmental Programme will have some interesting data, as will NOAA.  
 
## Penguins
Here is a photo of a molting, juvenile Gentoo penguin (*Pygoscelis papua*) on [King George Island](https://www.google.com/maps/place/King+George+Island,+Antarctica/@-62.0757944,-58.8658154,9z/data=!3m1!4b1!4m2!3m1!1s0xbc738f0fdffd7975:0x759c74bafe566d71), Antarctica. I took this photo in February, 2011 while working for NOAA. 

<img src="images/knboysen_gentoo.jpg" width="400px"/>

##Sea Level Rise Data from UNEP

I downloaded data from the United National Environmental Programme's [Environmental Data Explorer](http://ede.grid.unep.ch/) that contains the percent of total land area that is below 5m elevation for each country. Below are summary statistics of that data:

```{r}
# read csv
sea_level_unep_kb <- read.csv("data/knboysen_sea_level_unep.csv")
      
# output histogram
hist(sea_level_unep_kb$Percent_land_under_5m, xlab="Percent Land Below 5m", main="Histogram of UNEP Elevation Data", col ="blue")
```

This data set includes `r nrow(sea_level_unep_kb)` countries. Below is a table of the `r nrow(subset)` countries that are completely (100% of land area) below 5m of elevation
```{r}
subset<- subset(sea_level_unep_kb, Percent_land_under_5m==100)
print(subset[,c(1,3,15)])
```

The mean percent of land under 5m is `r mean(na.omit(sea_level_unep_kb$Percent_land_under_5m))`%. 

##Data Wrangling

Install Packages
```{r, eval=F}
# Run this chunk only once in your Console
# Do not evaluate when knitting Rmarkdown

# list of packages
pkgs = c(
  'readr',        # read csv
  'readxl',       # read xls
  'dplyr',        # data frame manipulation
  'tidyr',        # data tidying
  'nycflights13', # test dataset of NYC flights for 2013
  'gapminder')    # test dataset of life expectancy and popultion

# install packages if not found
for (p in pkgs){
  if (!require(p, character.only=T)){
    install.packages(p)
  }
}
```


##utils::read.csv
Traditionally, you would read a CSV like so:
```{r}
d = read.csv('../data/r-ecology/species.csv')
d
head(d) ##species_ID is a factor
summary(d)
```

readr::read_csv
Better yet, try read_csv:
```{r}
library(readr)

d = read_csv('../data/r-ecology/species.csv')
d
head(d) ##now the specices_ID is a "character". why does this matter, BB?!
summary(d)
```

dplry::tbl_df
Now convert to a dplyr table:
```{r}
library(readr)
library(dplyr)

d = read_csv('../data/r-ecology/species.csv') %>%
  tbl_df()
d
head(d)
summary(d)
glimpse(d)
```

```{r}
b=read_csv('C:/Users/Kristen/Documents/BREN/winter16/env_info/env-info/data/r-ecology/surveys.csv') %>%
  select(species_id, year) %>%
  filter(species_id== "AB") %>%
  group_by(species_id, year) %>%
  summarize(count = n())

```

###elegance with dplry 
week 3 individual assignment

```{r}
library(readr)
library(dplyr)
#read is csv
surveys = read_csv('../data/r-ecology/surveys.csv')

surveys %T>% ##tee operator is good for printing or plotting that wouldn't ouput a return usually
  glimpse() %>%
  select(species_id, year) %>% #selected columns
  filter(species_id == 'NL') %>% ##with specific row entries
  group_by(year) %>%
  summarize(n=n()) %T>%  ##summarize n counts
  glimpse() %>% ##view the table before writing it. it seems to work. 31 "NL" counts in 1977
  write_csv('data/surveys_kboysen.csv') ##hooray it works!
  
  
```

###Wrangling Webinar
####Piping

```{r, echo= F}
library(devtools)
devtools::install_github("rstudio/EDAWR")
library(EDAWR)
```

```{r}
library(tidyr)
library(dplyr)

select(tb, child:elderly) ##verses
tb %>% select(child:elderly)

#yay piping
```


###tidyr
```{r}
library(tidyr)
##cases, a dataset about TB

gather(cases, "year", "n", 2:4) ##cases is the dataframe we are reshaping, "year"/"n" are the new column names, and we are collapsing columns 2-4

##the opposite function is "spread"
#use "pollution" dataset

spread(pollution, size, amount)
#creates a new columns for each size (large & small), and populates it with the "amount"


##separate!
storms2<- separate(storms, date, c("year", "month", "day"), sep = "-") ##separates the date of the storm by the year month day (dash separates the original date) this is very cool

##unite!
unite(storms2, "date", year, month, day, sep = "-") ## put it back together again


```

###Dplyr

```{r}
library(nycflights13)

###dplyr ways to access info
##select() extracts variables
##filter() extracts exisiting observations
#mutate()  derive new variables
##summarise() ##change the unit of analysis

## - selects everything but
## : selects range
##contains, ends_with, everthing(), matches(), num_range(), one_of(), starts_with()

filter(storms, wind>=50, 
       storm %in% c("Alberto", "Alex", "Allison"))

#mutate

mutate(storms, ratio =pressure/wind)
mutate(storms, ratio =pressure/wind, inverse= ratio^-1)
#useful mutate functions

mutate(storms, cummean(pressure)) ##creates new column with cummulative mean
mutate(storms, percent_rank(wind)) ##ranks them 0-1
mutate(storms, dense_rank(wind)) ##ranks them 1(low)-6
mutate(storms, min_rank(wind))

##window functions
## i don't get these


##summarise

pollution %>% summarise(median=median(amount), variance=var(amount))

pollution %>% summarise(mean= mean(amount), sum= sum(amount), n=n())

##arrange() sorts data frame

##desc makes it descending. default is ascending

##piping again

storms %>% filter(wind >=50) 
#vs
filter(storms, wind>=50)

storms %>%
  mutate(ratio = pressure/wind) %>%
  select(storm, ratio)

#ctrl shift m!

# %>% 

```
##Unit of analysis

```{r}
#group by 
pollution %>%  group_by(city)
pollution %>%  group_by(city) %>% summarise(mean= mean(amount), sum= sum(amount), n=n())

pollution %>%  group_by(size) %>% summarise(mean=mean(amount))

pollution %>%  ungroup()

##tb example

```
##joining data

```{r}
y<- data.frame(x1=c("a","b","c"), x2=(c(1,2,3)))
y[] <- lapply(y, as.character)
z<-data.frame(x1=c("b","c","d"),x2=(c(2,3,4)))

z[] <- lapply(z, as.character)

##is this the best way to change factors to characters?
##are we losing info by changes the #s (x2) into characters?
##bind_cols
bind_cols(y,z)
##bind_rows
bind_rows(y,z)
##Union
union(y,z)
##intersect
intersect(y,z)
##setdiff
setdiff(y,z)

##d doesn't show up here? 

left_join(songs, artists, by="name")

inner_join(songs, artists, by="name")
```

##Week 04-- Tidyr
### EDAWR

```{r EDAWR}
#install.packages("devtools")
#devtools::install_github("rstudio/EDAWR")
#library(EDAWR)
#help(package='EDAWR')
#?storms    # wind speed data for 6 hurricanes
#?cases     # subset of WHO tuberculosis
#?pollution # pollution data from WHO Ambient Air Pollution, 2014
#?tb        # tuberculosis data
#Viewd(storms)
#View(cases)
#View(pollution)
```

### slicing

```{r traditional R slicing, eval=F}
# storms
storms$storm
storms$wind
storms$pressure
storms$date

# cases
cases$country
names(cases)[-1]
unlist(cases[1:3, 2:4])

# pollution
pollution$city[c(1,3,5)]
pollution$amount[c(1,3,5)]
pollution$amount[c(2,4,6)]
##single equals sign sets the value, double equals sign searches for things with that value
pollution %>% 
  filter(city != 'New York')

# ratio
storms$pressure / storms$wind
```

## tidyr

Two main functions: gather() and spread() 

```{r tidyr, eval=F}
# install.packages("tidyr")
library(tidyr)
?gather # gather to long
?spread # spread to wide
```

### `gather`

```{r gather, eval=F}
cases
gather(cases, "year", "n", 2:4)
gather(cases, "year", "n", -country)
cases %>% 
  gather("year", "n", -country) %>% 
  filter(year %in% c(2011,2013), 
         country %in% c('FR', 'US'))

## ! flips it all: 
cases %>%
  gather("year", "n", -country) %>% 
  filter(year %in% c(2011,2013), 
         !country %in% c('FR', 'US')) ##this selects all things NOT FR, US :-) 
```

### `spread`

```{r spread, eval=F}
pollution
spread(pollution, size, amount)
##data fram, colmn to use as keys, :amount' fills the cells 
```

Other functions to extract and combine columns...

### `separate`

```{r separate, eval=F}
storms
storms2 <- separate(storms, date, c("year", "month", "day"), sep = "-")

storms %>% 
  mutate(date_str= as.character(date))

storms3 <- separate(storms, date, c("year", "month", "day"), sep = c(4,6))
```

### `unite`

```{r unite, eval=F}
storms2
unite(storms2, "date", year, month, day, sep = "-")
```

**Recap: tidyr**:

- A package that reshapes the layout of data sets.

- Make observations from variables with `gather()` Make variables from observations with `spread()`

- Split and merge columns with `unite()` and `separate()`

From the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf):


### tidy CO<sub>2</sub> emissions
##Assignment 4

_**Task**. Convert the following table [CO<sub>2</sub> emissions per country since 1970](http://edgar.jrc.ec.europa.eu/overview.php?v=CO2ts1990-2014&sort=des9) from wide to long format and output the first few rows into your Rmarkdown. I recommend consulting `?gather` and you should have 3 columns in your output._

```{r read co2, eval=F}
library(readxl) # install.packages('readxl')

url = 'http://edgar.jrc.ec.europa.eu/news_docs/CO2_1970-2014_dataset_of_CO2_report_2015.xls'
xls = '../data/co2_europa.xls'

print(getwd())
if (!file.exists(xls)){
  download.file(url, xls, method='internal')
}
co2 = read_excel(xls, skip=12)
co2
```

_**Question**. Why use `skip=12` argument in `read_excel()`?_
The 'skip-12' command in the read_excel because the first 12 rows of the excel file are metadata-- titles, information about where the data came from etc. 


```{r gather ERROR}
co2_long<- co2 %>% 
  gather("Year", "CO2", 2:46)
 
head(co2_long)
###this code, though it works in R, returns the error "Error in eval(expr, envir, enclos): not compatible with STRSXP calls: <Anonymous>...<Anonymous> -> select_Vars_ -> combine_vars ->.Call

##halp!
```

### summarize CO<sub>2</sub> emissions

_**Task**. Report the top 5 emitting countries (not World or EU28) for 2014 using your long format table. (You may need to convert your year column from factor to numeric, eg `mutate(year = as.numeric(as.character(year)))`. As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `arrange`, `desc`, `head`)_. 

```{r Task 2}
co2_long %>% 
  mutate(Year = as.numeric(as.character(Year))) %>%
  filter(Year==2014) %>%
  filter(Country!="EU28", Country!="World") %>% 
  top_n(5, CO2) %>% 
  arrange(desc(CO2)) %>% 
  head

         #        Country  Year      CO2
          #           (chr) (dbl)    (dbl)
          #          China  2014 10540750
# United States of America  2014  5334530
#                    India  2014  2341897
#       Russian Federation  2014  1766427
#                    Japan  2014  1278922
  
```


_**Task**. Summarize the total emissions by country  (not World or EU28) across years from your long format table and return the top 5 emitting countries. (As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `arrange`, `desc`, `head`)_. 

```{r Task 3}
co2_long %>% 
  filter(Country!="EU28", Country!="World") %>% 
  group_by(Country) %>% 
  summarise(total_emissions=sum(CO2)) %>% 
  arrange(desc(total_emissions)) %>% 
  head()

#                   Country total_emissions
                  #   (chr)           (dbl)
#United States of America       231948899
#                    China       174045927
#       Russian Federation        81242427
#                    Japan        51276329
#                  Germany        43382205
#                    India        39004218

```