---
title: "Learn-R-tutorial"
author: "R-Ladies-Vancouver"
date: "April 2, 2018"
output: 
  html_document:
    keep_md: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Overview

---

Notes from a hands-on 3-hour workshop for participants with no prior programming experience.

Learning materials used in this tutorial were adapted from:

* [Intro to R by Research Bazaar](https://resbaz.github.io/RLadies-Intro-R-workshop/)
* [Software Carpentry - "R for reproducible scientific analysis"](https://swcarpentry.github.io/r-novice-gapminder/)
* [2-hour intro to R by Mine Cetinkaya-Rundel](https://github.com/mine-cetinkaya-rundel/rworkshop-mem)

**Required**

* a laptop
* R and RStudio installed. [Click here](https://github.com/R-Ladies-Vancouver/April2018-Learn-R-Beginner/blob/master/Setup.md) for instructions. 
* we will also be installing the `dplyr` and `ggplot2` packages during the tutorial

**Other resources**

* [Slides](https://github.com/R-Ladies-Vancouver/April2018-Learn-R-Beginner/blob/master/Slides/2018-April_event-presentation.pdf)

* The open-source dataset of passengers on board the Titanic from the [Kaggle website](https://www.kaggle.com/c/titanic). Download the csv file with the dataset [here](https://drive.google.com/open?id=1iK6tiBsb4cabyi7mP5FLCH6LWYs4hOSI). We will use it to learn the `dplyr` and `ggplot2` packages.

**Topics covered**

1. RStudio orientation & good practices for data analysis   
2. Basics of coding in R  
3. Working with functions and packages  
4. Loading and inspecting the titanic.csv dataset
5. Basics of data wrangling with the `dplyr` package  
6. Basics of data plotting with the `ggplot2` package  

---

## What is R and RStudio?

What is R and why learn it? 

* one of the leading languages in statistics and data science
* open-source and free
* powerful for data wrangling and visualization
* large library of tools (packages and functions) for diverse applications (e.g. [Bioconductor for genomic research](https://www.bioconductor.org/), [rOpenGov for social sciences](http://ropengov.github.io/about/))
* an increasing number of organizations use R; find out more in this [StackOverflow post](https://stackoverflow.blog/2017/10/10/impressive-growth-r/)
* with a programming language such as R you can document the process of your data analysis, making it easier for you or another user to reproduce. This is challenging with tools such as an Excel spreadsheet, where calculations are hidden in individual cells.
* R has a highly active, supportive and fast-growing user and developer community. 


What is RStudio?

* powerful and productive user interface for R
* free and open-source. 

---

## Data analysis workflow


As adapted from R for Data Science by Hadley Wickham, the maker of many popular R packages, including the `dplyr` and `ggplot2` packages we will be working with today:

![](https://github.com/R-Ladies-Vancouver/April2018-Learn-R-Beginner/blob/master/Img/2018-April_event-presentation.jpg)


---

## RStudio orientation & best practices for data analysis

This is what you should see when you launch RStudio:

* the interactive console (left); the version of R you are working with is printed there every time you start a new session
* Environment/History (top right)
* Files/Plots/Packages/Help/Viewer (bottom right)

![](https://github.com/R-Ladies-Vancouver/April2018-Learn-R-Beginner/blob/master/Img/RStudio-launch.jpg)

---

## Setting yourself up for reproducible R work

Before we begin writing R code, we need to know where our R work will be located on our computer. To find out, type `getwd()` into the console. This prompts R to print the path to the working directory: the location on your computer where any scripts, plots and data you generate will be saved by default.

```{r }
getwd()

```


It is a good practice to keep all of your data analysis for a particular project in a single location. RStudio comes with a built-in framework called R projects. To create a new project in RStudio:

1. Click the “File” menu button, then “New Project”.
2. Click “New Directory”.
3. Click “Empty Project”.
4. Type in the name of the directory to store your project.
5. Click the “Create Project” button.


A recommendation on how to organize your R project from [Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf), as summarized by [Software Carpentry](https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/):

1. Put each project in its own directory, which is named after the project.
2. Put text documents associated with the project in the doc directory.
3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
5. Name all files to reflect their content or function.
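
The layout above can also be sketched in R itself. This is only an illustration: the project name `my-project` is hypothetical, and the folders are created inside a temporary directory (via `tempdir()`) so that nothing is written into your real working directory.

```{r}
# hypothetical project name; tempdir() keeps this example away from your real files
project_dir <- file.path(tempdir(), "my-project")

# the sub-directories suggested in the list above
sub_dirs <- c("doc", "data", "results", "src", "bin")

for (sub_dir in sub_dirs) {
  # recursive = TRUE also creates project_dir itself the first time round
  dir.create(file.path(project_dir, sub_dir), recursive = TRUE, showWarnings = FALSE)
}

# list the folders that were created
list.files(project_dir)
```

In a real project you would create these folders once, inside the project directory, either with `dir.create()` or from the Files pane in RStudio.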

> Challenge

1. Create an R project called "R-beginner-workshop". Click **File**>**New Project**>**New directory**>**Empty project** 
2. Create a folder called "data"
3. Put `titanic.csv` into the "data" folder
4. Create a folder called "scr"

---

## Basics of R coding


You can clear your console with the keyboard shortcut `CTRL + L`.

We can use R to do simple arithmetic: 

* division `/`
* multiplication `*`
* addition `+`
* exponent `^`
* subtraction `-`

For instance, type the following in your console:

```{r }

5+5

10/5

```
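
The remaining operators work the same way. The short sketch below (the numbers are arbitrary) also shows that R follows the usual order of operations, and that parentheses can be used to change it.

```{r}
2^3          # exponent: 8
7 - 2        # subtraction: 5
3 * 4        # multiplication: 12
2 + 3 * 4    # multiplication happens before addition: 14
(2 + 3) * 4  # parentheses are evaluated first: 20
```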

The answer is printed below each command, preceded by `[1]`. Right now, none of our work exists anywhere besides the console. To keep our commands, we save them in a script.

To create a script:

 1. Click File
 2. Click New File 
 3. Click R script
 4. Give your script a name

> Challenge

* Create a script called "R-beginner-code" in your current project and put it in the `scr` folder
 
---

We can annotate a script with comments using the `#` sign. R ignores anything that follows this sign on the same line. For instance:

```{r }

# to do an addition in R, we do the following
5+5

```

To store results or data we need to assign them a name, using the assignment operator `<-`:

```{r }

# now I tell R to store my height

height_cm <- 172

```

Now, position your cursor on the line of the above command and press `CTRL + ENTER` (Windows) or `COMMAND + ENTER` (Mac). In the Environment pane (top right) you should see the variable `height_cm` and its value appear.

It is not always possible to view what is stored in a variable by looking at your environment. To find out what value is stored in the variable `height_cm`, simply type its name on a new line of the script and run it with `CTRL + ENTER`/`COMMAND + ENTER`.

```{r }

# what is stored in height_cm?

height_cm

```

Note, R is case sensitive. If we were to type `Height_cm` we would get an error. Try it!

```{r }

# R is case sensitive: Height_cm is not the same as height_cm
# uncomment the next line to see the error

# Height_cm

```

Say you wanted to convert your height to inches.

```{r }

# what is my height in inches?

height_cm * 0.39

# oh wait, I want to store this value!

height_inches <- height_cm * 0.39

# what is it again?

height_inches

```

To clear the variables from the environment, we have 2 options:

* run `rm(list = ls())`
* start a new R session, which is essentially a refresh button. In RStudio click "Session" > "Restart R"
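
`rm()` can also remove a single variable rather than everything. A small sketch, using a throwaway variable name (`temp_value`) that is not part of the workshop materials:

```{r}
temp_value <- 42        # create a throwaway variable
exists("temp_value")    # TRUE: the variable is in the environment

rm(temp_value)          # remove just this one variable
exists("temp_value")    # FALSE: it is gone, other variables are untouched
```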

> Challenge

1. Create a variable called `mass_kg` with a value of 50
2. Convert `mass_kg` to mass in pounds (multiply by 2.2) and store the answer in `mass_lbs`


## What are functions and packages?

**Functions:**

* fundamental building blocks of R
* a list of "built-in" R functions (which come with the installation) can be found [here](http://www.sr.bham.ac.uk/~ajrs/R/r-function_list.html)
* functions are used by first specifying their name, followed by `()`, into which we put the arguments that a given function takes.

`log()` is an example of a function to compute a logarithm.

You can also get information on how to use a function in RStudio by typing

`?log()`

This will prompt a Help page to appear in the bottom right pane, with a description of the function, the arguments it takes and some examples of how to apply it.

```{r }

# find out log of height_cm
log(height_cm)

# find out exp of height_cm
exp(height_cm)

```
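
Many functions take more than one argument, and arguments can be passed by name. The sketch below uses arbitrary numbers; `base` and `digits` are the argument names documented on the help pages `?log` and `?round`.

```{r}
log(100)                    # natural logarithm (the default base is e)
log(100, base = 10)         # logarithm with base 10: 2
round(3.14159, digits = 2)  # round to two decimal places: 3.14
```

Naming the arguments makes a call easier to read and means the order in which they are given no longer matters.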


Objects can store more than one value, and these values can be:

* numeric (e.g. 1, 2, -3, 0.2)
* character (e.g. "vancouver", "victoria", "nanaimo")
* logical (TRUE or FALSE)

We will see shortly that an object can also be a large dataset with a variety of data types.

```{r }

# an object with a few numeric values
heights_cm <- c(172, 143, 152, 180)

mean(heights_cm)

sd(heights_cm)

length(heights_cm)

```
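
To check which type of values an object holds, base R provides `class()`. A brief sketch with made-up values (the object names are for illustration only); it also shows that arithmetic is applied to every element of a vector at once:

```{r}
cities <- c("vancouver", "victoria", "nanaimo")  # character values
visited <- c(TRUE, FALSE, TRUE)                  # logical values
example_heights_cm <- c(172, 143, 152, 180)      # numeric values

class(cities)              # "character"
class(visited)             # "logical"
class(example_heights_cm)  # "numeric"

# arithmetic works element by element: each height is converted to inches
example_heights_cm * 0.39
```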


**Packages**:

* Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and (often) sample data. (From: http://r-pkgs.had.co.nz)

* Packages need to be **installed** only once, therefore we don't put these commands into the script. To install a package we use the command `install.packages("package-name")`. Today, we will be working with two packages: `dplyr` for data wrangling and `ggplot2` for data visualization.

* Packages need to be **loaded** every time we start a new R session. To do so we use the `library("package-name")` command, which typically goes at the top of a script.


> Challenge

Install the `dplyr` and `ggplot2` packages.

Type the following into console. 

```{r}
# install.packages("ggplot2")
# install.packages("dplyr")
```

Now load these packages.

```{r}

library("ggplot2")
library("dplyr")
```


## Loading and inspecting the titanic.csv dataset

---

To load a csv file into RStudio we will use the function `read.csv()` and assign our dataset a name (create an object). The result, created in the R environment, is known as a data frame (a type of data structure in R) which has:

* variables of a data set as columns
* observations as rows
* and is rectangular

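```{r}

# dplyr and ggplot2 need to be loaded before we can use their functions
library("dplyr")
library("ggplot2")

# read the csv file from the "data" folder of the project
titanic <- read.csv("data/titanic.csv")

# convert to a tibble: a data frame with a refined print method that shows
# only the first 10 rows and as many columns as fit on the screen
titanic <- as_tibble(titanic)

```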
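
If you do not have the file at hand, the idea of reading a csv file can be tried out on a tiny made-up table. In this sketch the object names (`toy_csv`, `csv_path`, `toy_loaded`) and the values are invented for illustration; the table is written to a temporary file and read back with `read.csv()`.

```{r}
# a tiny made-up table: variables are columns, observations are rows
toy_csv <- data.frame(
  Name     = c("Ann", "Bob", "Cleo"),
  Age      = c(29, NA, 41),
  Survived = c(1, 0, 1)
)

# write it to a temporary csv file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_csv, csv_path, row.names = FALSE)
toy_loaded <- read.csv(csv_path)

class(toy_loaded)  # "data.frame"
dim(toy_loaded)    # 3 rows (observations) by 3 columns (variables)
```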


Before we dive into extracting information from this dataset, it is important to know what variables we are working with. 

There are several functions that provide information about the structure of a data frame: `str()` and `head()` (part of base R) or `glimpse()` (part of the `dplyr` package).

Note the differences in the output in your R console when you run these three functions.

```{r}

head(titanic)
glimpse(titanic)
str(titanic)
```

It is also possible to view a data frame in a spreadsheet-like view. Click on `titanic` in your Environment to check it out.

**What is in the titanic dataset?**


Variable | Definition                                 | Key                                            |
---------|:-------------------------------------------|:-----------------------------------------------|
Survived | Survival                                   | 0 = No, 1 = Yes                                |
Pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
Sex      | Sex                                        |                                                |
Age      | Age in years                               |                                                |
SibSp    | # of siblings / spouses aboard the Titanic |                                                |
Parch    | # of parents / children aboard the Titanic |                                                |
Ticket   | Ticket number                              |                                                |
Fare     | Passenger fare                             |                                                |
Cabin    | Cabin number                               |                                                |
Embarked | Port of embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |


Other useful commands to explore a dataset:

```{r}

names(titanic) # lists the variable names
dim(titanic) # dimensions of the dataset: 891 observations (rows) by 12 variables (columns)
nrow(titanic) # number of observations
summary(titanic) # summary statistics for every column/variable

```

---

## Data wrangling with the `dplyr` package

The `dplyr` package is based on the concept of functions as verbs that manipulate data frames. The core of `dplyr` lies in its 5 main verbs:

* `filter()`: pick rows matching criteria
* `select()`: pick columns by name
* `group_by()`: group rows based on columns
* `summarise()`: reduce groups to values
* `mutate()`: add new variables

Each one of these functions can be performed individually but the real power comes in when we string them together using the pipe (`%>%`) operator. This allows us to perform multiple operations in one hit and return a single tibble with our results at the end.
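
To see what the pipe does, compare the same two steps written with and without it. This is a minimal sketch on a made-up data frame (`toy_people` and its values are invented for illustration, not taken from the titanic data):

```{r}
library("dplyr")

# a tiny made-up data frame
toy_people <- data.frame(
  Name = c("Ann", "Bob", "Cleo", "Dev"),
  Sex  = c("female", "male", "female", "male"),
  Age  = c(29, 35, 41, 8)
)

# without the pipe: functions are nested and read from the inside out
nested <- select(filter(toy_people, Sex == "female"), Name, Age)

# with the pipe: the same steps read from top to bottom
piped <- toy_people %>%
  filter(Sex == "female") %>%
  select(Name, Age)

identical(nested, piped)  # TRUE: both approaches give the same result
```

The pipe passes whatever is on its left as the first argument of the function on its right, which is why the data frame is not repeated inside `filter()` or `select()`.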

Let's explore some of these functions!

### select()

Use the `select()` function to extract specific variables from a data frame. You can store the variables extracted with `select()` in a new tibble should you want a small subset of the data to work on separately.

Here's an example. 

```{r}
# this 'pipes' the titanic dataset to the select command and will print only the age, sex and survived columns
titanic %>%
  select(Age, Sex, Survived)
```

I am happy with the output, so now let's create a separate object containing just these 3 columns (aka variables).

```{r}
gender_survival <- titanic %>%
  select(Age, Sex, Survived)
```

There are various ways to use `select()` function. For instance, we can extract variables that start or end with certain letters or strings.

```{r}

# select columns that begin with A
titanic %>%
  select(starts_with("A"))

# select columns that end with s
titanic %>%
  select(ends_with("s"))

```

> Challenge

Create a subset of the `titanic` data frame called `passenger_info` containing `Name`, `Age` and `Sex`.

### filter()

The function `filter()` extracts data **row wise**. 

An example: Column (variable) **Sex** contains two categories: females and males. Let's extract all of the information (all variables) contained in the `titanic` dataset but only for females

```{r}
# extract all of the variables but only for females
titanic %>%
  filter(Sex == "female")
```

We can also do multiple levels of filtering.

```{r}

# extract all of the variables but only for females older than 20
titanic %>% 
  filter(Sex == "female", Age > 20)

```

The trick with `filter()` is that each condition must evaluate to `TRUE` or `FALSE` for every row; only the rows where all conditions are `TRUE` are kept.
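
To make this concrete, the sketch below runs a comparison on its own and then inside `filter()`. The data frame `toy_ages` and its values are made up for illustration.

```{r}
library("dplyr")

# made-up ages, including one missing value
toy_ages <- data.frame(
  Name = c("Ann", "Bob", "Cleo", "Dev"),
  Age  = c(29, 8, NA, 41)
)

# a comparison returns one TRUE/FALSE (or NA) per row
toy_ages$Age > 20   # TRUE FALSE NA TRUE

# filter() keeps only the rows where the condition is TRUE,
# so the row with the missing age is dropped as well
adults <- toy_ages %>%
  filter(Age > 20)

adults$Name         # "Ann" "Dev"
```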

> Challenge

Create a new object called 'males' containing only male passengers and only the columns Age, Sex, Name and Fare. What dimensions does your object have?

### group_by() and summarize()
 
So far we learnt how to extract pieces of data from the `titanic` dataset, but we still have lots of rows of data. How do we summarize it so that we can find out something useful, say the average age of people who survived the tragedy? This is where two powerful functions come in: `group_by()` and `summarize()`.

To summarize the `titanic` dataset by group, we first specify the grouping variable with the `group_by()` function.

In the example below we specify that the operation will be grouped by Sex for the `titanic` data. Because we have 2 categories in `Sex`, female and male, any calculations we perform after the `group_by()` will produce one value for females and one value for males.


```{r}
 
titanic %>%
  group_by(Sex)

```

Not much will happen when running `group_by()` by itself, although in the console, above the data frame that was printed, we can see a new line: `# Groups: Sex [2]`. To perform an operation we need to use `group_by()` in conjunction with `summarise()`.

With `summarize()` we can use summary functions such as `mean()` and `sd()`, which we explored at the beginning of the workshop, as well as `median()` and `n()` (which counts the rows in each group), or we can perform calculations like we would in a cell of an Excel spreadsheet.

To calculate the mean age for females and males in the `titanic` dataset:

1. Group your data by variable **Sex**
2. Define what to average with the `mean()` function and put this inside the `summarize()`
3. Note, there are some missing values (NA's) in the variable **Age**, we must tell R to ignore NA's in those calculations using `na.rm = TRUE` (i.e. na remove = yes)

```{r}
 
titanic %>%
  group_by(Sex) %>%
  summarize(mean(Age, na.rm = TRUE))
```
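
The effect of `na.rm = TRUE` is easiest to see on a small vector. A minimal sketch with made-up ages:

```{r}
ages <- c(22, NA, 30)       # made-up ages with one missing value

mean(ages)                  # NA: a missing value makes the mean unknown
mean(ages, na.rm = TRUE)    # 26: the NA is dropped before averaging
sum(is.na(ages))            # 1: how many values are missing
```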

Now, let's find out the average age of females and males that survived (1) and died (0) by adding **Survived** into the `group_by()`:

```{r}
 
titanic %>%
  group_by(Sex, Survived) %>%
  summarize(mean_age = mean(Age, na.rm = TRUE))

```

Just like `group_by()`, which can take many variables, `summarize()` can apply many operations at once.

```{r}
 
titanic %>%
  group_by(Sex, Survived) %>%
  summarize(mean_age = mean(Age, na.rm = TRUE),
            stdev = sd(Age, na.rm = TRUE),
            num_of_obs = n())

```

> Challenge 2

1. Calculate the mean, standard deviation and number of observations for the variable Sex in conjunction with passenger class (`Pclass`). Hint: follow the example we just did.

2. Calculate the average Survived value per Pclass and Sex. Which combination of Pclass and Sex had the highest Survived value and which combination had the lowest?


### mutate()

The `mutate()` function allows you to create brand new variables by performing operations on the existing data in the data frame. For example, let's create a new variable called `status` that combines the `Pclass` variable with the `Age` variable.

```{r}
titanic %>%
  mutate(status = Pclass*Age)
```
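
`mutate()` can create several variables in one call, and the new variables do not have to be numeric. The sketch below uses a made-up data frame; the new column names `is_child` and `family_size`, and the age cut-off of 18, are assumptions chosen for illustration.

```{r}
library("dplyr")

# made-up passengers with the same column names as the titanic data
toy_passengers <- data.frame(
  Name  = c("Ann", "Bob", "Cleo"),
  Age   = c(29, 8, 41),
  SibSp = c(1, 0, 2),
  Parch = c(0, 2, 1)
)

toy_with_new_columns <- toy_passengers %>%
  mutate(
    is_child    = Age < 18,            # TRUE for passengers under 18
    family_size = SibSp + Parch + 1    # relatives aboard plus the passenger
  )

toy_with_new_columns
```

Note that `mutate()` returns a new data frame; the original is only changed if you assign the result back to the same name.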

> Advanced dplyr Challenge

Calculate the average Survived value for a group of 20 randomly selected females from each Pclass group. Then arrange the classes in descending order. Hint: Use the dplyr functions arrange() and sample_n(), they have similar syntax to other dplyr functions. Look at the help!

---

## Data plotting with the `ggplot2` package

`ggplot2` is built on the [Grammar of Graphics](http://www.springer.com/gb/book/9780387245447). It is a package created by [Hadley Wickham](http://hadley.nz/).


The building blocks of a ggplot include:  

* a dataset 
* an aesthetic (position (x and y), color, shape, size, linetype, fill)
* a set of geoms, which define how data will be plotted (e.g. `geom_point()` is for scatterplots, while `geom_bar()` is for bar plots)
* statistical transformation (optional)
* scales
* a coordinate system
* faceting (for creating subplots)
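
As a rough sketch of how these building blocks map onto code, the example below labels each layer of a plot. The data frame `toy_plot_data` and its values are made up; the scale, coordinate system and faceting choices are only there to show where each block goes, and the optional statistical transformation is left out.

```{r}
library("ggplot2")

# made-up values, not the real titanic data
toy_plot_data <- data.frame(
  Age  = c(22, 38, 26, 35, 54, 2),
  Fare = c(7, 71, 8, 53, 52, 21),
  Sex  = c("male", "female", "female", "female", "male", "male")
)

p <- ggplot(toy_plot_data,                          # dataset
            aes(x = Age, y = Fare, color = Sex)) +  # aesthetics
  geom_point() +                                    # geom: draw points
  scale_y_log10() +                                 # scale: log10 y axis
  coord_cartesian() +                               # coordinate system (the default)
  facet_wrap(~Sex)                                  # faceting: one panel per sex

p
```

Each block is added to the plot with `+`, so a plot can be built up one layer at a time.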

There is a handy [cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) for using ggplot2.

And a crazy amount of graphs and options you can customise - [check it out!](http://ggplot2.tidyverse.org/index.html)

Examples of geoms:

* `geom_point()` - will create a scatterplot
* `geom_bar()` - will create a bar plot
* `geom_boxplot()` - you get the idea...
* `geom_line()`
* `geom_errorbar()`
* `geom_dotplot()`

## Scatter plot example

Below we make a simple scatterplot of Age versus Fare and colour the points by the passenger's sex.

```{r}
 
titanic %>%
  ggplot(aes(x = Age, y = Fare, color = Sex)) + 
  geom_point()

```

## Dplyr + ggplot

We can mix in some of the `dplyr` verbs we learnt earlier to zoom in on the data of interest within our graph. 

```{r}
 
titanic %>%
  filter(Fare < 200) %>%
  ggplot(aes(x = Age, y = Fare, color = Sex)) + 
  geom_point()

```

## Bar plots

Creating bar plots : `geom_bar()`

```{r}
 
titanic %>%
  ggplot(aes(x = Pclass, fill = Sex)) +
  geom_bar(position = "dodge")

```

## Avoiding overplotting

Visualizing data points with `geom_jitter()`

```{r}
 
titanic %>%
  ggplot(aes(x = Pclass, y = Age, color = Sex)) +
  geom_jitter()

```

You can also try `geom_hex()`, `geom_count()` or `geom_bin2d()`.

## Faceting

Creating multiple plots using `facet_grid()`

```{r}
 
titanic %>%
  ggplot(aes(x = Pclass, y = Age, color = Sex)) +
  geom_jitter() +
  facet_grid(~Survived)

```

---

## Useful Resources: local and remote places to practice R

Cheatsheets: R for data science:

* [for various R tools - RStudio cheatsheets](https://www.rstudio.com/resources/cheatsheets/)  

* [RStudio - IDE](https://github.com/rstudio/cheatsheets/blob/master/rstudio-ide.pdf)  

* [data import](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf)  

* [dplyr](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)  

* [ggplot2](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)

R-Ladies resources: [global repository](https://github.com/rladies/resources)

* Stay tuned for future beginner R meetups with R-Ladies Vancouver!

* [Ecoscope](http://ecoscope.ubc.ca/events/) offers a variety of R workshops for all levels; the next one is on reproducible research, [check out tickets here](http://ecoscope.ubc.ca/events/)

* Check out a series of workshops on [Intro into Data Science](https://www.meetup.com/Women-Who-Code-Vancouver/events/249035344/) by [Women Who Code Vancouver](https://www.meetup.com/Women-Who-Code-Vancouver/)