
# Walkthrough: Tidy Data & dplyr {#tidy}

![](images/banners/banner_tidy_data_dplyr.png)
*This chapter originated as a community contribution created by [akshatapatel](https://github.com/akshatapatel){target="_blank"}*

*This page is a work in progress. We appreciate any input you may have. If you would like to help improve this page, consider [contributing to our repo](contribute.html).*

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(MASS)
class(biopsy)

library(dplyr)
```

## Overview

This walkthrough uses `dplyr` functions to transform the `biopsy` dataset from the `MASS` package and then reshapes the result into a tidy dataset with `tidyr`.

### Packages 

* [`dplyr`](https://www.rdocumentation.org/packages/dplyr){target="_blank"}
* [`MASS`](https://www.rdocumentation.org/packages/MASS/versions/7.3-51.1){target="_blank"}
* [`tidyr`](https://www.rdocumentation.org/packages/tidyr/versions/0.8.2){target="_blank"}


## Installing packages
Run the following statements in the console:

* `install.packages('dplyr')`
* `install.packages('ggplot2')`
* `install.packages('tidyr')`
* `install.packages('MASS')` 

**Note**: The first three packages are a part of the **tidyverse**, a collection of helpful packages in R, which can all be installed using `install.packages('tidyverse')`.

`dplyr` is used for wrangling and transforming data frames. The "d" in "dplyr" stands for "data frame", the most commonly used structure for storing datasets in R.

## Viewing the data
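Let's start by loading the packages so we can access the data as a data frame:
```{r, include=TRUE}
#loading MASS first: MASS also defines select(), and the package
#attached last takes precedence, so dplyr has to come after it
library(MASS)

#loading the dplyr library
library(dplyr)

#the biopsy data ships with MASS
class(biopsy)
```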

```{r import_data}
#glimpse is a part of the dplyr package
glimpse(biopsy)

head(biopsy)
```

## What is Tidy data?
**What does it mean for your data to be *tidy*?**

**Tidy data** follows a standardized format: a consistent way to organize your data in R.

Here's the definition of Tidy Data given by Hadley Wickham:

>A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
>
>* Each variable forms a column.
>
>* Each observation forms a row.
>
>* Each type of observational unit forms a table.
>
*See [r4ds on tidy data](https://r4ds.had.co.nz/tidy-data.html){target="_blank"} for more info.*

**What are the advantages of tidy data?**

* Uniformity: Tools that work with tidy data are easier to learn because they all expect the same consistent way of storing data.

* Vectorization: Most built-in R functions work with vectors of values. Thus, having variables as columns/vectors allows R’s vectorized nature to shine.
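
To make the point about vectors concrete, here is a minimal sketch on a hypothetical three-row data frame (the `scores` table and its values are made up for illustration and are not taken from `biopsy`). Because each variable is a column, one expression can operate on every observation at once:

```{r}
# Hypothetical toy data: each variable is stored as one column (a vector)
scores <- data.frame(thickness = c(5, 3, 8), cell_size = c(1, 4, 10))

# One expression operates on every row at once; no loop is needed
scores$ratio <- scores$cell_size / scores$thickness

# Aggregate functions take a whole column and return a single value
mean_thickness <- mean(scores$thickness)

scores
```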

**Can you tell why this data is messy?**

* Column names such as V1 and V2 do not convey what the columns contain, which is a good sign that the data is untidy.

* The columns V1 through V9 are not really different variables; they are values of a common variable.

Now we will see how to transform our data using `dplyr` functions and then look at how to tidy the transformed data.

## Tibbles

A **tibble** is a modern re-imagining of the data frame.

It is particularly useful for large datasets because it *only prints the first few rows*. Tibbles are also stricter than data frames, which helps you confront problems early and leads to cleaner code.
```{r, include=TRUE}
# Converting a data frame to a tibble
# (tbl_df() is deprecated in favor of as_tibble())
biopsy <- as_tibble(biopsy)
biopsy
```
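
As a small illustration of how a tibble behaves differently from a plain data frame, the sketch below converts a hypothetical two-row data frame (not the biopsy data) and compares single-column subsetting. The names `plain_df` and `as_tb` are made up for this example.

```{r}
library(dplyr)

# Hypothetical toy data frame, unrelated to biopsy
plain_df <- data.frame(ID = c("a", "b"), V1 = c(5L, 3L))
as_tb <- as_tibble(plain_df)

# A tibble is still a data frame, with extra classes on top
class(as_tb)

# Selecting one column with [ , ] turns a data frame into a bare vector,
# whereas the same operation on a tibble returns a tibble
is.data.frame(plain_df[, "V1"])
is.data.frame(as_tb[, "V1"])
```

Keeping the return type predictable is one way tibbles surface mistakes early rather than several steps later.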


## Test for missing values

```{r, include=TRUE}
# Number of missing values in each column in the data frame
colSums(is.na(biopsy))
```

The dataset contains missing values (in column `V6`), which need to be addressed.

## Recode the missing values
One way to deal with missing values is to replace them with the mean of the non-missing values in that column:
```{r, include=TRUE}
#change the NAs in V6 to the mean of the column
biopsy$V6[is.na(biopsy$V6)] <- mean(biopsy$V6, na.rm = TRUE)
colSums(is.na(biopsy))
```
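
The same recoding can probably also be written with `dplyr` itself. The following is a sketch on a hypothetical four-row data frame (`toy` and its values are invented for illustration), using `mutate()` together with `if_else()`:

```{r}
library(dplyr)

# Hypothetical toy data with one missing value in V6
toy <- data.frame(ID = c("a", "b", "c", "d"), V6 = c(1, NA, 10, 4))

# Replace each NA with the mean of the non-missing values;
# as.numeric() keeps both branches of if_else() the same type
toy_imputed <- mutate(
  toy,
  V6 = if_else(is.na(V6), mean(V6, na.rm = TRUE), as.numeric(V6))
)

toy_imputed
```

Unlike the base R assignment, this version leaves `toy` untouched and stores the result in a new object.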

See our chapter on [time series with missing data](missingTS.html) for more info about dealing with missing data.

## Data wrangling verbs
Here are the most commonly used functions that help wrangle and summarize data:

* Rename
* Select
* Mutate
* Filter
* Arrange
* Summarize
* Group_by

The **select** and **mutate** functions manipulate the *variables* (the columns of the data frame). The **filter** and **arrange** functions manipulate the *observations* (the rows of the data), whereas the **summarize** function manipulates *groups* of observations.

All the `dplyr` functions work on a copy of the data and return a modified copy. They do **not** change the original data frame. If we want to access the results afterwards, we need to save the modified copy.

## Rename

The column names in our biopsy data are vague and do not convey the meaning of the values in each column. We need to rename the columns so that the reader gets a sense of what the values refer to.

```{r, include=TRUE}
rename(biopsy,
       thickness = V1, cell_size = V2,
       cell_shape = V3, marg_adhesion = V4,
       epithelial_cell_size = V5, bare_nuclei = V6,
       chromatin = V7, norm_nucleoli = V8, mitoses = V9)
```

The tibble shown above is not saved and cannot be used further. To use it afterwards we save it as a new tibble:
```{r, include=TRUE}
#saving the rename function output
biopsy_new <- rename(biopsy,
                     thickness = V1, cell_size = V2,
                     cell_shape = V3, marg_adhesion = V4,
                     epithelial_cell_size = V5, bare_nuclei = V6,
                     chromatin = V7, norm_nucleoli = V8, mitoses = V9)

head(biopsy_new, 5)
```

The `biopsy_new` data frame can now be used for further manipulation.

## Select
**Select** returns a subset of the data. Specifically, only the columns that are specified are included.

In the biopsy data, we do not require the variables "chromatin" and "mitoses". So, let's drop them using a minus sign:
```{r, include=TRUE}
#selecting all except the columns chromatin and mitoses
biopsy_new<-select(biopsy_new,-chromatin,-mitoses)

head(biopsy_new,5)
```
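
Besides dropping columns with a minus sign, `select()` accepts column ranges and helper functions such as `starts_with()`. Here is a brief sketch on a hypothetical two-row data frame (`toy` is made up; only the column names echo the biopsy data). It calls `dplyr::select()` explicitly because `MASS` also exports a function named `select`, and whichever package is attached last would otherwise win:

```{r}
library(dplyr)

# Hypothetical toy data with biopsy-like column names
toy <- data.frame(
  ID = c("a", "b"),
  cell_size = c(1, 4),
  cell_shape = c(1, 5),
  mitoses = c(1, 2),
  class = c("benign", "malignant")
)

# Keep ID plus every column whose name starts with "cell"
by_prefix <- dplyr::select(toy, ID, starts_with("cell"))

# Keep a contiguous range of columns
by_range <- dplyr::select(toy, ID:cell_shape)

# Drop a single column
without_mitoses <- dplyr::select(toy, -mitoses)

names(by_prefix)
names(without_mitoses)
```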

## Mutate
The **mutate** function computes new variables from existing ones and adds them to the dataset. It surfaces information that the data already contained but did not show directly.

The `bare_nuclei` variable (originally `V6`) holds scores ranging from 1 to 10. To normalize the variable so that its maximum is 1, we can divide it by the column maximum inside the mutate function:
```{r, include=TRUE}
#normalize the bare nuclei values 
maximum_bare_nuclei<-max(biopsy_new$bare_nuclei,na.rm=TRUE)
biopsy_new<-mutate(biopsy_new,bare_nuclei=bare_nuclei/maximum_bare_nuclei)

head(biopsy_new,5)
```
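
The chunk above overwrites `bare_nuclei`. If the original values should be kept, `mutate()` can add the rescaled values under new names instead. The sketch below uses a hypothetical three-row data frame and also shows min-max scaling as an alternative normalization; the column names `bare_nuclei_max` and `bare_nuclei_minmax` are invented for illustration:

```{r}
library(dplyr)

# Hypothetical toy data
toy <- data.frame(ID = c("a", "b", "c"), bare_nuclei = c(1, 10, 4))

toy_scaled <- mutate(
  toy,
  # divide by the maximum, as in the chunk above
  bare_nuclei_max = bare_nuclei / max(bare_nuclei, na.rm = TRUE),
  # min-max scaling maps the smallest value to 0 and the largest to 1
  bare_nuclei_minmax = (bare_nuclei - min(bare_nuclei)) /
    (max(bare_nuclei) - min(bare_nuclei))
)

toy_scaled
```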

## Filter
**Filter** is the row-equivalent function of **select**; it returns a modified copy that contains only certain rows.
This function *filters* rows based on the content and the conditions supplied in its argument. 
The filter function takes the data frame as the first argument. The next argument contains one or more logical tests. The rows/observations that pass these logical tests are returned in the result of the filter function.

For our example, we only want the tumor cells with a clump thickness of six or more, since a plot of clump thickness vs tumor cell size colored by class suggests that most of the malignant tumors are at least this thick:
```{r, include=TRUE}
library(ggplot2)

ggplot(biopsy_new)+
  geom_point(aes(x=thickness,y=cell_size,color=class))+
  ggtitle("Plot of Clump Thickness Vs Tumor Cell Size")

```

```{r, include=TRUE}
#keep only the rows with clump thickness above 5.5 (i.e. six or more)
biopsy_new <- filter(biopsy_new, thickness > 5.5)

head(biopsy_new,5)
```
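
When more than one logical test is supplied, `filter()` combines comma-separated tests with AND; the `|` operator can be used for OR. A short sketch on a hypothetical four-row data frame (`toy` and its values are made up):

```{r}
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  ID = c("a", "b", "c", "d"),
  thickness = c(8, 3, 6, 9),
  class = c("malignant", "benign", "benign", "malignant")
)

# Comma-separated tests must all be TRUE for a row to be kept
thick_malignant <- dplyr::filter(toy, thickness > 5.5, class == "malignant")

# With |, a row is kept if either test is TRUE
thick_or_benign <- dplyr::filter(toy, thickness > 8 | class == "benign")

thick_malignant
thick_or_benign
```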


## Arrange
**Arrange** reorders the rows of the data by the values of one or more columns, in ascending order by default.

Suppose the doctors want to view the data in the order of the cell size of the tumor.
```{r, include=TRUE}
#arrange in the order of V2:cell size
arrange(biopsy_new,cell_size)
```

This shows the data in increasing order of the cell size.

To arrange the rows in decreasing order of cell size, we wrap the variable in the `desc()` function before passing it to arrange.

```{r, include=TRUE}
#arrange in the order of V2:cell size in decreasing order
arrange(biopsy_new,desc(cell_size))
```

As you can see, several rows share the same value of `cell_size`. To break the ties, you can add further variables, which are used for ordering whenever the preceding variables have the same value.

Here, we break ties first by cell shape (in decreasing order) and then by ID:
```{r, include=TRUE}
#arrange by decreasing cell size, then decreasing cell shape, then ID
biopsy_new<-arrange(biopsy_new,desc(cell_size),desc(cell_shape),ID)

head(biopsy_new,5)
```
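
Tie-breaking is easier to follow on a handful of rows. The sketch below applies the same ordering to a hypothetical four-row data frame (`toy` and its values are invented so that ties occur at both levels):

```{r}
library(dplyr)

# Hypothetical toy data with ties in cell_size and in cell_shape
toy <- data.frame(
  ID = c("d", "a", "c", "b"),
  cell_size = c(10, 7, 10, 7),
  cell_shape = c(4, 9, 8, 9)
)

# Largest cell_size first; ties go to the larger cell_shape;
# remaining ties are ordered by ID (ascending)
toy_sorted <- arrange(toy, desc(cell_size), desc(cell_shape), ID)

toy_sorted
```

Rows "c" and "d" tie on `cell_size` and are separated by `cell_shape`, while "a" and "b" tie on both and fall back to `ID`.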

## Summarize & Group By
**Summarize** uses the data to create a new data frame with the summary statistics such as minimum, maximum, average, and so on. These statistical functions must be aggregate functions which take a vector of values as input and output a single value.

The **group_by** function groups the data by the values of the variables. This, along with summarize, makes observations about groups of rows of the dataset.

Suppose the doctors want to see the maximum thickness, the mean cell size, and the variance of the normal nucleoli score for each of the classes: benign and malignant. This can be done by grouping the data by class and computing each statistic per group:
```{r, include=TRUE}
biopsy_grouped <- group_by(biopsy_new,class)
summarize(biopsy_grouped, max(thickness), mean(cell_size), var(norm_nucleoli))
```
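
Unnamed summaries get column names such as `max(thickness)`, which are awkward to refer to later. Naming each summary avoids this, and `n()` adds the group size. A sketch on a hypothetical five-row data frame (`toy`, its values, and the summary names are made up for illustration):

```{r}
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  class = c("benign", "benign", "malignant", "malignant", "malignant"),
  thickness = c(3, 5, 8, 10, 6),
  cell_size = c(1, 3, 8, 10, 6)
)

# Each named argument becomes a column in the result
toy_summary <- summarize(
  group_by(toy, class),
  n = n(),
  max_thickness = max(thickness),
  mean_cell_size = mean(cell_size)
)

toy_summary
```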

## Pipe Operator
What if we want to use the various data wrangling verbs **together**?

This could be done by saving the result of each wrangling function in a new variable and using it for the next function as we did above. However, this is not recommended as:

1. It requires extra typing and longer code.

2. Unnecessary space is used up to save the various variables. If the data is large, this method slows down the analysis.

The **pipe operator** can be used instead for the same purpose. The operator is placed between an object and the function. The pipe takes the object on its left and passes it as the first argument to the function on its right.

The pipe operator is a part of the `magrittr` package. However, this package need not be loaded, as the `dplyr` package makes life simpler and re-exports the pipe operator for us:
```{r, include=TRUE}
biopsy_grouped <- biopsy_new %>% 
  group_by(class) %>% 
  summarize(max(thickness),mean(cell_size),var(norm_nucleoli))

head(biopsy_grouped)
```
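
To see that the pipe only changes how the code reads, the sketch below runs the same three steps on a hypothetical four-row data frame (`toy` and its values are made up), once as nested calls and once as a pipeline:

```{r}
library(dplyr)

# Hypothetical toy data using the original V1/V2 names
toy <- data.frame(
  ID = c("a", "b", "c", "d"),
  V1 = c(8, 3, 6, 9),
  V2 = c(7, 1, 6, 10)
)

# Nested calls have to be read from the inside out
nested <- arrange(
  dplyr::filter(rename(toy, thickness = V1, cell_size = V2), thickness > 5.5),
  desc(cell_size)
)

# The pipeline reads top to bottom in the order the steps happen
piped <- toy %>%
  rename(thickness = V1, cell_size = V2) %>%
  dplyr::filter(thickness > 5.5) %>%
  arrange(desc(cell_size))

all.equal(nested, piped)
```

Both versions produce the same rows in the same order; the piped form also avoids naming any intermediate result.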


## Tidying the transformed data

Have a look again at the messy data:

```{r, include=TRUE}
# Messy Data
head(biopsy_new,5)
```

Planning is required to decide which columns we need to keep unchanged, which ones to change, and what names are to be given to the new columns. The columns to keep are the ones that are already tidy. The ones to change are the ones that aren't true variables but in fact levels of another variable.

So, the `ID` and `class` columns are already tidy. These are kept as is.

The columns `V1:thickness`, `V2:cell_size`, `V3:cell_shape`, `V4:marg_adhesion`, `V5:epithelial_cell_size`, `V6:bare_nuclei`, and `V8:norm_nucleoli` are not true variables but values of the variable `Tumor_Attributes`.

We can fix this with `tidyr::gather()`, which converts data from wide to long format. The `gather` function takes the data frame we want to tidy as its first argument. The next two parameters are the names of the key and value columns in the tidy dataset.

In our example, `key = "Tumor_Attributes"` and `value = "Score"`. You can also specify the columns that should not be gathered, i.e. `ID` and `class`, by prefixing them with a minus sign:
```{r, include=TRUE}
#Tidy Data
library(tidyr)
tidy_df <- biopsy_new %>% gather(key = "Tumor_Attributes", value = "Score", -ID, -class)

tidy_df
```
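
In tidyr 1.0.0 and later, `gather()` is superseded by `pivot_longer()`, although `gather()` continues to work. The sketch below, on a hypothetical two-row data frame (`toy` and its values are made up), shows what should be an equivalent call with the newer function:

```{r}
library(tidyr)

# Hypothetical toy data in wide format
toy <- data.frame(
  ID = c("a", "b"),
  thickness = c(8, 3),
  cell_size = c(7, 1),
  class = c("malignant", "benign")
)

# The approach used in this chapter
long_gather <- gather(toy, key = "Tumor_Attributes", value = "Score", -ID, -class)

# The newer equivalent: names_to/values_to play the role of key/value
long_pivot <- pivot_longer(
  toy,
  cols = -c(ID, class),
  names_to = "Tumor_Attributes",
  values_to = "Score"
)

long_pivot
```

The two results contain the same values but in a different row order: `gather()` stacks the data column by column, whereas `pivot_longer()` works through the input row by row.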

## Helpful Links
- [r4ds on tidy data](https://r4ds.had.co.nz/tidy-data.html){target="_blank"}: It is always best to learn from the source, so a textbook written by Hadley Wickham is perfect.
- [DataCamp dplyr course](https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial){target="_blank"}: This course covers the different functions in dplyr and how they manipulate data.