Skip to content

Introduction to R: July 28, 2021

Marisa Lim edited this page Jul 29, 2021 · 3 revisions

Introduction to R

Wednesday July 28, 2021 10am-noon PDT/ 1-3pm EDT

Co-Instructors: Abhijna Parigi, Marisa Lim

Moderator: Saranya Canchi

Description:

This free two hour workshop will provide an overview of the basics of R programming language along with RStudio which is a user-friendly environment for working with R. You will be introduced to the syntax, variables, functions, packages along with the various data structures in R. You will also learn basics of data wrangling from reading data from files, subsetting, merging and exporting data out of the R environment.

While we wait to get started --

  1. ✔️ Have you checked out the pre-workshop resources page?

  2. 📝 Please fill out our pre-workshop survey if you have not already done so! Click the survey link here

  3. 💻 Open a web browser - e.g., Firefox, Chrome, or Safari. Open the binder by clicking this button: Binder

Introduction

We are part of the training and engagement team for the NIH Common Fund Data Ecosystem, a project supported by the NIH to increase data reuse and cloud computing for biomedical research.

You can contact us at [email protected]

How to ask questions

If you have questions at any point,

  • Drop them in the chat, or
  • Direct messages to the moderator or helpers are welcome, or
  • Unmute yourself and ask during the workshop

We're going to use the "raise hand" reaction in zoom to make sure people are on board during the hands-on activities.

Some background

What is R?

R is a statistical computing and data visualization programming language.

What is RStudio?

R is often used via integrated development environments (IDE), and RStudio is probably the most popular option. R and RStudio work on Mac, Linux, and Windows operating systems. The RStudio layout displays lots of useful information and allows us to run R code from a script and view and save outputs all from one interface.

  • RStudio panels
  • The 4 panels:
    • View R scripts
    • Console to run code
    • Environment, history and collections
      • Environment: collection of objects, variables, and functions
      • View/load/manage R package list
    • View file system, help docs, view/save plots

Let's customize the layout to see code better!


Terminology

  • Function: A key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.
  • Variable/object: Variables are containers for storing data values
  • Assignment operator: The <- is called an assignment operator and it is an R convention for assigning values to variable names (i.e., in Python it is =).
  • Packages: R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN (the Comprehensive R Archive Network)
  • Library: Often used interchangeably with package. A library is usually a single file containing many functions or objects, and a package can contain libraries and other files.

Our example dataset

We are using sequencing data from a European Nucleotide Archive dataset (PRJEB5348).

This dataset is for an RNA-Seq experiment comparing two yeast strains, SNF2 (mut) and WT. They sequenced 96 samples in 7 different lanes, with 48 wt and 48 mut strains.

So there are 48 biological replicates and 7 technical replicates, with a total of 672 samples!

Loading packages

Let's start by loading the packages we will be using!

Please note!

The Rstudio binder comes pre-loaded with R and the R packages we're using in today's workshop. On your local computer, you need to install R, Rstudio, and the R packages before using them. Here's the code to install R packages:

# install
install.packages('ggplot2')
install.packages('readr')
install.packages('dplyr')

All of these packages belong to a giant package called tidyverse. On your local computer, you can install all these useful packages and more by simply intalling tidyverse.

# install.packages('tidyverse')

We are not doing this today because tidyverse makes the RStudio binder very slow.

Some Useful Info 🏝️

  • Nucleic Acid Conc. = RNA concentration in the sample in units of nanogram per microliter (ng/ul)
  • 260/280 = ratio measured with a spectrophotometer (i.e., Nanodrop) to assess RNA sample purity. For RNA, the ratio should be ~2.0
  • Total RNA = total RNA molecular weight in the sample in units of microgram (ugm)
  • General information about RNA quality for RNA-Seq

The library() function:

library(dplyr)
library(readr)
library(ggplot2)

The notes about masked objects mean that two or more R packages have functions with the same name

  • For example, both stats and dplyr have a functions called filter
  • check them out with ?stats::filter and ?dplyr::filter)

# are used for notes in scripts or "commented out" lines of code that we don't want to run

Reading in the data

Next, we will examine the RNA quality and quantity that the authors measured for all their samples.

We will explore this dataset using some base R and an RStudio package called dplyr.

The data file we're using today is stored in an online repository (https://osf.io/pzs7g/) and it is a tab-delimited .tsv file. To read in the dataset in R, we are using the read_tsv() function from the readr R package.

read_tsv(file = 'https://osf.io/pzs7g/download/')

Functions like read_tsv specify required and optional inputs within ( ).

Let's assign the input data table values to the variable experiment_info.

experiment_info <- read_tsv(file = 'https://osf.io/pzs7g/download/')

You will see a warning message, what does that means?

Warning: 3 parsing failures.
row col   expected    actual                             file
  2  -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
  3  -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
  4  -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'

Warning message:
Missing column names filled in: 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12] 

Now let's look at the data!

# how big is the data set ?
dim(experiment_info)

# examine the first 6 rows across all columns of the data
head(experiment_info)

Tibbles are a type of data table -- a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. data.frame is another type of data table in R.


Data Wrangling with dplyr

Why might you need to wrangle data?

  • Real world data is messy and needs some formatting before you can use it for analysis.
  • Often, you are only interested in a subset of the data and want to avoid loading the entire data set.
  • Some programs/visualizations require data to be in a specific format.

dplyr is a R package that contains sets of tools for dataframe manipulation.

For this workshop we will cover three dplyr functions:

dplyr verb Description
select() subset columns
filter() subset rows matching some condition
mutate() create new columns using information from other columns

select()

There are some empty columns in the experiment data file we have been working with. Let's create a subset of data without those empty columns using the select function.

experiment_info_cleaned <- select(experiment_info,                                   
                            Sample, 
                            Yeast Strain, 
                            Nucleic Acid Conc., 
                            Unit, 
                            A260, 
                            A280,
                            260/280,
                            Total RNA)

# As before, since R cannot parse spaces in column names, we need to enclose them in backticks to indicate that these words belong together.

experiment_info_cleaned <- select(experiment_info,                                   
                            Sample, 
                            `Yeast Strain`, 
                            `Nucleic Acid Conc.`, 
                            Unit, 
                            A260, 
                            A280,
                            260/280,
                            `Total RNA`)

# As a general rule, it is best to avoid column names that start with a number; we can use backticks for this column name.

experiment_info_cleaned <- select(experiment_info,                                   
                            Sample, 
                            `Yeast Strain`, 
                            `Nucleic Acid Conc.`, 
                            Unit, 
                            A260, 
                            A280,
                            `260/280`,
                            `Total RNA`)

# Check the dimensions of the subsetted dataframe
dim(experiment_info_cleaned)

You can also use select to specify the columns you don't want. This can be done with the - notation.

# Remove two columns: A260 and A280
select(experiment_info, -A260, -A280)

Exercise 1

Select the columns Sample, Yeast Strain, A260 and A280 and assign them to a new object called "experiment_data".

Solution

experiment_data <-  select(experiment_info_cleaned, Sample, `Yeast Strain`, A260, A280)
# Check the data
head(experiment_data)

filter()

We learned to subset entire columns. Using the filter function, we can subset based on conditions, using a logical vector with TRUE and FALSE, or with operators.

The filter function works on rows, so we can use it to subset data based on conditions in a column. Here's how you would use the function:

filter(experiment_info_cleaned, `Nucleic Acid Conc.` > 1500)
filter(experiment_info_cleaned, `A260/A280` >= 2.0 & `Nucleic Acid Conc.` > 1500)

Here's a list of operators you can use:

Conditional Subsetting Description
> greater than the value
>= greater than or equal to the value
< less than the value
<= less than or equal to the value
l applies logical operator OR
& applies logical operator AND
== equal to the value
%in% operator used to identify if the value belongs to a vector
!= not equal to the value

mutate()

mutate can be used to create new columns in our data frame using existing data.

mutate(experiment_info_cleaned, 
concentration_ug_uL = `Nucleic Acid Conc.` / 1000) 

Pipes

So far, we have applied the actions individually. What if we wanted to apply select and filter at the same time?

We can use the concept of a pipe which is represented by %>% in R. This is equivalent to | in shell.

Pipes are available via the magrittr package, installed automatically with dplyr. If you use RStudio, the shortcut to produce the pipe is:

  • Ctrl + Shift + M on a PC or
  • Cmd + Shift + M on a Mac

Here's how you would use it:

experiment_info_wt <- experiment_info_cleaned %>% 
  filter(`Yeast Strain` == 'WT' & `Nucleic Acid Conc.` > 1500) %>% 
  select(Sample, `Yeast Strain`, A260, A280)

(when using pipes, the order of the command matters!)

Exercise 2

Create a new data table called exp_info_wrangled and only keep the samples that have 260/280 ratio values < 2.15. Add a new column called total_RNA_in_nano_grams that has total RNA in nano grams. Include the columns sample, yeast strain and 260/280 and Total_RNA_in_nano_grams

Solution

# create data object
exp_info_wrangled<- experiment_info_cleaned %>% 
  filter(`260/280` < 2.15) %>% 
  mutate(total_RNA_in_nano_grams = `Total RNA`*1000) %>% 
  select(Sample, `Yeast Strain`, `260/280`, total_RNA_in_nano_grams) 

# check the data
head(exp_info_wrangled)

==Instructor switch!==

Plotting with ggplot

ggplot and its related packages are a flexible and widely-used plotting R package. You can make standard plots (i.e., scatter, histogram, box, bar, density plots), as well as a variety of other plot types like maps and phylogenetic trees!

It's a very helpful tool for exploring your data/results and generating publication-ready figures.

We're going to create some plots with the dataframe from the previous section. To make it easier to specify column names, let's rename to remove spaces (you could also specify the column names using backticks as we did above):

# current column names
colnames(experiment_info_cleaned)
# new names specified with a vector
colnames(experiment_info_cleaned) <- c('Sample', 'Yeast_strain', 'Nucleic_Acid_Conc.',
                               'ng/ul', 'A260', 'A280', 'A260_280',
                               'Total_RNA', 'ugm')

The ggplot syntax always begins with a function that specifies the data and then layers on additional functions to specify a whole bunch of customizable plot aesthetics. Here, we specify the data object as the experiment_info_cleaned dataframe and set the axes with the mapping= flag.

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) 

RStudio has some really helpful features for finding help docs, and for autofilling. Use the tab key to autofill variable or R package function names for example. Find help docs about functions or R packages with the ? key:

# help docs for ggplot() function
? ggplot 

# help docs for ggplot2 R package
? ggplot2

Let's make a scatter plot by adding geom_point() with a + to the ggplot() function:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) +
  geom_point() +
  ggtitle('Scatter plot')

Notice the indentation after the + sign.

Add transparency to points (also works for other plot types!) with the alpha= option (0=completely transparent, 1=solid color) and change point size with size=:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) +
         geom_point(alpha=0.6, size=4) +
         ggtitle('Scatter with transparent points')

We can modify the code to fill points by yeast strain and make point size correspond to the 260/280 ratio:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc., fill=Yeast_strain,
                   size=A260_280)) +
  geom_point(alpha=0.6, pch=21, col='white') +
  ggtitle('Scatter: color by strain \n and size by 260/280 ratio')

There are 2 strains in this dataset, WT and snf2.

unique(experiment_info_cleaned$Yeast_strain)

Let's make a boxplot by strain:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Yeast_strain, y=Total_RNA)) +
  geom_boxplot() +
  ggtitle('Boxplot')

Boxplots can be more informative if we add points to see the distribution of the data:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Yeast_strain, y=Total_RNA)) +
  geom_point(alpha=0.6) +
  geom_boxplot() +
  ggtitle('Boxplot w/ points')

Exercise 3

  • What happens if you reverse the order of the geom_point() and geom_boxplot() functions for the boxplot code above? Solutions Above, the points are plotted first, then the boxplot. If we reverse the order, the points will be plotted on top of the boxplot.

Exercise 4

  • What happens if you use geom_jitter() instead of geom_point()? Solutions geom_jitter() 'jitters' or spreads the points out so they are not overlapping as much.

More modifications and plot types!

  • Modifying axis label text: xlab() and ylab()
  • Facets: facet_wrap() or facet_grid()
ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc., col=Yeast_strain)) +
  geom_point() +
  facet_wrap(~Yeast_strain) +
  xlab('Total RNA (ugm)') +
  ylab('Nucleic Acid Conc. (ng/ul)') +
  ggtitle('Facet scatter plot')

Exercise 5

  • Try making 2 plots of the A260_280 ratio values: 1) a histogram with geom_histogram() and 2) a density plot with geom_density().

  • Hint: a histogram plots counts on the y-axis so you only need to specify the x-axis variable in the ggplot() function.

  • Bonus: how would you make 1 plot with a histogram for each strain? Solutions Here is the basic histogram code - you may need to modify the bin size:

    ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280))+
      geom_histogram(bins=20) +
      ggtitle('Histogram')

    Here is the basic density plot code:

    ggplot(data=experiment_info_cleaned,
        mapping=aes(x=A260_280)) +
      geom_density() +
      ggtitle('Density plot')

    Extras! Below are more ways to customize these plots.

    You could add colors and plot by strain type like this:

    ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_histogram(bins=20, col='black') +
      ggtitle('Histogram by strain')

    And we could facet the plot by strain like this:

    ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_histogram(bins=20, col='black') +
      facet_grid(Yeast_strain~.) + #this sets whether the facet is horizontal or vertical. here, we will get 2 rows of histograms by yeast strain
      ggtitle('Facet histogram')

    One more edit! Let's change the colors on the histogram. There are many built-in color palettes! Check more out here:

    ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_histogram(bins=20, col='black') +
      facet_grid(Yeast_strain~.) +
      scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) + # Add blue and yellow colors that are more colorblind friendly for plotting
      ggtitle('Histogram w/ custom colors')

    Now, let's switch this to a density plot and change the plot background

    ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_density(alpha=0.6) +
      facet_grid(Yeast_strain~.) +
      scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
      theme_bw() +
      ggtitle('Density plot')

Hopefully this exercise demonstrates how flexible ggplot is!

Save plots

Use the ggsave() function to save images (variety of file options available), for example:

myplot <- ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_density(alpha=0.6) +
      facet_grid(Yeast_strain~.) +
      scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
      theme_bw() +
      ggtitle('Density plot')
ggsave(filename='mydensityplot.jpg', plot=myplot, height=4, width=4, units='in', dpi=600)

:::info Want to save a figure with multiple plots? Check out cowplot!

# install cowplot package
install.packages('cowplot')
# load library
library(cowplot)
# assign plots to variables
hist_plt <- ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280))+
      geom_histogram(bins=20) +
      ggtitle('Histogram')
density_plt <- ggplot(data=experiment_info_cleaned,
        mapping=aes(x=A260_280)) +
      geom_density() +
      ggtitle('Density plot')

# save plot as a jpeg file. note that the syntax is similar to ggsave() but resolution is specified with res=, instead of dpi=
jpeg('histogram_and_density_plt.jpg', height=5, width=7, units='in', res=500)
# arrange the plot objects in a grid, label plots, and specify plots in 1 row
plot_grid(hist_plt, density_plt, labels=c('A', 'B'), nrow=1)
# shut down plotting device
dev.off()

:::

There are many more options to edit plot aesthetics and there are lots of excellent tutorials and resources, including:

:::warning Be sure to save any work/notes you took in the binder to your computer. Any new files/changes are not available when the binder session is closed!

For example, select a file, click "More", click "Export": :::

Wrap up

📝 Please fill out our post-workshop survey! We use the feedback to help improve our training materials and future workshops. Click here for survey!

Questions?

Acknowledgements: This lesson was inspired by material from 2019 ANGUS Intro to R lesson

Resources

Links for Moderator to share:

Schedule

clock time time section
10:00 - 10:15 15 mins intro
10:15 - 11:00 45 mins data types/wrangling
11:00 - 11:45 45 mins plotting
11:45 - 12:00 15 mins wrap up/Q&A

Appendix

Bonus exercises

https://hackmd.io/7rMbbd2oTB6ryx_qQYTlDw?view

More about data types

There are six data types in R:

Data Type Description Examples
Numeric numerical values 1, 1.5, 20, pi
Character letters or words, enclosed in quotes “random”, “2”, “TRUE”, “@”, “#”
Integer for integer numbers; the L indicates to R that it’s an integer) 2L, 500L, -17L
Logical for boolean type data TRUE, FALSE, T, F
Complex numbers with real and imaginary parts 4+ 3i
Raw contains raw bytes encoding
# what type of data structure is this ?
class(experiment_info) 

Some useful commands

:::info There are other useful commands to get attribute information about a dataset.

# how many number of rows ?
nrow(experiment_info)

# how many number of columns ?
ncol(experiment_info)

# what are the column names of your dataset ?
colnames(experiment_info)

# what are the rownames of your dataset ?
rownames(experiment_info)

# what does the last few lines of my dataset look like ?
tail(experiment_info)

# what is the structure of my dataset ?
str(experiment_info)

:::

data.frame is tabular data similar to an excel spreadsheet which can contain mixed data types.

vectors are uni-dimensional data structures with uniform data types.

Let's create a vector from the table we have.

a260_vector <- experiment_info$A260

# confirm the data structure
class(a260_vector)

# get the data type of the vector
typeof(a260_vector)

We can access individual elements within the vector using it's position. Indices start at 1 and use the [] notation.

# obtain the fifth element in the vector
a260_vector[5]

# subset the fifth and tenth element
a260_vector[c(5,10)]

# subset the fifth through tenth elements
a260_vector[c(5:10)]

Another, two dimensional data structure is the matrix. The major difference between matrix and dataframe is that matrix can have only one type of data type i.e. either character or numeric but not both!

Let's create a matrix containing the Nucleic Acid Conc., 260/280 ratio and Total RNA values.

# Create a matrix by subsetting the data frame by selecting the appropriate columns by column names
yeast_strain_matrix <- data.matrix(experiment_info[,c("Nucleic Acid Conc.",
                                                      "260/280",
                                                      "Total RNA")])
# View the data
head(yeast_strain_matrix )

Split-apply-combine

The approach of split-apply-combine allows you to split the data into groups, apply a function/analysis and then combine the results. We can do this using group_by() and summarize()

group_by() takes as argument the column name/s that contain categorical variables that can be used to group the data.

experiment_info_cleaned %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`))
  
# summarize using more than one column
experiment_info_cleaned %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
  mean_total_RNA = mean(`Total RNA`))

Next, we can use arrange() to sort the data in either ascending (defalt order) or descending order.

# arrange new table in ascending mean concentrations
experiment_info_cleaned %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
  mean_total_RNA = mean(`Total RNA`)) %>% 
  arrange(mean_concentration) 
  
# arrange new table in descending mean concentrations 
experiment_info_cleaned %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
  mean_total_RNA = mean(`Total RNA`)) %>% 
  arrange(desc(mean_concentration)) 

We can also examine the number of categorical values using count()

experiment_info_cleaned %>% 
  count(`Yeast Strain`)
Clone this wiki locally