-
Notifications
You must be signed in to change notification settings - Fork 0
An introduction to R
License
fchuffar/intro_r
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
--- title: "R introduction" author: "Florent Chuffart" date: "`r Sys.Date()`" output: rmarkdown::html_document: toc: true toc_float: true toc_depth: 3 number_sections: true --- This R session is fully available on https://github.com/fchuffar/intro_r # Purpose This first R working session will be an introduction to your favourite statistical programming language: R. After a brief introduction to basics (user interface paradigms, free software and GNU), we will go deeply into the details of data structures, data indexing and data visualisation. The main purpose of this session is to offer a not only simple but also powerful alternative to your classical spreadsheet software. This introduction gives you the keys to perform large scale statistical data analysis. This introduction follows three main goals: * deploy a fonctionnal working environment for statistical analysis on personnal laptop of participants * design a first script allowing participants to import into R environment data from classical spreadsheet * define common ground of a discussion between life scientists and data analysts # Prerequisistes Chaque participant apportera son ordinateur portable sur lequel aura été installé les logiciels R et Rstudio : * Installer R (macosx https://cran.r-project.org/bin/macosx/ ) (windows https://cran.r-project.org/bin/windows/base/ ) * Installer RStudio Desktop Open Source License https://www.rstudio.com/products/rstudio/download/ Le contenu de la session est disponible ici : https://github.com/fchuffar/intro_R Il doit être téléchargé en cliquant sur le bouton vert en hautr à droite "Clone or Download", puis "Download ZIP". La première séance se déroulera intégralement dans cette espace de travail. # R Software ## GNU Public Licence R is a free software environment for statistical computing and graphics. More details about GNU, GPL and free software here: - https://www.gnu.org/philosophy/free-sw.html - https://en.wikipedia.org/wiki/GNU_General_Public_License - https://en.wikipedia.org/wiki/Free_software_movement - https://en.wikipedia.org/wiki/CeCILL ## R and biology R is widely used in the life science community: - https://www.ncbi.nlm.nih.gov/pubmed/?term=r+software ## R Community Many packages are freely available on: - CRAN https://cran.r-project.org - Bioconductor https://www.bioconductor.org - GitHub https://github.com R users are organised as a community with many help topics available on the internet. Many tutorial are available online: - http://revue.sesamath.net/IMG/pdf/RCarte_Commandes-R.pdf - https://cran.r-project.org/doc/contrib/Kauffmann_aide_memoire_R.pdf - https://cran.r-project.org/doc/contrib/Paradis-rdebuts_fr.pdf # R CLI vs. GUI ## CLI R could be compared to an electronic calculator. It provides not only basic operations but also advanced features allowing to manipulate distributions and visualize data. ```{r} 1+1 2+2 pi rnorm(100) plot(density(rnorm(100))) lines(density(rnorm(100, sd=2)), col=2) lines(density(rnorm(100, sd=0.3, mean=1)), col=3) ``` ## R GUI (RStudio) *** https://www.rstudio.com # Let's start with R! ## Working directory ```{r, eval=FALSE} getwd() # this command give the help on... # ?getwd() # ?rnorm # ?setwd() ``` ## Organisation of your workspace - data - results - src - doc ## From a spreadsheet to R data.frame *** ```{r} #data_sh = read.table("data/seahorse_dataset2.csv", sep=";", header=TRUE, dec=".") data_sh = read.table("data/sp_exp_grp.csv", sep=" ", header=TRUE, dec=".") data_sh["GSE21749_P_WT_1",] data_sh[,1] head(data_sh) tail(data_sh) dim(data_sh) summary(data_sh) d = read.table("data/data_nutri.csv", header=TRUE, sep=",", row.names = 1) head(d) tail(d) dim(d) summary(d) data_sh[1,] data_sh[,1] ``` # Data ## Affectation ```{r} foo <- 1 foo bar = "a" bar = 'a' bar # bar = a a = 4 a bar = a bar "b" -> baz baz quz <<- baz ``` ## Common types - numeric - character - logical (boolean) - factor ```{r} 1 is.numeric(1) a=1 is.numeric(a) a="nom" a is.numeric(a) 0.5 is.numeric(0.5) is.numeric("a") foo = 1 is.numeric(foo) is.integer(1) a = as.integer(1) a is.integer(a) as.integer(0.3) as.integer(TRUE) as.integer(FALSE) as.numeric(TRUE) as.numeric(FALSE) "iab" 'toto' is.numeric("a") is.character("a") is.character(1) as.character(1) as.numeric("a") conc = c(0.1, 0.2, 0.3, "probleme de manip", 0.1) conc = c(0.1, 0.2, 0.3, "probleme de manip", 0.1) conc as.numeric(conc) conc2 = c(0.1, 0.2, 0.3, NA, 0.1) conc2 mean(conc2) mean(conc2, na.rm=TRUE) NaN log(0) 1/0 # logicals TRUE FALSE T F T = 3 T = FALSE T # TRUE = 3 # NA = 3 is.numeric(TRUE) as.numeric(TRUE) as.numeric(FALSE) as.logical(1) as.logical(0) as.logical(2) as.logical(-2) as.logical(-2.2) is.logical("true") is.character("true") other_var = as.logical("true") other_var as.logical("True") # factors col_y = c("bleu", "bleu", "vert", "marron", "vert", "marron", "blue", " blue") col_y f = as.factor(col_y) f col_y = c("bleu", "bleu", "vert", "marron", "vert", "marron", "bleu", "bleu") f = as.factor(col_y) f as.numeric(f) ``` ## Set of data | | Homogeneous | Heterogeneous | |----|-------------|---------------| | 1d | vector | list | | 2d | matrix | data.frame | a data.frame is a list of same size vectors (colums). ```{r} 1:5 c(1,2,10) c(1,"a",10) c(TRUE,0) v = rnorm(100) v v[c(5,18,39)] v[98:100] v[c(1,2,3)] v[1:3] h = v v v[c(1,2,3)] list(1,"a",10) l = list(chr="8", tx_start=124330091, tx_end=124410705, gs="ATAD2", ref="hg19", strand="-") l l$chr c(tx_start=124330091, tx_end=124410705) # matrix m = matrix(runif(120),nrow=20) dim(m) m m+1 m[20,6] m[19:20,5:6] m[c(17,20),] m[,c(1,2)] m[,c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)] data = data.frame(size=rnorm(25)+175, gender=sample(c("M", "F"), 25, replace=TRUE)) data[1:5,] head(data) data$gender data[1,2] = "m" data[1,2] data[1,2] = "M" data[data$gender=="M",] data[data$gender=="M" & data$size > 170,] factor(sample(c("10", "J", "Q", "K", "A"), 25, replace=TRUE)) factor(sample(c("10", "J", "Q", "K", "A"), 25, replace=TRUE), levels=c("10", "J", "Q", "K", "A"), ordered=TRUE) ``` ## Selecting lines and columns of a dataset *** ```{r} v = 100:130 v v[1] v[-1] rev(v)[1] v[1:4] # index by the rank v[c(1, 4, 10, 15)] # index by mask col_y col_y[c(TRUE,FALSE, TRUE, TRUE, FALSE, TRUE)] ## select a column # by name data_sh$id # by rank data_sh[,2] data_sh[,2:6] data_sh[,c(FALSE,TRUE,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE, FALSE,FALSE,FALSE)] ``` # Visualization ```{r, eval=FALSE} hist(data$size) abline(v=mean(data$mean)) boxplot(size~gender, data) qnorm(data$size) qqline(data$size) table(data$gender) # scatter plot barplot(data$gender) # b, l, t, r par(mar=c(10,1,1,1)) ``` # Tests ## Dataset ```{r} # dataset nutri data_nutri = read.table("data/data_nutri.csv", sep=",", header=TRUE, row.names=1) head(data_nutri) tail(data_nutri) dim(data_nutri) # Correlation between taille and poids plot(data_nutri$taille, data_nutri$poids, col=data_nutri$sexe) # create index for categ. idx_h = data_nutri$sexe == "Homme" idx_f = data_nutri$sexe == "Femme" # plot density of each categ. plot(density(data_nutri$taille[idx_h]), col=2, xlim=range(data_nutri$taille)) lines(density(data_nutri$taille[idx_f])) boxplot(taille~sexe, data=data_nutri) library(beanplot) beanplot(taille~sexe, data=data_nutri) library(beeswarm) beeswarm(taille~sexe, data=data_nutri) ``` ## Parametric ```{r} # test de Shapiro-Wilks # H0: normality # H1: non normality # alpha: 5% norm_pop = rnorm(100) plot(density(norm_pop)) shapiro.test(norm_pop) # pval = 0.439 > alpha = 5% # on conserve H0 au risque beta à calculer... unif_pop = runif(100) plot(density(unif_pop)) test = shapiro.test(unif_pop) test$p.value # pval = 0.0001 < alpha = 5% # on rejette H0, on accepte H1 au risque de première espèce alpha = 5% taille_h = data_nutri$taille[idx_h] shapiro.test(taille_h) # pval = 9.7% > alpha = 5% # on conserve H0 au risque beta à calculer... beta contrôlé dans le cas de shapiro.test. taille_f = data_nutri$taille[idx_f] shapiro.test(taille_f) # pval = 11% > alpha = 5% # on conserve H0 au risque beta à calculer... beta contrôlé dans le cas de shapiro.test. # Homoscedasticity # H0: homoscedasticity # H1: non homoscedasticity # alpha: 5% pop1 = rnorm(100, 0, 1) pop2 = rnorm(100, 10, 1) plot(density(pop1), xlim=c(-5,15)) lines(density(pop2), col=2) # homogénéité des variances bartlett.test( c(pop1, pop2), c(rep("a", 100), rep("b", 100)) ) # pval > 5% donc je conserve H0 au rosque beta à calculer pop1 = rnorm(100, 0, 0.1) pop2 = rnorm(100, 10, 1) plot(density(pop1), xlim=c(-5,15)) lines(density(pop2), col=2) # non homogénéité des variances bartlett.test( c(pop1, pop2), c(rep("a", 100), rep("b", 100)) ) # pval < 5% donc je rejette H0 et j'accepte H1 au risque alpha de 5%. bartlett.test(data_nutri$taille, data_nutri$sexe) # pval = 16% donc je conserve H0 au seuil beta à calculer. # Student test, t test, # H0: t_h == t_f # H1: t_h != t_f # alpha: 5% t.test(taille_h, taille_f, conf.level = 0.95) # pval = 2.2e-16 << 5% on rejette H0 et on accepte H1 au seuil alpha de 5%. # H0: t_h == t_f # H1: t_h > t_f # alpha: 5% t.test(taille_h, taille_f, conf.level = 0.95, alternative="greater") t.test(taille_h[1:10], taille_f[1:10], conf.level = 0.95, alternative="greater") t.test(taille_h[1:10], taille_f[1:10], conf.level = 0.95, alternative="two.sided") # pval = 2.2e-16 << 5% on rejette H0 et on accepte H1 au seuil alpha de 5%. # H0: t_h == t_f # H1: t_h < t_f # alpha: 5% t.test(taille_h, taille_f, conf.level = 0.95, alternative="less") # pval = 1 >> 5% on conserve H0 au risque beta, à calculer... ``` # Non parametric ```{r} # H0: t_h == t_f # H1: t_h != t_f # alpha: 5% wilcox.test(taille_h[1:10], taille_f[1:10], alternative = "two.side") # Mann Withney # pval = 0.0001 < alpha = 5% donc je rejette H0 et j'accepte H1 au seuil alpha de 5% # H0: t_h == t_f # H1: t_h > t_f # alpha: 5% test = wilcox.test(taille_h[1:10], taille_f[1:10], alternative = "greater") # Mann Withney # pval = 0.00008 < alpha = 5% donc je rejette H0 et j'accepte H1 au seuil alpha de 5% if (test$p.value < 0.001) { label = paste(signif(test$p.value,2), "***") } else if (test$p.value < 0.01) { label = paste(signif(test$p.value,2), "**") } else if (test$p.value < 0.05) { label = paste(signif(test$p.value,2), "*") } else { label = signif(test$p.value,2) } plot(density(data_nutri$taille[idx_h]), col=2, xlim=range(data_nutri$taille)) lines(density(data_nutri$taille[idx_f])) text(165,0.04, labels=label) boxplot(data_nutri$taille~data_nutri$sex, las=2) text(1.5,180, labels=label) ``` ## Empirical p-value ```{r} stat_obs = wilcox.test(unique(taille_h)[1:5], unique(taille_f)[1:5], alternative = "two.side")$statistic # Mann Withney stat_rnds = sapply(1:10000, function(i) { rnd_pop = sample(c(unique(taille_h)[1:5], unique(taille_f)[1:5])) stat_rnd = wilcox.test(rnd_pop[1:5], rnd_pop[6:10], alternative = "two.side")$statistic # Mann Withney return(stat_rnd) }) pval_emp = (sum(stat_rnds > stat_obs) + 1)/length(stat_rnds) print(pval_emp) plot(density(stat_rnds)) abline(v=stat_obs) ``` # Statistical test ```{r, eval=FALSE} plot_null = function(n, mean=0, sd=1, thresh=TRUE, ...){ foo = sapply(1:10000, function(i) { bar = rnorm(n, mean, sd) m = mean(bar) s = sd(bar) c(mean=m, sd = s) }) foo = data.frame(t(foo)) t = foo$mean / (foo$sd / sqrt(10)) lines(density(foo$mean), ...) if (thresh) abline(v=quantile(foo$mean, c(0.025,0.975)), lty=2, ...) # lines(density(t), lty=2, ...) # lines(density(rt(10000, n-1)), col="grey", lty=3) } layout(matrix(1:2, 1), respect=TRUE) plot(0,0,col=0,xlim=c(-1,1), ylim=c(0,2)) plot_null(10) plot_null(10,mean=0.25, thresh=FALSE, col=4) # plot_null(20, col=2) plot(0,0,col=0,xlim=c(-1,1), ylim=c(0,2)) plot_null(30) plot_null(30,mean=0.25, thresh=FALSE, col=4) # plot_null(20, col=2) # H0: m == 0 # H1: m != 0 # alpha = P(reject H0 | H0 is true) # beta = P(accept H0 | H0 is false) ``` # To go further with R ## Scripts ```{r, eval=FALSE} source("intro_r.R") ``` ## Functions ### Common functions ```{r} x = c(1,2,3,4,5) y = c(2,4,6,8,10) length(x) plot(x,y) rnd = runif(100000) dices = round(rnd * 6) table(dices) barplot(table(dices)) dices = floor(rnd * 6) barplot(table(dices)) dices = ceiling(rnd * 6) barplot(table(dices)) pie(table(dices)) mat_5_dices = matrix(dices,5) dim(mat_5_dices) mat_5_dices[,1:10] sum_5_dices = apply(mat_5_dices, 2, sum) table(sum_5_dices) barplot(table(sum_5_dices)) prop.table(table(sum_5_dices)) prop.table(table(sum_5_dices)) sum(prop.table(table(sum_5_dices))) ``` ```{r, eval=FALSE} ?ls() ?paste() ?sum() ?cumsum() ?sqrt() ?rnorm() ?qnorm() ?pnorm() ?cbind() ?rbind() ?sort() ?order() ?grep() ?read.table() ?write.table() ``` ### Custom functions - signature (parameter) - return - scope and global variables ```{r} hello_world = function(name="world") { ret = paste("Hello ", name, "!", sep="") return(ret) } hello_world() ``` # Practical session ## SRSF2 and SOX2 signatures in squamous cell lung carcinomas (data set suggested by Beatrice Eymin) We dispose of 5 Affymetrix Human Genome U133A Array probe values for 130 samples These 5 probes correspond to 2 genes: - 200753_x_at (SRSF) - 200754_x_at (SRSF) - 214882_s_at (SRSF) - 213721_at (SOX2) - 213722_at (SOX2) We also dispose of clinical data for the 130 samples. i) We want to check the correlation between probes ii) We want to analyse the expressin of these genes according cancer stage ```{r} # loading data data = read.csv("data/GSE4573.csv", row.names=1) head(data) dim(data) # correlation between probes pairs((data[,1:5])) pairs2 = function (...) { panel.cor = function(x, y, digits = 2, prefix = "", cex.cor, ...) { usr = par("usr") on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r = abs(cor(x, y)) txt = format(c(r, 0.123456789), digits = digits)[1] txt = paste(prefix, txt, sep = "") if (missing(cex.cor)) cex = 0.8/strwidth(txt) test = cor.test(x, y) Signif = symnum(test$p.value, corr = FALSE, na = FALSE, cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1), symbols = c("***", "**", "*", ".", " ")) text(0.5, 0.5, txt, cex = cex * r) text(0.8, 0.8, Signif, cex = cex, col = 2) } pairs(..., upper.panel = panel.cor) } pairs2((data[,1:5])) # distribution layout(matrix(1:6, 2), respect=TRUE) plot(density(data[,"X200753_x_at"])) plot(density(data[,"X200754_x_at"])) plot(density(data[,"X214882_s_at"])) plot(density(data[,"X213721_at" ])) plot(density(data[,"X213722_at" ])) # stages data$STAGE idx_grp1 = data$STAGE %in% c("Ia", "Ib", "IIa", "IIb") idx_grp2 = data$STAGE %in% c("IIIa", "IIIb") data$stage_grps = NA data[idx_grp1,]$stage_grps = "grp1" data[idx_grp2,]$stage_grps = "grp2" layout(matrix(1:6, 2), respect=TRUE) beanplot::beanplot(X200753_x_at~stage_grps, data) beanplot::beanplot(X200754_x_at~stage_grps, data) beanplot::beanplot(X214882_s_at~stage_grps, data) beanplot::beanplot(X213721_at ~stage_grps, data) beanplot::beanplot(X213722_at ~stage_grps, data) t.test(data[idx_grp1, "X200753_x_at"], data[idx_grp2, "X200753_x_at"]) t.test(data[idx_grp1, "X200754_x_at"], data[idx_grp2, "X200754_x_at"]) t.test(data[idx_grp1, "X214882_s_at"], data[idx_grp2, "X214882_s_at"]) t.test(data[idx_grp1, "X213721_at" ], data[idx_grp2, "X213721_at" ]) t.test(data[idx_grp1, "X213722_at" ], data[idx_grp2, "X213722_at" ]) wilcox.test(data[idx_grp1, "X200753_x_at"], data[idx_grp2, "X200753_x_at"]) wilcox.test(data[idx_grp1, "X200754_x_at"], data[idx_grp2, "X200754_x_at"]) wilcox.test(data[idx_grp1, "X214882_s_at"], data[idx_grp2, "X214882_s_at"]) wilcox.test(data[idx_grp1, "X213721_at" ], data[idx_grp2, "X213721_at" ]) wilcox.test(data[idx_grp1, "X213722_at" ], data[idx_grp2, "X213722_at" ]) layout(matrix(1:6, 2), respect=TRUE) beanplot::beanplot(X200753_x_at~SEX, data) beanplot::beanplot(X200754_x_at~SEX, data) beanplot::beanplot(X214882_s_at~SEX, data) beanplot::beanplot(X213721_at ~SEX, data) beanplot::beanplot(X213722_at ~SEX, data) layout(matrix(1:6, 2), respect=TRUE) plot(data[,"X200753_x_at"], data$AGE) plot(data[,"X200754_x_at"], data$AGE) plot(data[,"X214882_s_at"], data$AGE) plot(data[,"X213721_at" ], data$AGE) plot(data[,"X213722_at" ], data$AGE) layout(matrix(1:6, 2), respect=TRUE) beanplot::beanplot(X200753_x_at~T, data) beanplot::beanplot(X200754_x_at~T, data) beanplot::beanplot(X214882_s_at~T, data) beanplot::beanplot(X213721_at ~T, data) beanplot::beanplot(X213722_at ~T, data) ``` ## Control structure A control structure is a block of programming that analyzes variables and chooses a direction in which to go based on given parameters. ## Algorithm An algorithm is a self-contained sequence of actions to be performed. An algorithm is an effective method that can be expressed within a finite amount of space and time. # References http://adv-r.had.co.nz http://r-pkgs.had.co.nz # Exercices http://graal.ens-lyon.fr/~fchuffart/files/data/exercices.pdf # RegLog ```{r} d = data_nutri layout(matrix(1:2, 1), respect=TRUE) plot(d$taille, d$poids) # Y~B # E(Y) = b.X # E(Y) = b_0 + b_1.X # Y_i = b_0 + b_1.X_i + e_i m = lm(d$poids~d$taille) m = lm(poids~taille, d) m$coefficients m$residuals abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) # arrows(d$taille, d$poids, d$taille+m$residuals, d$poids, col=adjustcolor(1, alpha.f=0.5)) arrows(d$taille, d$poids, d$taille, d$poids-m$residuals, col=adjustcolor(1, alpha.f=0.5)) layout(matrix(1:2, 1), respect=TRUE) plot(d$taille, d$poids) # Y~B # E(Y) = b.X # E(Y) = b_0 + b_1.X # Y_i = b_0 + b_1.X_i + e_i m = lm(d$poids~d$taille) m$coefficients abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) plot(d$taille, d$poids-m$residuals, col=1) arrows(d$taille, d$poids, d$taille, d$poids-m$residuals, col=adjustcolor(1, alpha.f=0.5)) d = data_nutri layout(matrix(1:2, 1), respect=TRUE) plot(d$sex, d$poids) # Y~B # E(Y) = b.X # E(Y) = b_0 + b_1.X # Y_i = b_0 + b_1.X_i + e_i m = lm(d$poids~d$sex) m$coefficients abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) plot(1:nrow(d), d$poids-m$residuals, col=1) arrows(1:nrow(d), d$poids, 1:nrow(d), d$poids-m$residuals, col=adjustcolor(1, alpha.f=0.5)) abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) arrows(d$taille, d$poids, d$taille + 0.5*m$coefficients[[2]]*m$residuals, d$poids - sqrt(m$coefficients[[2]]*m$residuals), col=adjustcolor(1, alpha.f=0.5)) arrows(d$taille, d$poids, d$taille + 0.5*m$coefficients[[2]]*m$residuals, d$poids - sqrt(m$coefficients[[2]]*m$residuals), col=adjustcolor(1, alpha.f=0.5)) # plot( d$poids, d$taille) # m = lm(d$taille~d$poids) # m = lm(taille~poids, d) # m$coefficients # m$residuals # arrows(d$poids, d$taille, d$poids+m$residuals, d$taille, col=adjustcolor(1, alpha.f=0.5)) # abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) # plot(density(m$residuals)) plot(data_nutri$sexe, data_nutri$taille) ``` # Session Information ```{r, results="verbatim"} sessionInfo() ```
About
An introduction to R
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published