Fixes from edX revision project #104

Open · wants to merge 1 commit into master

biocintro_5x/bioc1_liftOver.Rmd (2 changes: 1 addition & 1 deletion)

@@ -68,7 +68,7 @@ ah = AnnotationHub()
q1 = query(ah, c("chain")) # list all resources with 'chain' in metadata
q1
q2 = query(ah, c("chain", "hg38ToHg19")) # the one we want
-ch = ah[[names(q2)]]
+ch = ah[[names(q2)[1]]]
```
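With the chain in hand, the lift-over itself is a short step. A minimal sketch, assuming a GRanges object `gr38` holding hg38 coordinates (the name `gr38` is illustrative, not from the file):

```r
library(rtracklayer)
# seqnames must use UCSC style ("chr1", not "1") to match the chain file.
# liftOver() returns a GRangesList because one hg38 range can map to
# zero, one, or several hg19 ranges.
gr19 <- unlist(liftOver(gr38, ch))
genome(gr19) <- "hg19"   # record the new build on the result
```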

```{r domyimport}

biocintro_5x/bioc1_roast.Rmd (8 changes: 4 additions & 4 deletions)

@@ -31,7 +31,7 @@ Briefly, the investigators applied a glucocorticoid hormone to cultured human ai
The groups are defined in the `characteristics_ch1.2` variable:

```{r}
-e$condition <- e$characteristics_ch1.2
+e$condition <- as.factor(e$characteristics_ch1.2)
levels(e$condition) <- c("dex24","dex4","control")
table(e$condition)
```
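The `as.factor()` call matters because assigning `levels()` only relabels factors; since R 4.0 changed the `stringsAsFactors` default, columns such as `characteristics_ch1.2` often arrive as plain character vectors, for which the assignment silently does nothing useful. A small illustration (the toy vector is hypothetical):

```r
x <- c("b", "a", "a")            # character vector, as pData columns may now be
levels(x) <- c("one", "two")     # only attaches an attribute; values unchanged
f <- as.factor(c("b", "a", "a")) # factor levels are sorted: "a", "b"
levels(f) <- c("one", "two")     # relabels: "a" -> "one", "b" -> "two"
f                                # "two" "one" "one"
```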
@@ -128,7 +128,7 @@ columns(GO.db)
keytypes(GO.db)
GOTERM[[rownames(r2)[1]]]
r2tab <- select(GO.db, keys=rownames(r2)[1:10],
-                columns=c("GOID","TERM","DEFINITION"),
+                columns=c("GOID","TERM","DEFINITION"),
keytype="GOID")
r2tab[,1:2]
```
@@ -138,7 +138,7 @@ We can also look for the top results using the standard p-value and in the *up*
```{r}
r2 <- r2[order(r2$PValue),]
r2tab <- select(GO.db, keys=rownames(r2)[r2$Direction == "Up"][1:10],
-                columns=c("GOID","TERM","DEFINITION"),
+                columns=c("GOID","TERM","DEFINITION"),
keytype="GOID")
r2tab[,1:2]
```
@@ -147,7 +147,7 @@ Again but for the *down* direction.

```{r}
r2tab <- select(GO.db, keys=rownames(r2)[r2$Direction == "Down"][1:5],
-                columns=c("GOID","TERM","DEFINITION"),
+                columns=c("GOID","TERM","DEFINITION"),
keytype="GOID")
r2tab[,1:2]
```

biocintro_5x/bioc1_t_mult.Rmd (13 changes: 7 additions & 6 deletions)

@@ -32,11 +32,11 @@ In this section we will cover inference in the context of genome-scale experimen
- there may be unmeasured or unreported sources of biological variation (such as time of day)
- many features are inherently interrelated, so the tests are not independent

-We will apply some of the concepts we have covered in previous
+We will apply some of the concepts we have covered in previous
sections including t-tests and multiple comparisons; later we will
-compute standard deviation estimates from hierarchical models.
+compute standard deviation estimates from hierarchical models.

-We start by loading the pooling experiment data
+We start by loading the pooling experiment data


```{r,message=FALSE}
@@ -68,7 +68,7 @@ tt=rowttests(y,g)
```

<a name="naive"></a>
-Now which genes do we report as statistically significant? For somewhat arbitrary reasons, in science p-values of 0.01 and 0.05 are used as cutoff. In this particular example we get
+Now which genes do we report as statistically significant? For somewhat arbitrary reasons, in science p-values of 0.01 and 0.05 are used as cutoffs. In this particular example we get

```{r}
NsigAt01 = sum(tt$p.value<0.01)
@@ -90,17 +90,18 @@ set.seed(0)
shuffledIndex <- factor(sample(c(0,1),sum(g==0),replace=TRUE ))
nulltt <- rowttests(y[,g==0],shuffledIndex)
NfalselySigAt01 = sum(nulltt$p.value<0.01)
-NfalselySigAt01
+NfalselySigAt01
NfalselySigAt05 = sum(nulltt$p.value<0.05)
NfalselySigAt05
```

<a name="adjusted"></a>

-If we use the 0.05 cutoff we will be reporting `r NfalselySigAt05` false positives. We have described several ways to adjust for this including the `qvalue` method available in the `r Biocpkg("qvalue")` package. After this adjustment we acquire
+If we use the 0.05 cutoff we will be reporting `r NfalselySigAt05` false positives. We have described several ways to adjust for this, including the `qvalue` method available in the `r Biocpkg("qvalue")` package. After this adjustment we obtain
a smaller list of genes.

```{r}
+library(BiocStyle)
library(qvalue)
qvals = qvalue(tt$p.value)$qvalue
sum(qvals<0.05)

highdim/distance.Rmd (6 changes: 2 additions & 4 deletions)

@@ -14,11 +14,11 @@ opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::

The concept of distance is quite intuitive. For example, when we cluster animals into subgroups, we are implicitly defining a distance that permits us to say what animals are "close" to each other.

-![Clustering of animals.](images/handmade/animals.png)
+![Clustering of animals.](https://genomicsclass.github.io/book/pages/images/handmade/animals.png)

Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. Many clustering and machine learning techniques rely on being able to define distance, using features or predictors. For example, to create _heatmaps_, which are widely used in genomics and other high-throughput fields, a distance is computed explicitly.
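As a concrete illustration of this point, a minimal sketch of the distance-then-heatmap pipeline on simulated data (the matrix `x` and everything in it are illustrative, not from the chapter):

```r
set.seed(1)
x <- matrix(rnorm(200), nrow = 20)  # 20 features by 10 samples
d <- dist(t(x))                     # Euclidean distances between sample columns
plot(hclust(d))                     # clustering built directly from those distances
heatmap(x)                          # heatmap() calls dist() internally to reorder rows and columns
```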

-![Example of heatmap. Image Source: Heatmap, Gaeddal, 01.28.2007 Wikimedia Commons](images/handmade/Heatmap.png)
+![Example of heatmap. Image Source: Heatmap, Gaeddal, 01.28.2007 Wikimedia Commons](https://genomicsclass.github.io/book/pages/images/handmade/Heatmap.png)

In these plots the measurements, which are stored in a matrix, are
represented with colors after the columns and rows have been
@@ -126,5 +126,3 @@ as.matrix(d)[1,87]
```

It is important to remember that if we run `dist` on `e`, it will compute all pairwise distances between genes. This will try to create a $22215 \times 22215$ matrix that may crash your R session.


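The practical workaround the warning points to is to transpose or subset before calling `dist`. A minimal sketch, assuming the expression matrix `e` from this chapter (the subset size is arbitrary):

```r
d_samples <- dist(t(e))      # distances between sample columns: cheap
idx <- sample(nrow(e), 500)  # a manageable random subset of genes
d_genes <- dist(e[idx, ])    # pairwise distances among 500 genes only
```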

modeling/hierarchical_models.Rmd (20 changes: 10 additions & 10 deletions)

@@ -10,7 +10,7 @@ opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::

## Hierarchical Models

-In this section, we use the mathematical theory which describes an approach that has become widely applied in the analysis of high-throughput data. The general idea is to build a _hierachichal model_ with two levels. One level describes variability across samples/units, and the other describes variability across features. This is similar to the baseball example in which the first level described variability across players and the second described the randomness for the success of one player. The first level of variation is accounted for by all the models and approaches we have described here, for example the model that leads to the t-test. The second level provides power by permitting us to "borrow" information from all features to inform the inference performed on each one.
+In this section, we use the mathematical theory which describes an approach that has become widely applied in the analysis of high-throughput data. The general idea is to build a _hierarchical model_ with two levels. One level describes variability across samples/units, and the other describes variability across features. This is similar to the baseball example in which the first level described variability across players and the second described the randomness for the success of one player. The first level of variation is accounted for by all the models and approaches we have described here, for example the model that leads to the t-test. The second level provides power by permitting us to "borrow" information from all features to inform the inference performed on each one.

Here we describe one specific case that is currently the most widely
used approach to inference with gene expression data. It is the model
@@ -20,7 +20,7 @@ adapted to develop methods for other data types such as RNAseq by, for example,
[DESeq2](http://www.ncbi.nlm.nih.gov/pubmed/25516281). This package
provides an alternative to the t-test that greatly improves power by
modeling the variance. While in the baseball example we modeled averages, here we model variances. Modelling variances requires more advanced math, but the concepts are practically the same. We motivate and demonstrate the approach with
-an example.
+an example.

Here is a volcano showing effect sizes and p-value from applying a t-test to data from an experiment running six replicated samples with 16 genes artificially made to be different in two groups of three samples each. These 16 genes are the only genes for which the alternative hypothesis is true. In the plot they are shown in blue.
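The chunk that draws this plot is collapsed in the diff. For orientation, a sketch of the setup it relies on, assuming the `SpikeInSubset` data this file uses (`rma95`, `fac`, `tt`, and `spike` reappear in the hunks below; details such as the exact color assignment are approximate):

```r
library(SpikeInSubset)   # Bioconductor data package providing rma95
library(genefilter)
data(rma95)
y   <- exprs(rma95)
fac <- factor(rep(1:2, each = 3))                 # two groups of three samples
tt  <- rowttests(y, fac)                          # gene-wise t-tests
spike <- rownames(y) %in% colnames(pData(rma95))  # the 16 spiked-in genes
cols  <- ifelse(spike, "dodgerblue",
                ifelse(tt$p.value < 0.01, "red", "black"))
with(tt, plot(-dm, -log10(p.value), cex = .8, pch = 16, col = cols,
              xlab = "difference in means", xlim = c(-1, 1), ylim = c(0, 4.5)))
abline(h = 2, v = c(-.2, .2), lty = 2)
```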

Expand All @@ -47,7 +47,7 @@ We cut-off the range of the y-axis at 4.5, but there is one blue point with a p-
sum( p.adjust(tt$p.value,method = "BH")[spike] < 0.05)
```

-This of course has to do with the low power associated with a sample size of three in each group. The second finding is that if we forget about inference and simply rank the genes based on the size of the t-statistic, we obtain many false positives in any rank list of size larger than 1. For example, six of the top 10 genes ranked by t-statistic are false positives.
+This of course has to do with the low power associated with a sample size of three in each group. The second finding is that if we forget about inference and simply rank the genes based on the size of the t-statistic, we obtain many false positives in any ranked list of size larger than 1. For example, six of the top 10 genes ranked by t-statistic are false positives.

```{r}
table( top50=rank(tt$p.value)<= 10, spike) #t-stat and p-val rank is the same
@@ -56,7 +56,7 @@ table( top50=rank(tt$p.value)<= 10, spike) #t-stat and p-val rank is the same
In the plot we notice that these are mostly genes for which the effect size is relatively small, implying that the estimated standard error is small. We can confirm this with a plot:

```{r pval_versus_sd, fig.cap="p-values versus standard deviation estimates."}
-tt$s <- apply(exprs(rma95), 1, function(row)
+tt$s <- apply(exprs(rma95), 1, function(row)
sqrt(.5 * (var(row[1:3]) + var(row[4:6]) ) ) )
with(tt, plot(s, -log10(p.value), cex=.8, pch=16,
log="x",xlab="estimate of standard deviation",
@@ -70,23 +70,23 @@
$$
s^2 \sim s_0^2 F_{d,d_0}
$$

-Because we have thousands of data points, we can actually check this assumption and also estimate the parameters $s_0$ and $d_0$. This particular approach is referred to as empirical Bayes because it can be described as using data (empirical) to build the prior distribution (Bayesian approach).
+Because we have thousands of data points, we can actually check this assumption and also estimate the parameters $s_0$ and $d_0$. This particular approach is referred to as empirical Bayes because it can be described as using data (empirical) to build the prior distribution (Bayesian approach).
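limma performs exactly this estimation. A minimal sketch of estimating $s_0^2$ and $d_0$ and eyeballing the F assumption, using `tt$s` from the chunk above (`squeezeVar` is limma's empirical Bayes routine; here the residual degrees of freedom are $d = 4$, two groups of three):

```r
library(limma)
d  <- 4                              # 2 * (3 - 1) residual degrees of freedom
sv <- squeezeVar(tt$s^2, df = d)     # empirical Bayes fit across all genes
sv$var.prior                         # estimate of s0^2
sv$df.prior                          # estimate of d0
# sanity check: sample variances should look like s0^2 * F(d, d0)
theoretical <- sv$var.prior * qf(ppoints(1000), d, sv$df.prior)
qqplot(theoretical, tt$s^2, log = "xy"); abline(0, 1)
```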

Now we apply what we learned with the baseball example to the standard error estimates. As before we have an observed value for each gene $s_g$, a sampling distribution as a prior distribution. We can therefore compute a posterior distribution for the variance $\sigma^2_g$ and obtain the posterior mean. You can see the details of the derivation in [this paper](http://www.ncbi.nlm.nih.gov/pubmed/16646809).

$$
\mbox{E}[\sigma^2_g \mid s_g] = \frac{d_0 s_0^2 + d s^2_g}{d_0 + d}
$$

-As in the baseball example, the posterior mean _shrinks_ the observed variance $s_g^2$ towards the global variance $s_0^2$ and the weights depend on the sample size through the degrees of freedom $d$ and, in this case, the shape of the prior distribution through $d_0$.
+As in the baseball example, the posterior mean _shrinks_ the observed variance $s_g^2$ towards the global variance $s_0^2$ and the weights depend on the sample size through the degrees of freedom $d$ and, in this case, the shape of the prior distribution through $d_0$.
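Continuing the sketch above, the posterior mean that `squeezeVar` returns is exactly this closed form (assuming the estimated $d_0$ is finite):

```r
post <- (sv$df.prior * sv$var.prior + d * tt$s^2) / (sv$df.prior + d)
all.equal(post, sv$var.post)   # TRUE: the shrunken variance estimates
```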

In the plot above we can see how the variance estimates _shrink_ for 40 genes (code not shown):


```{r shrinkage, fig.cap="Illustration of how estimates shrink towards the prior expectation. Forty genes spanning the range of values were selected.",message=FALSE, echo=FALSE}
library(limma)
fit <- lmFit(rma95, model.matrix(~ fac))
-ebfit <- ebayes(fit)
+ebfit <- eBayes(fit)

n <- 40
qs <- seq(from=0,to=.2,length=n)
@@ -105,16 +105,16 @@ An important aspect of this adjustment is that genes having a sample standard de
```{r volcano-plot2, message=FALSE, fig.cap="Volcano plot for moderated t-test comparing two groups. Spiked-in genes are denoted with blue. Among the rest of the genes, those with p-value < 0.01 are denoted with red.",message=FALSE}
library(limma)
fit <- lmFit(rma95, model.matrix(~ fac))
-ebfit <- ebayes(fit)
+ebfit <- eBayes(fit)
limmares <- data.frame(dm=coef(fit)[,"fac2"], p.value=ebfit$p.value[,"fac2"])
with(limmares, plot(dm, -log10(p.value),cex=.8, pch=16,
col=cols,xlab="difference in means",
xlim=c(-1,1), ylim=c(0,5)))
abline(h=2,v=c(-.2,.2), lty=2)
```

-The number of false positives in the top 10 is now reduced to 2.
+The number of false positives in the top 10 is now reduced to 2.

```{r}
-table( top50=rank(limmares$p.value)<= 10, spike)
+table( top50=rank(limmares$p.value)<= 10, spike)
```