Fixes from edX revision project #104

Open · wants to merge 1 commit into master

biocintro_5x/bioc1_liftOver.Rmd (2 changes: 1 addition & 1 deletion)

@@ -68,7 +68,7 @@ ah = AnnotationHub()
q1 = query(ah, c("chain")) # list all resources with 'chain' in metadata
q1
q2 = query(ah, c("chain", "hg38ToHg19")) # the one we want
-ch = ah[[names(q2)]]
+ch = ah[[names(q2)[1]]]
```
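With the chain in hand, the lift-over itself is a short step. A minimal sketch, assuming a GRanges object `gr38` holding hg38 coordinates (the name `gr38` is illustrative, not from the file):

```r
library(rtracklayer)
# seqnames must use UCSC style ("chr1", not "1") to match the chain file.
# liftOver() returns a GRangesList because one hg38 range can map to
# zero, one, or several hg19 ranges.
gr19 <- unlist(liftOver(gr38, ch))
genome(gr19) <- "hg19"   # record the new build on the result
```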

```{r domyimport}

biocintro_5x/bioc1_roast.Rmd (8 changes: 4 additions & 4 deletions)

@@ -31,7 +31,7 @@ Briefly, the investigators applied a glucocorticoid hormone to cultured human ai
The groups are defined in the `characteristics_ch1.2` variable:

```{r}
-e$condition <- e$characteristics_ch1.2
+e$condition <- as.factor(e$characteristics_ch1.2)
levels(e$condition) <- c("dex24","dex4","control")
table(e$condition)
```
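The `as.factor()` call matters because assigning `levels()` only relabels factors; since R 4.0 changed the `stringsAsFactors` default, columns such as `characteristics_ch1.2` often arrive as plain character vectors, for which the assignment silently does nothing useful. A small illustration (the toy vector is hypothetical):

```r
x <- c("b", "a", "a")            # character vector, as pData columns may now be
levels(x) <- c("one", "two")     # only attaches an attribute; values unchanged
f <- as.factor(c("b", "a", "a")) # factor levels are sorted: "a", "b"
levels(f) <- c("one", "two")     # relabels: "a" -> "one", "b" -> "two"
f                                # "two" "one" "one"
```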
@@ -128,7 +128,7 @@ columns(GO.db)
keytypes(GO.db)
GOTERM[[rownames(r2)[1]]]
r2tab <- select(GO.db, keys=rownames(r2)[1:10],
-                columns=c("GOID","TERM","DEFINITION"),
+                columns=c("GOID","TERM","DEFINITION"),
keytype="GOID")
r2tab[,1:2]
```
@@ -138,7 +138,7 @@ We can also look for the top results using the standard p-value and in the *up*
```{r}
r2 <- r2[order(r2$PValue),]
r2tab <- select(GO.db, keys=rownames(r2)[r2$Direction == "Up"][1:10],
-                columns=c("GOID","TERM","DEFINITION"),
+                columns=c("GOID","TERM","DEFINITION"),
keytype="GOID")
r2tab[,1:2]
```
@@ -147,7 +147,7 @@ Again but for the *down* direction.

```{r}
r2tab <- select(GO.db, keys=rownames(r2)[r2$Direction == "Down"][1:5],
-                columns=c("GOID","TERM","DEFINITION"),
+                columns=c("GOID","TERM","DEFINITION"),
keytype="GOID")
r2tab[,1:2]
```

biocintro_5x/bioc1_t_mult.Rmd (13 changes: 7 additions & 6 deletions)

@@ -32,11 +32,11 @@ In this section we will cover inference in the context of genome-scale experimen
- there may be unmeasured or unreported sources of biological variation (such as time of day)
- many features are inherently interrelated, so the tests are not independent

-We will apply some of the concepts we have covered in previous
+We will apply some of the concepts we have covered in previous
sections including t-tests and multiple comparisons; later we will
-compute standard deviation estimates from hierarchical models.
+compute standard deviation estimates from hierarchical models.

-We start by loading the pooling experiment data
+We start by loading the pooling experiment data


```{r,message=FALSE}
@@ -68,7 +68,7 @@ tt=rowttests(y,g)
```

<a name="naive"></a>
-Now which genes do we report as statistically significant? For somewhat arbitrary reasons, in science p-values of 0.01 and 0.05 are used as cutoff. In this particular example we get
+Now which genes do we report as statistically significant? For somewhat arbitrary reasons, in science p-values of 0.01 and 0.05 are used as cutoffs. In this particular example we get

```{r}
NsigAt01 = sum(tt$p.value<0.01)
@@ -90,17 +90,18 @@ set.seed(0)
shuffledIndex <- factor(sample(c(0,1),sum(g==0),replace=TRUE ))
nulltt <- rowttests(y[,g==0],shuffledIndex)
NfalselySigAt01 = sum(nulltt$p.value<0.01)
-NfalselySigAt01
+NfalselySigAt01
NfalselySigAt05 = sum(nulltt$p.value<0.05)
NfalselySigAt05
```

<a name="adjusted"></a>

-If we use the 0.05 cutoff we will be reporting `r NfalselySigAt05` false positives. We have described several ways to adjust for this including the `qvalue` method available in the `r Biocpkg("qvalue")` package. After this adjustment we acquire
+If we use the 0.05 cutoff we will be reporting `r NfalselySigAt05` false positives. We have described several ways to adjust for this, including the `qvalue` method available in the `r Biocpkg("qvalue")` package. After this adjustment we obtain
a smaller list of genes.

```{r}
+library(BiocStyle)
library(qvalue)
qvals = qvalue(tt$p.value)$qvalue
sum(qvals<0.05)

highdim/distance.Rmd (6 changes: 2 additions & 4 deletions)

@@ -14,11 +14,11 @@ opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::

The concept of distance is quite intuitive. For example, when we cluster animals into subgroups, we are implicitly defining a distance that permits us to say what animals are "close" to each other.

-![Clustering of animals.](images/handmade/animals.png)
+![Clustering of animals.](https://genomicsclass.github.io/book/pages/images/handmade/animals.png)

Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. Many clustering and machine learning techniques rely on being able to define distance, using features or predictors. For example, to create _heatmaps_, which are widely used in genomics and other high-throughput fields, a distance is computed explicitly.
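As a concrete illustration of this point, a minimal sketch of the distance-then-heatmap pipeline on simulated data (the matrix `x` and everything in it are illustrative, not from the chapter):

```r
set.seed(1)
x <- matrix(rnorm(200), nrow = 20)  # 20 features by 10 samples
d <- dist(t(x))                     # Euclidean distances between sample columns
plot(hclust(d))                     # clustering built directly from those distances
heatmap(x)                          # heatmap() calls dist() internally to reorder rows and columns
```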

-![Example of heatmap. Image Source: Heatmap, Gaeddal, 01.28.2007 Wikimedia Commons](images/handmade/Heatmap.png)
+![Example of heatmap. Image Source: Heatmap, Gaeddal, 01.28.2007 Wikimedia Commons](https://genomicsclass.github.io/book/pages/images/handmade/Heatmap.png)

In these plots the measurements, which are stored in a matrix, are
represented with colors after the columns and rows have been
@@ -126,5 +126,3 @@ as.matrix(d)[1,87]
```

It is important to remember that if we run `dist` on `e`, it will compute all pairwise distances between genes. This will try to create a $22215 \times 22215$ matrix that may crash your R session.


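The practical workaround the warning points to is to transpose or subset before calling `dist`. A minimal sketch, assuming the expression matrix `e` from this chapter (the subset size is arbitrary):

```r
d_samples <- dist(t(e))      # distances between sample columns: cheap
idx <- sample(nrow(e), 500)  # a manageable random subset of genes
d_genes <- dist(e[idx, ])    # pairwise distances among 500 genes only
```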

modeling/hierarchical_models.Rmd (20 changes: 10 additions & 10 deletions)

@@ -10,7 +10,7 @@ opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::

## Hierarchical Models

-In this section, we use the mathematical theory which describes an approach that has become widely applied in the analysis of high-throughput data. The general idea is to build a _hierachichal model_ with two levels. One level describes variability across samples/units, and the other describes variability across features. This is similar to the baseball example in which the first level described variability across players and the second described the randomness for the success of one player. The first level of variation is accounted for by all the models and approaches we have described here, for example the model that leads to the t-test. The second level provides power by permitting us to "borrow" information from all features to inform the inference performed on each one.
+In this section, we use the mathematical theory which describes an approach that has become widely applied in the analysis of high-throughput data. The general idea is to build a _hierarchical model_ with two levels. One level describes variability across samples/units, and the other describes variability across features. This is similar to the baseball example in which the first level described variability across players and the second described the randomness for the success of one player. The first level of variation is accounted for by all the models and approaches we have described here, for example the model that leads to the t-test. The second level provides power by permitting us to "borrow" information from all features to inform the inference performed on each one.

Here we describe one specific case that is currently the most widely
used approach to inference with gene expression data. It is the model
@@ -20,7 +20,7 @@ adapted to develop methods for other data types such as RNAseq by, for example,
[DESeq2](http://www.ncbi.nlm.nih.gov/pubmed/25516281). This package
provides an alternative to the t-test that greatly improves power by
modeling the variance. While in the baseball example we modeled averages, here we model variances. Modelling variances requires more advanced math, but the concepts are practically the same. We motivate and demonstrate the approach with
-an example.
+an example.

Here is a volcano showing effect sizes and p-value from applying a t-test to data from an experiment running six replicated samples with 16 genes artificially made to be different in two groups of three samples each. These 16 genes are the only genes for which the alternative hypothesis is true. In the plot they are shown in blue.
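The chunk that draws this plot is collapsed in the diff. For orientation, a sketch of the setup it relies on, assuming the `SpikeInSubset` data this file uses (`rma95`, `fac`, `tt`, and `spike` reappear in the hunks below; details such as the exact color assignment are approximate):

```r
library(SpikeInSubset)   # Bioconductor data package providing rma95
library(genefilter)
data(rma95)
y   <- exprs(rma95)
fac <- factor(rep(1:2, each = 3))                 # two groups of three samples
tt  <- rowttests(y, fac)                          # gene-wise t-tests
spike <- rownames(y) %in% colnames(pData(rma95))  # the 16 spiked-in genes
cols  <- ifelse(spike, "dodgerblue",
                ifelse(tt$p.value < 0.01, "red", "black"))
with(tt, plot(-dm, -log10(p.value), cex = .8, pch = 16, col = cols,
              xlab = "difference in means", xlim = c(-1, 1), ylim = c(0, 4.5)))
abline(h = 2, v = c(-.2, .2), lty = 2)
```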

Expand All @@ -47,7 +47,7 @@ We cut-off the range of the y-axis at 4.5, but there is one blue point with a p-
sum( p.adjust(tt$p.value,method = "BH")[spike] < 0.05)
```

-This of course has to do with the low power associated with a sample size of three in each group. The second finding is that if we forget about inference and simply rank the genes based on the size of the t-statistic, we obtain many false positives in any rank list of size larger than 1. For example, six of the top 10 genes ranked by t-statistic are false positives.
+This of course has to do with the low power associated with a sample size of three in each group. The second finding is that if we forget about inference and simply rank the genes based on the size of the t-statistic, we obtain many false positives in any ranked list of size larger than 1. For example, six of the top 10 genes ranked by t-statistic are false positives.

```{r}
table( top50=rank(tt$p.value)<= 10, spike) #t-stat and p-val rank is the same
@@ -56,7 +56,7 @@ table( top50=rank(tt$p.value)<= 10, spike) #t-stat and p-val rank is the same
In the plot we notice that these are mostly genes for which the effect size is relatively small, implying that the estimated standard error is small. We can confirm this with a plot:

```{r pval_versus_sd, fig.cap="p-values versus standard deviation estimates."}
-tt$s <- apply(exprs(rma95), 1, function(row)
+tt$s <- apply(exprs(rma95), 1, function(row)
sqrt(.5 * (var(row[1:3]) + var(row[4:6]) ) ) )
with(tt, plot(s, -log10(p.value), cex=.8, pch=16,
log="x",xlab="estimate of standard deviation",
@@ -70,23 +70,23 @@
$$
s^2 \sim s_0^2 F_{d,d_0}
$$

-Because we have thousands of data points, we can actually check this assumption and also estimate the parameters $s_0$ and $d_0$. This particular approach is referred to as empirical Bayes because it can be described as using data (empirical) to build the prior distribution (Bayesian approach).
+Because we have thousands of data points, we can actually check this assumption and also estimate the parameters $s_0$ and $d_0$. This particular approach is referred to as empirical Bayes because it can be described as using data (empirical) to build the prior distribution (Bayesian approach).
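limma performs exactly this estimation. A minimal sketch of estimating $s_0^2$ and $d_0$ and eyeballing the F assumption, using `tt$s` from the chunk above (`squeezeVar` is limma's empirical Bayes routine; here the residual degrees of freedom are $d = 4$, two groups of three):

```r
library(limma)
d  <- 4                              # 2 * (3 - 1) residual degrees of freedom
sv <- squeezeVar(tt$s^2, df = d)     # empirical Bayes fit across all genes
sv$var.prior                         # estimate of s0^2
sv$df.prior                          # estimate of d0
# sanity check: sample variances should look like s0^2 * F(d, d0)
theoretical <- sv$var.prior * qf(ppoints(1000), d, sv$df.prior)
qqplot(theoretical, tt$s^2, log = "xy"); abline(0, 1)
```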

Now we apply what we learned with the baseball example to the standard error estimates. As before we have an observed value for each gene $s_g$, a sampling distribution as a prior distribution. We can therefore compute a posterior distribution for the variance $\sigma^2_g$ and obtain the posterior mean. You can see the details of the derivation in [this paper](http://www.ncbi.nlm.nih.gov/pubmed/16646809).

$$
\mbox{E}[\sigma^2_g \mid s_g] = \frac{d_0 s_0^2 + d s^2_g}{d_0 + d}
$$

-As in the baseball example, the posterior mean _shrinks_ the observed variance $s_g^2$ towards the global variance $s_0^2$ and the weights depend on the sample size through the degrees of freedom $d$ and, in this case, the shape of the prior distribution through $d_0$.
+As in the baseball example, the posterior mean _shrinks_ the observed variance $s_g^2$ towards the global variance $s_0^2$ and the weights depend on the sample size through the degrees of freedom $d$ and, in this case, the shape of the prior distribution through $d_0$.
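Continuing the sketch above, the posterior mean that `squeezeVar` returns is exactly this closed form (assuming the estimated $d_0$ is finite):

```r
post <- (sv$df.prior * sv$var.prior + d * tt$s^2) / (sv$df.prior + d)
all.equal(post, sv$var.post)   # TRUE: the shrunken variance estimates
```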

In the plot above we can see how the variance estimates _shrink_ for 40 genes (code not shown):


```{r shrinkage, fig.cap="Illustration of how estimates shrink towards the prior expectation. Forty genes spanning the range of values were selected.",message=FALSE, echo=FALSE}
library(limma)
fit <- lmFit(rma95, model.matrix(~ fac))
-ebfit <- ebayes(fit)
+ebfit <- eBayes(fit)

n <- 40
qs <- seq(from=0,to=.2,length=n)
@@ -105,16 +105,16 @@ An important aspect of this adjustment is that genes having a sample standard de
```{r volcano-plot2, message=FALSE, fig.cap="Volcano plot for moderated t-test comparing two groups. Spiked-in genes are denoted with blue. Among the rest of the genes, those with p-value < 0.01 are denoted with red.",message=FALSE}
library(limma)
fit <- lmFit(rma95, model.matrix(~ fac))
-ebfit <- ebayes(fit)
+ebfit <- eBayes(fit)
limmares <- data.frame(dm=coef(fit)[,"fac2"], p.value=ebfit$p.value[,"fac2"])
with(limmares, plot(dm, -log10(p.value),cex=.8, pch=16,
col=cols,xlab="difference in means",
xlim=c(-1,1), ylim=c(0,5)))
abline(h=2,v=c(-.2,.2), lty=2)
```

-The number of false positives in the top 10 is now reduced to 2.
+The number of false positives in the top 10 is now reduced to 2.

```{r}
-table( top50=rank(limmares$p.value)<= 10, spike)
+table( top50=rank(limmares$p.value)<= 10, spike)
```