-
Notifications
You must be signed in to change notification settings - Fork 3
/
corshrink.Rmd
290 lines (198 loc) · 14.2 KB
/
corshrink.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
---
title: "Adaptive Shrinkage of correlations using *CorShrink*"
shorttitle: "Corshrink Adaptive Shrinkage"
author:
- name: Kushal K Dey
affiliation:
- Department of Statistics, University of Chicago
- name: Matthew Stephens
affiliation:
- Department of Statistics, University of Chicago
- Department of Human Genetics, University of Chicago
email: [email protected]
package: CorShrink
abstract: >
Estimation of correlation or covariance matrix, especially in settings where the number of samples n is much smaller than the number of features p, has been a much studied problem in statistics. However, these popular approaches are not well suited for handling large scale missing data. Here we propose an alternative approach, *CorShrink*, for covariance or correlation estimation that adapts to varying degree of missingness in observations corresponding to each pair of features. We show the different formulations of *Corshrink* and its flexibility to different types of data formats through examples.
output:
html_document:
toc: true
toc_float: true
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8} ---
date: "`r Sys.Date()`"
bibliography: corshrink.bib
---
```{r knitr, echo=FALSE, results="hide"}
library("knitr")
opts_chunk$set(tidy=FALSE,tidy.opts=list(width.cutoff=30),dev="png",fig.show="hide",
fig.width=4,fig.height=7,
message=FALSE, warning = FALSE)
```
# Introduction
Estimation of covariance or correlation matrices has widespread usage in a broad spectrum of statistical applications. The most commonly used estimator, namely the sample covariance or correlation matrix, is rank deficient and hence unstable in cases where the dimensionality of the problem (p) is greater than the number of samples (n). This problem has driven statisticians to suggest various alternative estimators in \textit{small n, large p} settings. Several estimators of correlation matrix have been proposed in such settings and their theoretical properties and performance comparisons have been studied comprehensively [@touloumis2015nonparametric, @ledoit2003improved, @bickel2008, @rothman2009]. Some of these methods are already available as R packages - **corpcor** [@schafer2004empirical], **glasso** [@friedman2008sparse], **PDSCE** [@rothman2009] etc.
These approaches, however, are not well suited for handling large scale missingness in data. Also, some of these methods work well under some specific sets of assumptions about the underlying matrix, for e.g. - thresholding estimators assume a banded structure of the correlation matrix. In this package, ee introduce a method *CorShrink* that adapts to varying degree of missingness in observations corresponding to each pair of features. Also, *CorShrink* can be applied directly to data consisting of missing values, as well as to derived quantities like vectors and matrices of correlations between features, and allows for two formulations - an asymptotic approach and a resampling based approach. Even in examples with no missing data, CorShrink estimated correlations are visibly closer to the true correlations compared to the standard methods. *CorShrink* also can be applied to other correlation-like quantities such as partial correlations, rank correlations and cosine similarity values from word2vec models.
---
# CorShrink Installation
**CorShrink** is a companion package to **ashr** R package [@stephens2016false]. Before installing **CorShrink**, please make sure you have the latest version of **ashr**.
```{r install_ash, eval=FALSE}
install.packages("ashr")
```
The other dependencies of this package include **SQUAREM**, **reshape2** and **Matrix**. Next we install **CorShrink**.
```{r install_corshrink_CRAN, eval=FALSE}
install.packages("CorShrink")
```
The development version can be installed from Github as well.
```{r install_corshrink_github, eval=FALSE}
library(devtools)
install_github("kkdey/CorShrink")
```
Then load the package with:
```{r load_countclust, cache=FALSE, eval=TRUE,warning=FALSE}
library(CorShrink)
```
---
# Methods
The main steps in **CorShrink** are as follows
- Convert the correlation between variables $i$ and $j$, $R_{ij}$ into Fisher z-score $Z_{ij}$ by the following transformation.
$$ Z_{ij} = 0.5 \log \left (\frac{1 + R_{ij}}{1 - R_{ij}} \right ) $$
- Estimate from data, the standard errors ($s_{ij}$) of these Z-scores $Z_{ij}$. This can be done in two ways in **CorShrink**. One approach uses an asymptotic normal approximation, where the standard errors are $s_{ij} = \frac{1}{n_{ij} - 3}$, with $n_{ij}$ being the number of complete observations between pair $(i,j)$ that generates the correlation $R_{ij}$. The other approach performs a re-sampling of the observations for the $(i,j)$ pair and obtains a Bootstrap estimate of the standard errors from the re-sampled $Z_{ij}$.
- Apply adaptive shrinkage (*ash* due [@stephens2016false]) on the pairs $(Z_{ij}, s_{ij})$ either across all $i$ and $j$ pairs (matrix format) or along all $i$ for one $j$, or along all $j$ for one $i$ (vector formats).
$$ Z^{\star}_{ij} : = ash \; (Z_{ij}, s_{ij}) $$
The matrix format shrinkage is performed by the *CorShrinkMatrix* function while the vector format shrinkage is performed by the *CorShrinkVector* function.
- Reverse transform the posterior mean of the shrunk Fisher z-scores from the previous step.
$$ R^{\star}_{ij} = \frac{exp \; (2 Z^{\star}_{ij}) - 1}{exp \; (2 Z^{\star}_{ij}) + 1} $$
- If the user is attempting to estimate a correlation matrix ((R_{ij})) for a number of variables $(i,j)$, then a requisite condition is positive definiteness of the estimate. However, the $((R^{\star}_{ij}))$ matrix obtained above may not be positive definite. We work around this problem by choosing a positive definite matrix $((R^{{\star}{\star}}_{ij}))$ nearest to the above estimate (see [@higham2002computing]).
---
# Illustration
We load an example data matrix - the person (544) by tissue samples (53) gene expression data for the gene *ENSG00000166819* collected from the [Genotype Tissue Expression (GTEx) Project](https://www.gtexportal.org/home/) .
```{r load_data, eval=TRUE, warning = FALSE}
data("sample_by_feature_data")
```
Just by checking the first few rows and columns, we see that the data contains many missing values. The data is
```{r top_data, eval=TRUE, warning = FALSE}
sample_by_feature_data[1:5,1:5]
```
## Standard version *CorShrink*
### *CorShrinkData*
We estimate the adaptively shrunk correlation matrix for this data using **CorShrink**.
```{r corshrink_data, eval=TRUE, warning = FALSE, message=FALSE, fig.align = "left", fig.show="asis", dpi=50, fig.width=13, fig.height=10}
out <- CorShrinkData(sample_by_feature_data, sd_boot = FALSE, image = "both",
image.control = list(tl.cex = 0.8))
```
The function outputs a list with two elements which are two versions of CorShrink estimated matrices - `cor` and `cor_before_PD`.`cor` is the nearest positive definite approximation ($R^{{\star}{\star}}$) to `cor_before_PD` version ($R^{{\star}}$) as described in the methods above. When `image = "both"`, the function plots the images for both these versions.
To see whether the method works well, check if the these two versions are close to each other.
```{r corshrink_output, eval = TRUE, warning = FALSE}
out$cor_before_PD[1:5,1:5]
out$cor[1:5, 1:5]
```
### *CorShrinkMatrix*
**CorShrink** takes as input not just the samples by features data matrix but also a matrix of pairwise correlations with a matrix of number of samples for each pair contributing to the correlation.
```{r corshrinkmat, eval=TRUE, warning = FALSE, warning = FALSE, message=FALSE, fig.align = "left", fig.show="asis", dpi=50, fig.width=13, fig.height=10}
data("pairwise_corr_matrix")
data("common_samples")
out <- CorShrinkMatrix(pairwise_corr_matrix, common_samples, image = "both",
image.control = list(tl.cex = 0.8))
```
### *CorShrinkVector*
**CorShrink** can be applied to vectors of correlations as well.
```{r corshrink_vec, eval=TRUE, warning = FALSE, message = FALSE}
cor_vec <- c(-0.56, -0.4, 0.02, 0.2, 0.9, 0.8, 0.3, 0.1, 0.4)
nsamp_vec <- c(10, 20, 30, 4, 50, 60, 20, 10, 3)
out <- CorShrinkVector(corvec = cor_vec, nsamp_vec = nsamp_vec)
out
```
Note that the correlations computed from adequate amount of data as for the 5th and 6th entries above, the amount of shrinkage is minimal, while it is substantial for the 4th and 9th entries which correspond to small number of samples.
## Re-sampling version *CorShrink*
We have so far looked at *CorShrinkData*, *CorShrinkMatrix* and *CorShrinkVector*, three functions that provide adaptive shrinkage of correlations at the level of the data matrix, matrix of correlations and vector of correlations respectively. In the above examples, we have used the asymptotic version of our algorithm (see Methods). Next we show example usage of a resampling based version of *CorShrink*.
### *CorShrinkData* - resampling
```{r corshrink_data_boot, eval=TRUE, warning = FALSE, message=FALSE, fig.align = "left", fig.show="asis", dpi=50, fig.width=13, fig.height=10}
out <- CorShrinkData(sample_by_feature_data, sd_boot = TRUE, image = "both",
image.control = list(tl.cex = 0.8))
```
The algorithm works by first computing a Bootstrap estimate of the standard error of the Fisher z-scores for each pair and then using this estimate together with the correlations to shrink the latter.
### *CorShrinkMatrix* - resampling
The breakdown can be formulated at the level of a correlation matrix as follows.
```{r corshrink_mat_boot, eval=TRUE, warning = FALSE, message=FALSE, fig.align = "left", fig.show="asis", dpi=50, fig.width=13, fig.height=10}
zscoreSDmat <- bootcorSE_calc(sample_by_feature_data, verbose = FALSE)
out <- CorShrinkMatrix(pairwise_corr_matrix, zscore_sd = zscoreSDmat, image = "both",
image.control = list(tl.cex = 0.8))
```
## *pCorShrink*
We can use the *pCorShrinkData* function to adaptively shrink partial correlations. This approach is
analogous to GLASSO, CLIME and other sparse graphical model methods, but **pCorShrinkData** shrinks the edge weights in the graph to 0 adaptively.
As per current implementation *pCorShrinkData* does not handle missing observations in the data matrix. So, we demonstrate the use of this method on a fully observed simulated data.
### Simulated setting
```{r warning = FALSE, message=FALSE, fig.align = "left", fig.show="asis", dpi=50, fig.width=18, fig.height=15}
library(Matrix)
n <- 500
P <- 100
block <- 10
mat <- 0.3*diag(1,block) + 0.7*rep(1,block) %*% t(rep(1, block))
Sigma <- Matrix::bdiag(mat, mat, mat, mat, mat, mat, mat, mat, mat, mat)
corSigma <- cov2cor(Sigma)
pcorSigma <- corpcor::cor2pcor(corSigma) ## true partial correlation matrix
```
### Generate Data
```{r}
################## Generate data ################
data <- MASS::mvrnorm(n,rep(0,P),Sigma)
```
### pCorShrinkData
```{r warning = FALSE, message=FALSE, fig.align = "left", fig.show="asis", dpi=50, fig.width=18, fig.height=15}
out1 <- pCorShrinkData(data, reg_type = "lm")
```
### Visualization
```{r warning = FALSE, message=FALSE, fig.align = "left", fig.show="asis", dpi=50, fig.width=10, fig.height=8}
library(corrplot)
col2 <- c("blue", "white", "red")
par(mfrow=c(1,2))
corrplot::corrplot(pcorSigma, diag = FALSE,
col = colorRampPalette(col2)(200),
tl.pos = "td", tl.cex = 0.2, tl.col = "black",
rect.col = "white",na.label.col = "white", mar=c(2,2,2,2),
method = "color", type = "upper", title = "original")
corrplot::corrplot(out1, diag = FALSE,
col = colorRampPalette(col2)(200),
tl.pos = "td", tl.cex = 0.2, tl.col = "black",
rect.col = "white",na.label.col = "white", mar=c(2,2,2,2),
method = "color", type = "upper", title = "pCorShrink")
```
---
# Extras
So far, in all our examples, we assumed that the estimated correlations between any pair of variables is shrunk towards 0. But *CorShrink* allows the user to choose a non-zero shrinkage target, estimated from the data, using the `mode` option in `ash.control` input.
One can choose a fixed non-zero target in `mode` as well.
```{r extras_1, eval=TRUE, warning = FALSE, message=FALSE, fig.align = "left", fig.show="asis", dpi=50, fig.width=13, fig.height=10}
par(mfrow=c(1,2))
out1 <- CorShrinkData(sample_by_feature_data, sd_boot = FALSE, image = "corshrink", image.control = list(title = "CorShrink (target = 0)", tl.cex = 0.8))
out2 <- CorShrinkData(sample_by_feature_data, sd_boot = FALSE, image = "corshrink", ash.control = list(mode = "estimate"),
image.control = list(title = "CorShrink (target = estimated)", tl.cex = 0.8))
```
The `image = "output"` option just outputs the image for the shrunk matrix without plotting it.
In general, *CorShrink* assumes a normal prior for the population Fisher z-scores. But under specific settings, a non-symmetric distribution , such as uniform or half-uniform could be a better fit. This can be achieved using the `mixcompdist` in `ash.control`.
```{r extras_2, eval=TRUE, warning = FALSE, message=FALSE, fig.align = "left", fig.show="asis", dpi=50, fig.width=13, fig.height=10}
par(mfrow=c(2,2))
out1 <- CorShrinkData(sample_by_feature_data,sd_boot = FALSE, image ="corshrink",
ash.control = list(mixcompdist = "normal"),
image.control = list(tl.cex = 0.6))
out2 <- CorShrinkData(sample_by_feature_data,sd_boot = FALSE, image ="corshrink",
ash.control = list(mixcompdist = "uniform"),
image.control = list(tl.cex = 0.6))
out3 <- CorShrinkData(sample_by_feature_data,sd_boot = FALSE, image ="corshrink",
ash.control = list(mixcompdist = "halfuniform"),
image.control = list(tl.cex = 0.6))
out4 <- CorShrinkData(sample_by_feature_data,sd_boot = FALSE, image ="corshrink",
ash.control = list(mixcompdist = "+uniform"),
image.control = list(tl.cex = 0.6))
```
---
# Acknowledgements
We would like to thank the GTEx Consortium, John Blischak, Sarah Urbut, Chiaowen Joyce Hsiao, Peter Carbonetto and all members of the Stephens Lab.
---
# Session Info
```{r session_info, eval=TRUE}
sessionInfo()
```
---
# References