-
Notifications
You must be signed in to change notification settings - Fork 0
/
mofa.qmd
463 lines (334 loc) · 24 KB
/
mofa.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
# Integration with MOFA {#sec-mofa}
```{r}
#| child: "_setup.qmd"
```
```{r loading-packages}
#| include: false
library(targets)
library(moiraine)
library(MOFA2)
## For printing error message
library(dplyr)
library(stringr)
```
```{r setup-visible}
#| eval: false
library(targets)
library(moiraine)
## For integration with MOFA
library(MOFA2)
```
Now that the omics datasets have been appropriately pre-processed and pre-filtered, we are ready to perform the actual data integration step. In this chapter, we will show how to perform multi-omics data integration with the MOFA method from the [`MOFA2`](https://biofam.github.io/MOFA2/index.html) package.
As a reminder, here is what the `_targets.R` script should look like so far:
<details>
<summary>`_targets.R` script</summary>
```{r previous-vignettes}
#| echo: false
#| results: asis
knitr::current_input() |>
get_previous_content() |>
cat()
```
</details>
## What is MOFA?
MOFA (for Multi-Omics Factor Analysis) is a method for the unsupervised integration of two or more omics datasets. It aims at uncovering the main axes of variations shared by all or a subset of the datasets, through the construction of a small number of latent factors. It can be assimilated to a generalisation of the PCA (Principal Components Analysis) to multiple datasets. The latent factors are constructed as sparse linear combinations of the omics features, highlighting the features that contribute to each axis of variation.
MOFA uses a Bayesian framework to decompose each dataset (referred to as "view" in the package documentation) into the product of a latent factor matrix (representing to the main axes of variation) and a matrix of feature weights (indicating the extent to which the features contribute to the different latent factors). MOFA applies two levels of sparsity. The first level of sparsity is used to detect whether a given latent factor is active (i.e. explains variation) in each dataset. This allows to assess which sources of variations are shared across the datasets. The second level of sparsity is applied to the features, allowing to assess which features contribute to each source of variation.
MOFA requires as input matrices of omics measurements that must have at least some samples in common, but accepts some samples being absent from some of the datasets. It is recommended to have at least 15 samples in common across the datasets in order to obtain sensible results. The number of features in each dataset must not be too small (i.e. at least 15). It is critical that each dataset is properly normalised, with batch effects removed. In addition, it is strongly recommended that the datasets are pre-filtered to retain highly variable features; a similar number of features should be retained across the datasets. Missing values are not a problem with MOFA. MOFA can also accept samples and features metadata as data-frames. Lastly, there is a number of parameters that can be customised. The most important ones will be mentioned in this chapter; for a complete list see the [MOFA tutorial](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/getting_started_R.html).
::: {.callout-note}
A characteristic of the MOFA2 package is that the model training part of the analysis is done in Python with the `mofapy2` module. MOFA2 interacts with Python from R with the `reticulate` package. This is important to know as it might affect the installation and run depending on the environment in which it is used.
:::
### A note on groups in MOFA
MOFA offers the option to assign samples to groups (referred to as the multi-group framework within MOFA). It is important to understand what this does before using it. In particular, if the aim of the analysis is to find features that separate samples based on a grouping, you should not use this multi-group framework. By setting sample groups in MOFA, it will ignore sources of variations that discriminate the groups. Instead, this multi-group framework will seek sources of variations that are shared between the different groups, and sources of variations that are exclusive to a specific group. For example, in [this analysis](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/scNMT_gastrulation.html), the multi-group framework was used to split samples (cells) based on their developmental stage. The goal of the analysis was to find coordinated variation between the different datasets, and detect at which developmental stage (i.e. in which group) it occurred, rather than trying to differentiate the different developmental stages.
### MEFISTO
MEFISTO is an extension of MOFA that explicitly accounts for some continuous structure amongst the samples, i.e. temporal or 2D spatial relationships between the observations. The constructed latent factors' associated with the covariate is assessed, to see whether they represent smooth variation which correlates with the covariate, or whether they are independent of the covariate. The trained model can then be used to extrapolate to unseen values of the covariate (e.g. missing time-points or physical locations). When using the multi-group framework, MEFISTO can also align the values of the covariate between the different groups, for example to account for differences in developmental speed between two groups of patients.
## Creating the MOFA input
The first step is to transform the `MultiDataSet` object into a suitable format for the `MOFA2` package. This is done through the `get_input_mofa()` function. In addition to the `MultiDataSet` object, this function accepts as arguments:
- `datasets`: a vector of dataset names that dictates which datasets from the multi-omics set should be included in the analysis (by default, all datasets are included);
- `groups`: the name of the column in the samples metadata to be used as group indicator if using the multi-group framework (if not set, the multi-group framework will not be used);
- `options_list`: a list of options to customise the MOFA model -- more information below;
- `only_common_samples`: whether only the samples present in all datasets should be retained. The default value is `FALSE`; this argument should be set to `TRUE` when some datasets contain many more samples than others; when only a few samples are missing from some of the datasets, it is ok to leave it to `FALSE`.
We expand on the options that can be set with the `options_list` argument below.
### MOFA parameters
MOFA has three type of parameters (referred to as options) that the user can control.
#### Data options
These are options for data handling when creating the MOFA model. The two important options are:
- `scale_views`: logical, whether the datasets (views) should be scaled to unit variance. The default is `FALSE`, but it is recommended to set it to `TRUE` if the scale difference between the datasets is high.
- `scale_groups`: logical, whether to scale the groups to unit variance. Irrelevant unless using the multi-group framework. The default is `FALSE`, but it is recommended to set it to `TRUE` if the scale difference between the groups is high.
See the documentation for `get_default_data_options()` for a full list.
#### Model options
These are the options defining different aspects of the MOFA model. The most important ones are:
- `num_factors`: The maximum number of factors to construct (note that this is a maximum, i.e. MOFA can return a lower number of factors if there is not a lot of variation in the datasets). The default is set to 15, which is a good start. This can be adjusted after running MOFA with the default value.
- `likelihoods`: Named character vector, the type of likelihood to use for each dataset. MOFA offers three options: Gaussian likelihood (`'gaussian'`) for continuous data, Bernoulli likelihood (`'bernoulli'`) for binary data, and Poisson likelihood `('poisson')` for count data. It is highly recommended to transform the datasets in order to use a Gaussian likelihood: for example, applying a variance-stabilising transformation on RNAseq data rather than using the raw read counts with a Poisson likelihood. By default, a Gaussian likelihood is chosen for all datasets.
See the documentation for `get_default_model_options()` for a full list.
#### Training options
These are the options that control how the model is trained. The most important one is:
- `seed`: setting the random seed. This is standard practice, to make sure that the results are reproducible.
See the documentation for `get_default_training_options()` for a full list.
#### Passing the parameters to `get_input_mofa`
When creating a MOFA input object with `get_input_mofa()`, all options will be set to their default values. It is possible to customise the values for some of the options through the `options_list` parameter. This should be a named list, with up to three elements (one per type of options that MOFA accepts): `data_options`, `model_options`, and `training_options`. Each element is itself a named list, where each element of the list corresponds to a specific option. All three elements do not have to be present; for example, if we want to only specify a value for the model option `num_factors`, we can set `options_list` to:
```{r options-list-example1}
#| eval: false
list(
model_options = list(num_factors = 10)
)
```
If we also want to set the likelihoods and the random seed, then `options_list` becomes:
```{r options-list-example2}
#| eval: false
list(
model_options = list(num_factors = 10),
training_options = list(seed = 43)
)
```
For our example, we'll make sure that each dataset is scaled to unit variance, and that a Poisson likelihood is used for the genomics data, for which we have variants dosage. Since there are only 9 samples that are not present in all omics datasets, we can keep them in the analysis. We'll set the random seed to ensure that the results are reproducible:
::: {.targets-chunk}
```{targets get-input-mofa}
tar_target(
mofa_input,
get_input_mofa(
mo_presel_supervised,
options_list = list(
data_options = list(scale_views = TRUE),
model_options = list(likelihoods = c(
"snps" = "poisson",
"rnaseq" = "gaussian",
"metabolome" = "gaussian")
),
training_options = list(seed = 43)
),
only_common_samples = FALSE
)
)
```
:::
This will produce a warning:
```{r get-input-mofa-warnings}
#| echo: false
#| message: false
res <- get_input_mofa(tar_read(mo_presel_supervised),
options_list = list(
data_options = list(scale_views = TRUE),
model_options = list(likelihoods = c(
"snps" = "poisson",
"rnaseq" = "gaussian",
"metabolome" = "gaussian"
))
)
)
```
The warning informs us that we have chosen to use a Poisson likelihood for the genomics dataset, but the latter does not have integer data. This is because the missing values imputed with NIPALS-PCA (see @sec-preprocessing-missing-values) are continuous rather than discrete. This is taken care of by automatically rounding the dataset; but in other settings this warning can be an indication that the Poisson likelihood is not appropriate for the dataset.
::: {.callout-note}
When constructing the MOFA object, if there exists a column named `group` in a samples metadata table in the `MultiDataSet` object, the column will be renamed as `group_metadata` in the resulting MOFA input object. This is because MOFA needs a `group` column its own version of the samples metadata (for the multi-group framework), so there cannot be another column named `group`. Note that will have no effect on the `MultiDataSet` object, but is important to remember if you want to use some of the plotting functionalities that `MOFA2` offers.
:::
The output of the `get_input_mofa()` function is a MOFA object, more specifically an untrained model:
```{r mofa-input}
tar_load(mofa_input)
mofa_input
```
We did not use the multi-group framework, so there is only one samples group.
### MEFISTO input
In a similar way, we can construct the input object for a MEFISTO run with the `get_input_mefisto()` function. It is very similar to `get_input_mofa()`, except that is expects as second argument the name or names of the columns in the samples metadata tables corresponding to the continuous covariate(s) which represent(s) the structure amongst the samples. Note that there can be at most two covariates. For example, if we had a time-series dataset with information about the time at which each observation was taken in a column called `time`, we would use:
::: {.targets-chunk}
```{targets get-input-mefisto}
tar_target(
mefisto_input,
get_input_mefisto(
mo_presel_supervised,
"time",
options_list = list(
data_options = list(scale_views = TRUE),
model_options = list(likelihoods = c(
"snps" = "poisson",
"rnaseq" = "gaussian",
"metabolome" = "gaussian")
),
training_options = list(seed = 43)
),
only_common_samples = FALSE
)
)
```
:::
MEFISTO has some specific options that can be set in a similar way to the data, model and training options for MOFA. The most important are:
* `warping`: Logical, whether the covariate(s) should be aligned between the groups. Only when using the multi-group framework.
* `warping_ref`: Character, if using the warping option, the name of the group that should be used as reference (the covariates for the other groups will be aligned to the covariates for this group).
* `new_values`: Numeric vector, values of the covariate(s) for which the factors should be predicted (for inter/extrapolation).
To set any of these options, they can be passed as a list to the `options_list` parameter as for the data, model and training options.
## Visualising the MOFA input
It is possible to represent the samples that are present or missing across the different datasets with the `plot_data_overview()` function implemented in `MOFA2`:
```{r plot-data-overview}
plot_data_overview(mofa_input)
```
The plot generated shows the dimensions of the datasets that will be analysed.
For reporting purposes, it can also be useful to summarise the different options used to create the MOFA model. This can be done with the `options_list_as_tibble()` function, which turns a list of parameters into a tibble presenting the name and value of each parameter. For example, we can list the data options used:
```{r mofa-options-list-as-tibble}
options_list_as_tibble(mofa_input@data_options)
```
## Training the model
Once we have prepared our MOFA input, we can train the model with the `run_mofa()` function (from the `MOFA2` package). This is done through Python, so there might be some issues when trying to run the function for the first time after installing `MOFA2` if the configuration is not correct. If that is the case, see the MOFA2 tutorial (e.g. their [troubleshooting section](https://biofam.github.io/MOFA2/troubleshooting.html)) for tips on how to solve this.
By default, the function will save the resulting trained model in a temporary file, with the `.hdf5` format. It can be preferable to save the result into the output folder of your project; here as we are using targets to save the results of each step of the analysis, we do not need to do so. In addition, is it strongly recommended to set the `save_data` parameter to `TRUE`, as otherwise without the data, some of the visualisation options might not be available. Lastly, the `use_basilisk` parameter might have to be switched to `FALSE` if there are any issues with calling the Python module:
::: {.targets-chunk}
```{targets run-mofa}
tar_target(
mofa_trained,
run_mofa(
mofa_input,
save_data = TRUE,
use_basilisk = TRUE
)
)
```
:::
In our case, the following warnings are returned:
```{r run-mofa-warnings}
#| echo: false
tar_meta(fields = "warnings") |>
filter(name == "mofa_trained") |>
pull(warnings) |>
str_replace("\\.\\. ", ".\n\n") |>
warning()
```
The first one has to do with saving the model into a temporary file. We will explain the others in more detail when analysing the resulting model.
The output of the function is a trained MOFA model, with 15 latent factors:
```{r mofa-trained}
tar_load(mofa_trained)
mofa_trained
```
## Results interpretation
In @sec-interpretation, we show the different functionalities implemented in the `moiraine` package that facilitate the interpretation of the results from an integration tool. In this section, we show some of the MOFA-specific plots that can be generated to help interpret the results of a MOFA run.
### Variance explained
When training the model, MOFA2 constructs latent factors, which are sparse combination of the features, and which represent main axes of variation shared across all or a subsets of the datasets. One of the first steps to take when analysing the trained model is to assess the amount of variance in the datasets explained by each factor; similarly to what we would do when running a PCA.
The function `plot_variance_explained()` from `MOFA2` displays the percentage of variance in each dataset explained by each factor as a heatmap. The `x` and `y` arguments (and also `split_by` argument when using the multi-group framework) control whether the factors or the datasets should be represented as the rows or columns in the heatmap:
```{r plot-variance-explained}
plot_variance_explained(
mofa_trained,
x = "view", ## datasets on the x-axis
y = "factor" ## factors on the y-axis
)
```
Here, we see that factor 1 explains a lot (almost 50%) of variation in the transcriptomics dataset, a little (around 10%) in the metabolomics dataset and almost none in the genomics dataset. Factor 2 explains around 20% of variation in the genomics dataset, and none in the other datasets. Factors 3 and 4 seem specific to the transcriptomics datasets, and factors 5 to 15 explain very little variation; seeing this, we could re-train the MOFA model by setting the number of factors to 4 or 5.
In addition, the `plot_variance_explained()` function can create a second plot displaying the total percentage of variance explained by all the factors for each dataset, if the argument `plot_total` is set to `TRUE` :
```{r plot-total-variance-explained}
plot_variance_explained(
mofa_trained,
x = "view",
y = "factor",
plot_total = TRUE
)[[2]] ## show only the 2nd plot
```
The factors explain almost 85% of the variation in the transcriptomics dataset, but less than 30% in in the genomics and metabolomics datasets.
### Correlation between factors
Before moving to the interpretation, it is always good practice to check the correlation between the different computed factors. All factors should be mostly uncorrelated; if a large correlation is observed between some factors, it could be an indication of poor model fit, either because too many factors were computed (in this case, try reducing the number of factors to compute or increasing the number of iterations for the model training via the `convergence_mode` training option) or maybe because of an improper normalisation of the datasets. The function `plot_factor_cor()` from `MOFA2` displays the correlation between all factors via the `corrplot::corrplot()` function:
```{r plot-factors-correlation}
plot_factor_cor(
mofa_trained,
type = "upper",
diag = FALSE,
tl.cex = 0.8
)
```
There is a moderate positive correlation between factors 2 and 5, but nothing to be concerned about.
### Correlation between factors and samples covariates
To assess which sources of variation each factor is representing, we can check whether the factors are correlated with some known covariates that were recorded in the samples metadata. The `mofa_plot_cor_covariates()` function displays the correlation between each factor and the samples covariates. Note that factor covariates are transformed to numerical group indicators in order to compute the correlations. This function is similar to the `MOFA2` function `correlate_factors_with_covariates()` except that it returns a ggplot (rather than creating a base R plot), and offers a few convenient features through the optional arguments. By default, the function uses all covariates recorded in the samples metadata; however we can focus on a subset of them by passing the column names to the `covariates` argument.
```{r mofa-plot-cor-covariates}
mofa_plot_cor_covariates(
mofa_trained
)
```
From this plot, we can spot some interesting trends. For example, factor 1, which explains variation in the transcriptomics and metabolomics datasets, is strongly correlated with disease status. It makes sense, since the transcriptomics dataset has been filtered to retain genes most associated with differences between healthly and infected animals, so we expect it to be the strongest source of variation in the dataset. Factor 2, which explains variation only in the genomics dataset, is strongly correlated with the genomics composition of the animals, which makes a lot of sense as more related individuals share similar genomics profiles. It also makes sense that this variation is less present in the other two datasets. Factor 5, which explains variation only in the transcriptomics dataset, seems to be modestly correlated with the RNASeq batches.
In general, it is a good idea to interpret this correlation plot together with the plot representing the percentage of variation explained in the datasets by each factor. However, it is necessary to investigate more before making claims about what the factors are representing.
### Visualising the top features
The `MOFA2` package offers a number of very useful visualisations to further interpret the results. For example, for a given factor and dataset of interest, we can visualise how the top contributing features vary with the factor values (i.e. the sample scores). For example, let's have a look at the top contributing features from the transcriptomics dataset for factor 1:
```{r plot-data-scatter}
plot_data_scatter(
mofa_trained,
factor = 1,
view = "rnaseq",
features = 6,
color_by = "status",
shape_by = "feedlot"
)
```
For all six genes, their expression is higher in infected animals compared with healthy ones.
We can also visualise the measurements of the top contributing features across the samples as a heatmap, for a given dataset and factor. That the `annotation_samples` argument allows us to add annotations on top of the heatmap to represent certain characteristics of the samples. For example here we'll add information about the phenotype group and chromosome:
```{r plot-data-heatmap}
MOFA2::plot_data_heatmap(
mofa_trained,
factor = 1,
view = "rnaseq",
features = 20,
annotation_samples = "status",
fontsize_col = 5
)
```
Note that `moiraine` implements similar functions to represent the top contributing features, as we will see in @sec-interpretation.
## Recap -- targets list
For convenience, here is the list of targets that we created in this section:
<details>
<summary>Targets list for MOFA analysis</summary>
::: {.targets-chunk}
```{targets recap-targets-list}
list(
## Creating MOFA input
tar_target(
mofa_input,
get_input_mofa(
mo_presel_supervised,
options_list = list(
data_options = list(scale_views = TRUE),
model_options = list(likelihoods = c(
"snps" = "poisson",
"rnaseq" = "gaussian",
"metabolome" = "gaussian")
),
training_options = list(seed = 43)
),
only_common_samples = FALSE
)
),
## Overview plot of the samples in each dataset
tar_target(
mofa_input_plot,
plot_data_overview(mofa_input)
),
## Training MOFA model
tar_target(
mofa_trained,
run_mofa(
mofa_input,
save_data = TRUE,
use_basilisk = TRUE
)
),
## Formatting MOFA output
tar_target(
mofa_output,
get_output(mofa_trained)
),
## Plots of variance explained
tar_target(
mofa_var_explained_plot,
plot_variance_explained(
mofa_trained,
x = "view", ## datasets on the x-axis
y = "factor" ## factors on the y-axis
)
),
tar_target(
mofa_total_var_explained_plot,
plot_variance_explained(
mofa_trained,
x = "view",
y = "factor",
plot_total = TRUE
)[[2]]
),
## Plot of factors correlation with covariates
tar_target(
mofa_factors_covariates_cor_plot,
mofa_plot_cor_covariates(mofa_trained)
)
)
```
:::
</details>