-
Notifications
You must be signed in to change notification settings - Fork 0
/
extracting_from_data_model.Rmd
143 lines (112 loc) · 4.5 KB
/
extracting_from_data_model.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
---
title: "Extracting data from the GREGoR data model"
output: html_notebook
---
This notebook has examples of how to extract data from workspace tables in the
[GREGoR data model](https://github.com/UW-GAC/gregor_data_models). We use the
[AnVIL R package](https://bioconductor.org/packages/AnVIL) to retrieve the
data tables, and the [dplyr package](https://dplyr.tidyverse.org) to join tables
in R. Both of these packages are installed in the AnVIL RStudio and Jupyter
cloud environments.
# Identify a participant and family members
Let's say we are interested in retrieving data for a participant as well as all
family members. We can use the "participant" table to link all family members
by their family_id.
We read tables from a workspace with the `avtable` function.
```{r}
library(AnVIL)
library(dplyr)
workspace <- "GREGOR_COMBINED_CONSORTIUM_U07"
namespace <- "gregor-dcc"
participant <- avtable("participant", name=workspace, namespace=namespace)
```
Find the family_id for the participant, then filter the participant table to
all members of this family.
```{r}
my_participant_id <- "PMGRC-7-5-2"
my_family <- participant %>%
filter(participant_id == my_participant_id) %>%
select(family_id) %>%
left_join(participant, by="family_id")
my_family
```
Find phenotypes for all family members.
```{r}
phenotype <- avtable("phenotype", name=workspace, namespace=namespace)
my_phenotype <- my_family %>%
select(participant_id) %>%
left_join(phenotype, by="participant_id")
my_phenotype
```
Use the family table to find more details about the family.
```{r}
family <- avtable("family", name=workspace, namespace=namespace)
my_family_details <- family %>%
filter(family_id == unique(my_family$family_id))
my_family_details
```
# Find all analytes for participants in the family
The analyte table has information about the tissue that was used for the
experiments. We can filter this table by participant_id.
```{r}
analyte <- avtable("analyte", name=workspace, namespace=namespace)
my_analyte <- analyte %>%
filter(participant_id %in% my_family$participant_id)
my_analyte
```
# Find experiments for all participants in the family
In the combined consortium workspace, the "experiment" table
includes all experiments across data types and links them to participant ids,
making it more straightforward to find experiments for a given participant. We
still need to query the experiment tables for each data type represented to
find details on the experiments.
```{r}
experiment <- avtable("experiment", name=workspace, namespace=namespace)
my_experiment <- experiment %>%
filter(participant_id %in% my_family$participant_id)
my_experiment
experiment_table_names <- unique(my_experiment$table_name)
experiment_tables <- lapply(experiment_table_names, avtable, name=workspace, namespace=namespace)
names(experiment_tables) <- experiment_table_names
experiment_details <- list()
for (t in experiment_table_names) {
experiment_details[[t]] <- my_experiment %>%
filter(table_name == t) %>%
select(participant_id, id_in_table) %>%
left_join(experiment_tables[[t]], by=c("id_in_table"=paste0(t, "_id")))
}
experiment_details
```
# Find all data files associated with a family
In the combined consortium workspace, the "aligned" table
includes all aligned read files across data types and links them to participant ids,
making it more straightforward to find files for a given participant. We
still need to query the aligned tables for each data type represented to
find details on the alignment.
```{r}
aligned <- avtable("aligned", name=workspace, namespace=namespace)
my_aligned <- aligned %>%
filter(participant_id %in% my_family$participant_id)
my_aligned
aligned_table_names <- unique(my_aligned$table_name)
aligned_tables <- lapply(aligned_table_names, avtable, name=workspace, namespace=namespace)
names(aligned_tables) <- aligned_table_names
aligned_details <- list()
for (t in aligned_table_names) {
aligned_details[[t]] <- my_aligned %>%
filter(table_name == t) %>%
select(participant_id, id_in_table) %>%
left_join(aligned_tables[[t]], by=c("id_in_table"=paste0(t, "_id")))
}
aligned_details
```
# Export a data table to use IGV
We can add a new table to the workspace with a selection of files we'd like to open in IGV.
Since each table in AnVIL must have a primary key <table_name>_id, we create an
index for a table called "example" before importing it.
```{r}
aligned_details$aligned_dna_short_read %>%
tibble::rownames_to_column() %>%
rename(example_id = rowname) %>%
avtable_import(entity="example_id")
```