Utilities

Swan now comes with several utilities that can be used to compute and output various metrics using data in the SwanGraph.

We'll be using the same SwanGraph as the rest of the tutorial pages to demonstrate these utilities. Load it using the following code:

import swan_vis as swan

# code to download this data is in the Getting started tutorial
sg = swan.read('../tutorials/data/swan.p')
Read in graph from ../tutorials/data/swan.p

Calculating TPM values

Swan lets users calculate the TPM of their data, optionally grouped by a metadata column, using the calc_tpm() function. You can use it to calculate TPM from any of the AnnData objects in the SwanGraph (SwanGraph.adata for transcripts, SwanGraph.tss_adata for TSSs, SwanGraph.tes_adata for TESs, and SwanGraph.edge_adata for edges); see the Data structure FAQ page for more information on these tables.

First, we'll calculate the TPM for each transcript in each dataset:

df = swan.calc_tpm(sg.adata)
df.head()
tid ENST00000000233.9 ENST00000000412.7 ENST00000000442.10 ENST00000001008.5 ENST00000001146.6 ENST00000002125.8 ENST00000002165.10 ENST00000002501.10 ENST00000002596.5 ENST00000002829.7 ... TALONT000482711 TALONT000482903 TALONT000483195 TALONT000483284 TALONT000483315 TALONT000483322 TALONT000483327 TALONT000483978 TALONT000484004 TALONT000484796
hepg2_1 196.138474 86.060760 8.005652 46.032497 0.0 16.011305 258.182281 60.042389 0.0 0.0 ... 0.000000 4.002826 2.001413 12.008478 0.000000 0.000000 4.002826 14.009891 8.005652 0.000000
hepg2_2 243.975174 77.789185 7.071744 61.288448 0.0 12.964864 380.695557 64.824318 0.0 0.0 ... 1.178624 14.143488 4.714496 7.071744 2.357248 8.250368 2.357248 11.786240 10.607616 1.178624
hffc6_1 131.320969 194.355042 0.000000 107.683197 0.0 6.566049 278.400452 0.000000 0.0 0.0 ... 6.566049 13.132097 9.192468 1.313210 6.566049 9.192468 6.566049 0.000000 15.758516 1.313210
hffc6_2 137.061584 242.395935 0.000000 124.370689 0.0 8.883621 219.552338 0.000000 0.0 0.0 ... 15.229064 10.152709 6.345443 8.883621 1.269089 10.152709 15.229064 0.000000 16.498154 8.883621
hffc6_3 147.986496 273.205841 3.252450 172.379868 0.0 9.757351 200.025696 1.626225 0.0 0.0 ... 14.636026 11.383576 8.131125 8.131125 11.383576 11.383576 6.504900 0.000000 24.393377 9.757351

5 rows × 208306 columns
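As a rough illustration of the arithmetic, here is a minimal sketch that assumes calc_tpm() follows the usual long-read definition of TPM: each dataset's raw counts are scaled so that they sum to one million, with no length normalization. That is an assumption about the implementation, and the helper `tpm_per_dataset` and the toy counts are hypothetical, not part of Swan.

```python
def tpm_per_dataset(counts):
    """Scale one dataset's raw counts so that they sum to one million.

    Assumption: no transcript-length normalization is applied.
    """
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an empty dataset
        return {feature: 0.0 for feature in counts}
    return {feature: count * 1e6 / total for feature, count in counts.items()}


# Hypothetical raw counts for a single dataset (not taken from the tutorial data)
toy_counts = {"tx_a": 98, "tx_b": 43, "tx_c": 0, "tx_d": 59}
toy_tpm = tpm_per_dataset(toy_counts)
print(toy_tpm)
```

Because every dataset is scaled to the same total, the resulting values are comparable across datasets with different sequencing depths.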

We can swap out the first argument with the different AnnData structures in the SwanGraph. For instance, say we want to calculate the TPM of each TSS:

df = swan.calc_tpm(sg.tss_adata)
df.head()
tss_id ENSG00000000003.14_1 ENSG00000000003.14_2 ENSG00000000003.14_3 ENSG00000000003.14_4 ENSG00000000005.5_1 ENSG00000000005.5_2 ENSG00000000419.12_1 ENSG00000000419.12_2 ENSG00000000457.13_1 ENSG00000000457.13_2 ... TALONG000085596_1 TALONG000085799_1 TALONG000085978_1 TALONG000086022_1 TALONG000086057_1 TALONG000086218_1 TALONG000086443_1 TALONG000086539_1 TALONG000086553_1 TALONG000086766_1
hepg2_1 0.0 232.163910 0.0 0.0 0.0 0.0 0.0 54.038151 0.0 0.000000 ... 0.000000 0.000000 0.000000 60.042389 0.000000 0.000000 6.004239 0.000000 0.000000 8.005652
hepg2_2 0.0 276.976654 0.0 0.0 0.0 0.0 0.0 103.718910 0.0 2.357248 ... 0.000000 0.000000 0.000000 95.468544 0.000000 0.000000 31.822847 0.000000 1.178624 10.607616
hffc6_1 0.0 45.962341 0.0 0.0 0.0 0.0 0.0 101.117149 0.0 0.000000 ... 2.626419 6.566049 9.192468 0.000000 7.879258 11.818888 233.751328 9.192468 6.566049 15.758516
hffc6_2 0.0 53.301723 0.0 0.0 0.0 0.0 0.0 85.028938 0.0 1.269089 ... 6.345443 1.269089 12.690886 0.000000 8.883621 20.305418 119.294334 12.690886 2.538177 16.498154
hffc6_3 0.0 68.301460 0.0 0.0 0.0 0.0 0.0 89.442383 0.0 1.626225 ... 8.131125 8.131125 11.383576 0.000000 11.383576 17.888477 134.976685 27.645828 8.131125 24.393377

5 rows × 130176 columns

Finally, we can compute TPM over an alternative metadata column instead of per dataset. For instance, we can use the cell_line column:

df = swan.calc_tpm(sg.adata, obs_col='cell_line')
df.head()
tid ENST00000000233.9 ENST00000000412.7 ENST00000000442.10 ENST00000001008.5 ENST00000001146.6 ENST00000002125.8 ENST00000002165.10 ENST00000002501.10 ENST00000002596.5 ENST00000002829.7 ... TALONT000482711 TALONT000482903 TALONT000483195 TALONT000483284 TALONT000483315 TALONT000483322 TALONT000483327 TALONT000483978 TALONT000484004 TALONT000484796
hepg2 226.245346 80.854897 7.417881 55.634102 0.0 14.093972 335.288177 63.051983 0.0 0.0 ... 0.741788 10.385033 3.70894 8.901457 1.483576 5.192516 2.967152 12.610396 9.643245 0.741788
hffc6 138.145737 234.247116 0.924052 132.139404 0.0 8.316465 234.709137 0.462026 0.0 0.0 ... 12.012672 11.550647 7.85444 6.006336 6.006336 10.164569 9.702543 0.000000 18.481035 6.468362

2 rows × 208306 columns
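One way to picture grouping by a metadata column is sketched below. It assumes that raw counts are pooled across all datasets in a group before scaling to one million, rather than averaging per-dataset TPM values; this is an assumption about how obs_col behaves, and `grouped_tpm`, the replicate names, and the counts are hypothetical.

```python
from collections import defaultdict


def grouped_tpm(counts_by_dataset, group_of):
    """Pool raw counts per group, then scale each group to one million."""
    pooled = defaultdict(lambda: defaultdict(float))
    for dataset, counts in counts_by_dataset.items():
        for feature, count in counts.items():
            pooled[group_of[dataset]][feature] += count

    result = {}
    for group, counts in pooled.items():
        total = sum(counts.values())
        result[group] = {
            feature: (count * 1e6 / total if total else 0.0)
            for feature, count in counts.items()
        }
    return result


# Hypothetical replicates of one cell line with different sequencing depths
toy_counts = {
    "rep1": {"tx_a": 10, "tx_b": 90},
    "rep2": {"tx_a": 70, "tx_b": 230},
}
toy_groups = {"rep1": "line_x", "rep2": "line_x"}
toy_grouped = grouped_tpm(toy_counts, toy_groups)
print(toy_grouped)
```

Under this assumption, deeper replicates contribute more to the grouped value than shallower ones, so the result generally differs from the mean of the per-replicate TPMs.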

Calculating pi values

You can use the calc_pi() function to calculate percent isoform use (pi) per gene in nearly the same way as calc_tpm(): you can run it at the transcript, edge, TSS, or TES level, and you can choose the metadata variable to group by. The only difference is that calc_pi() also requires a DataFrame as its second argument that tells Swan which gene each entry comes from. The DataFrame to provide for each AnnData is listed below:

| AnnData | DataFrame |
| --- | --- |
| SwanGraph.adata | SwanGraph.t_df |
| SwanGraph.tss_adata | SwanGraph.tss_adata.var |
| SwanGraph.tes_adata | SwanGraph.tes_adata.var |
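To make the role of the second argument concrete, the following minimal sketch computes pi as the percentage of a gene's counts that come from each of its transcripts. It assumes pi is defined this way and that entries from genes with no counts are reported as 0; `percent_isoform`, the transcript-to-gene mapping, and the counts are hypothetical stand-ins, not Swan's implementation.

```python
from collections import defaultdict


def percent_isoform(counts, gene_of):
    """Return 100 * count / (total count of the parent gene) per transcript."""
    gene_totals = defaultdict(float)
    for tid, count in counts.items():
        gene_totals[gene_of[tid]] += count

    pi = {}
    for tid, count in counts.items():
        total = gene_totals[gene_of[tid]]
        # Assumption: transcripts of unexpressed genes get a pi of 0
        pi[tid] = 100.0 * count / total if total else 0.0
    return pi


# Hypothetical mapping (the role played by the second argument) and counts
toy_gene_of = {"tx_a": "gene_1", "tx_b": "gene_1", "tx_c": "gene_2"}
toy_counts = {"tx_a": 6, "tx_b": 4, "tx_c": 0}
toy_pi = percent_isoform(toy_counts, toy_gene_of)
print(toy_pi)
```

Without the mapping there would be no way to know which entries share a denominator, which is why calc_pi() needs the extra DataFrame while calc_tpm() does not.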

First, we'll calculate the pi value for each transcript in each dataset:

df, sums = swan.calc_pi(sg.adata, sg.t_df)
df.head()
tid ENST00000000233.9 ENST00000000412.7 ENST00000000442.10 ENST00000001008.5 ENST00000001146.6 ENST00000002125.8 ENST00000002165.10 ENST00000002501.10 ENST00000002596.5 ENST00000002829.7 ... TALONT000482711 TALONT000482903 TALONT000483195 TALONT000483284 TALONT000483315 TALONT000483322 TALONT000483327 TALONT000483978 TALONT000484004 TALONT000484796
hepg2_1 100.000000 100.0 100.000000 100.0 0.0 100.000000 100.0 93.750000 0.0 0.0 ... 0.000000 1.904762 6.666667 13.043478 0.000000 0.000000 1.333333 100.0 100.0 0.000000
hepg2_2 99.519226 100.0 60.000004 100.0 0.0 100.000000 100.0 80.882355 0.0 0.0 ... 5.263158 3.225806 13.793103 8.695652 0.884956 3.097345 0.884956 100.0 100.0 2.380952
hffc6_1 98.039215 100.0 0.000000 100.0 0.0 100.000000 100.0 0.000000 0.0 0.0 ... 2.604167 2.092050 16.279070 1.428571 0.854701 1.196581 0.854701 0.0 100.0 1.886792
hffc6_2 99.082573 100.0 0.000000 100.0 0.0 77.777779 100.0 0.000000 0.0 0.0 ... 4.285715 2.144772 11.627908 14.893617 0.166667 1.333333 2.000000 0.0 100.0 9.859155
hffc6_3 100.000000 100.0 100.000000 100.0 0.0 85.714287 100.0 100.000000 0.0 0.0 ... 4.326923 2.536232 15.151516 10.638298 1.711491 1.711491 0.977995 0.0 100.0 13.636364

5 rows × 208306 columns

Note that calc_pi() returns not only a table of pi values but also a table of counts per isoform per condition, which is used as an intermediate during DIE testing. It is returned here so that it does not need to be recalculated.

sums.head()
ENST00000000233.9 ENST00000000412.7 ENST00000000442.10 ENST00000001008.5 ENST00000001146.6 ENST00000002125.8 ENST00000002165.10 ENST00000002501.10 ENST00000002596.5 ENST00000002829.7 ... TALONT000482711 TALONT000482903 TALONT000483195 TALONT000483284 TALONT000483315 TALONT000483322 TALONT000483327 TALONT000483978 TALONT000484004 TALONT000484796
hepg2_1 98.0 43.0 4.0 23.0 0.0 8.0 129.0 30.0 0.0 0.0 ... 0.0 2.0 1.0 6.0 0.0 0.0 2.0 7.0 4.0 0.0
hepg2_2 207.0 66.0 6.0 52.0 0.0 11.0 323.0 55.0 0.0 0.0 ... 1.0 12.0 4.0 6.0 2.0 7.0 2.0 10.0 9.0 1.0
hffc6_1 100.0 148.0 0.0 82.0 0.0 5.0 212.0 0.0 0.0 0.0 ... 5.0 10.0 7.0 1.0 5.0 7.0 5.0 0.0 12.0 1.0
hffc6_2 108.0 191.0 0.0 98.0 0.0 7.0 173.0 0.0 0.0 0.0 ... 12.0 8.0 5.0 7.0 1.0 8.0 12.0 0.0 13.0 7.0
hffc6_3 91.0 168.0 2.0 106.0 0.0 6.0 123.0 1.0 0.0 0.0 ... 9.0 7.0 5.0 5.0 7.0 7.0 4.0 0.0 15.0 6.0

5 rows × 208306 columns

We can also calculate the pi value for the TSSs and TESs in each dataset:

df, sums = swan.calc_pi(sg.tss_adata, sg.tss_adata.var)
print(df.head())
print()

df, sums = swan.calc_pi(sg.tes_adata, sg.tes_adata.var)
print(df.head())
print()
tss_id   ENSG00000000003.14_1  ENSG00000000003.14_2  ENSG00000000003.14_3  \
hepg2_1                   0.0                 100.0                   0.0   
hepg2_2                   0.0                 100.0                   0.0   
hffc6_1                   0.0                 100.0                   0.0   
hffc6_2                   0.0                 100.0                   0.0   
hffc6_3                   0.0                 100.0                   0.0   

tss_id   ENSG00000000003.14_4  ENSG00000000005.5_1  ENSG00000000005.5_2  \
hepg2_1                   0.0                  0.0                  0.0   
hepg2_2                   0.0                  0.0                  0.0   
hffc6_1                   0.0                  0.0                  0.0   
hffc6_2                   0.0                  0.0                  0.0   
hffc6_3                   0.0                  0.0                  0.0   

tss_id   ENSG00000000419.12_1  ENSG00000000419.12_2  ENSG00000000457.13_1  \
hepg2_1                   0.0                 100.0                   0.0   
hepg2_2                   0.0                 100.0                   0.0   
hffc6_1                   0.0                 100.0                   0.0   
hffc6_2                   0.0                 100.0                   0.0   
hffc6_3                   0.0                 100.0                   0.0   

tss_id   ENSG00000000457.13_2  ...  TALONG000085596_1  TALONG000085799_1  \
hepg2_1                   0.0  ...                0.0                0.0   
hepg2_2                 100.0  ...                0.0                0.0   
hffc6_1                   0.0  ...              100.0              100.0   
hffc6_2                 100.0  ...              100.0              100.0   
hffc6_3                 100.0  ...              100.0              100.0   

tss_id   TALONG000085978_1  TALONG000086022_1  TALONG000086057_1  \
hepg2_1                0.0              100.0                0.0   
hepg2_2                0.0              100.0                0.0   
hffc6_1              100.0                0.0              100.0   
hffc6_2              100.0                0.0              100.0   
hffc6_3              100.0                0.0              100.0   

tss_id   TALONG000086218_1  TALONG000086443_1  TALONG000086539_1  \
hepg2_1                0.0              100.0                0.0   
hepg2_2                0.0              100.0                0.0   
hffc6_1              100.0              100.0              100.0   
hffc6_2              100.0              100.0              100.0   
hffc6_3              100.0              100.0              100.0   

tss_id   TALONG000086553_1  TALONG000086766_1  
hepg2_1                0.0              100.0  
hepg2_2              100.0              100.0  
hffc6_1              100.0              100.0  
hffc6_2              100.0              100.0  
hffc6_3              100.0              100.0  

[5 rows x 130176 columns]

tes_id   ENSG00000000003.14_1  ENSG00000000003.14_2  ENSG00000000003.14_3  \
hepg2_1                   0.0                 100.0                   0.0   
hepg2_2                   0.0                 100.0                   0.0   
hffc6_1                   0.0                 100.0                   0.0   
hffc6_2                   0.0                 100.0                   0.0   
hffc6_3                   0.0                 100.0                   0.0   

tes_id   ENSG00000000003.14_4  ENSG00000000003.14_5  ENSG00000000005.5_1  \
hepg2_1                   0.0                   0.0                  0.0   
hepg2_2                   0.0                   0.0                  0.0   
hffc6_1                   0.0                   0.0                  0.0   
hffc6_2                   0.0                   0.0                  0.0   
hffc6_3                   0.0                   0.0                  0.0   

tes_id   ENSG00000000005.5_2  ENSG00000000419.12_1  ENSG00000000419.12_2  \
hepg2_1                  0.0             92.592590                   0.0   
hepg2_2                  0.0             98.863640                   0.0   
hffc6_1                  0.0             98.701302                   0.0   
hffc6_2                  0.0             95.522385                   0.0   
hffc6_3                  0.0             98.181824                   0.0   

tes_id   ENSG00000000419.12_3  ...  TALONG000085596_1  TALONG000085799_1  \
hepg2_1              7.407407  ...                0.0                0.0   
hepg2_2              1.136364  ...                0.0                0.0   
hffc6_1              1.298701  ...              100.0              100.0   
hffc6_2              4.477612  ...              100.0              100.0   
hffc6_3              1.818182  ...              100.0              100.0   

tes_id   TALONG000085978_1  TALONG000086022_1  TALONG000086057_1  \
hepg2_1                0.0              100.0                0.0   
hepg2_2                0.0              100.0                0.0   
hffc6_1              100.0                0.0              100.0   
hffc6_2              100.0                0.0              100.0   
hffc6_3              100.0                0.0              100.0   

tes_id   TALONG000086218_1  TALONG000086443_1  TALONG000086539_1  \
hepg2_1                0.0              100.0                0.0   
hepg2_2                0.0              100.0                0.0   
hffc6_1              100.0              100.0              100.0   
hffc6_2              100.0              100.0              100.0   
hffc6_3              100.0              100.0              100.0   

tes_id   TALONG000086553_1  TALONG000086766_1  
hepg2_1                0.0              100.0  
hepg2_2              100.0              100.0  
hffc6_1              100.0              100.0  
hffc6_2              100.0              100.0  
hffc6_3              100.0              100.0  

[5 rows x 187454 columns]

And we can also choose to calculate pi values using a different metadata column, here shown on the cell_line column:

df, sums = swan.calc_pi(sg.adata, sg.t_df, obs_col='cell_line')
df.head()
tid ENST00000000233.9 ENST00000000412.7 ENST00000000442.10 ENST00000001008.5 ENST00000001146.6 ENST00000002125.8 ENST00000002165.10 ENST00000002501.10 ENST00000002596.5 ENST00000002829.7 ... TALONT000482711 TALONT000482903 TALONT000483195 TALONT000483284 TALONT000483315 TALONT000483322 TALONT000483327 TALONT000483978 TALONT000484004 TALONT000484796
hepg2 99.673203 100.0 71.428574 100.0 0.0 100.000000 100.0 85.0 0.0 0.0 ... 4.545455 2.935011 11.363637 10.434782 0.531915 1.861702 1.06383 100.0 100.0 1.785714
hffc6 99.006622 100.0 100.000000 100.0 0.0 85.714287 100.0 100.0 0.0 0.0 ... 3.823529 2.218279 14.285715 7.926829 0.815558 1.380176 1.31744 0.0 100.0 8.333334

2 rows × 208306 columns
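The grouped value for ENST00000000442.10 in hepg2 can be checked by hand from the tables above. The sketch below assumes that grouping pools raw counts across replicates before computing pi. The gene totals are not shown in the tutorial; they are inferred here from each replicate's count and pi value.

```python
# Counts for ENST00000000442.10 in the two HepG2 replicates (from the sums table)
counts = {"hepg2_1": 4.0, "hepg2_2": 6.0}
# Per-replicate pi values for the same transcript (from the per-dataset output)
pi = {"hepg2_1": 100.0, "hepg2_2": 60.000004}

# Inferred gene totals (not shown in the tutorial): count / (pi / 100)
gene_totals = {name: counts[name] / (pi[name] / 100.0) for name in counts}

# Pool counts across replicates first, then take the percentage
pooled_pi = 100.0 * sum(counts.values()) / sum(gene_totals.values())
# For comparison: the plain average of the two per-replicate pi values
mean_pi = sum(pi.values()) / len(pi)

print(round(pooled_pi, 2))
print(round(mean_pi, 2))
```

The pooled value agrees with the 71.428574 reported for hepg2, whereas averaging the two per-replicate values would give roughly 80, so the output for this transcript is consistent with pooling counts before computing pi.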

Obtaining edge abundance information

In case you're interested in doing outside analyses at the edge level (for instance, using intron counting to assess alternative splicing), Swan provides a tool to output a DataFrame with edge abundance on the dataset level.

To access the edge abundance DataFrame, use the get_edge_abundance() function:

df = sg.get_edge_abundance()
df.head()
strand edge_type annotation chrom start stop hepg2_1 hepg2_2 hffc6_1 hffc6_2 hffc6_3
0 + exon True chr1 11869 12227 0.0 0.0 0.0 0.0 0.0
1 + exon True chr1 12010 12057 0.0 0.0 0.0 0.0 0.0
2 + intron True chr1 12057 12179 0.0 0.0 0.0 0.0 0.0
3 + exon True chr1 12179 12227 0.0 0.0 0.0 0.0 0.0
4 + intron True chr1 12227 12613 0.0 0.0 0.0 0.0 0.0

You can also specify whether you want the data to be output in raw counts (kind='counts') or TPM (kind='tpm'). By default, this function returns counts. Here's an example with TPM:

df = sg.get_edge_abundance(kind='tpm')
df.head()
strand edge_type annotation chrom start stop hepg2_1 hepg2_2 hffc6_1 hffc6_2 hffc6_3
0 + exon True chr1 11869 12227 0.0 0.0 0.0 0.0 0.0
1 + exon True chr1 12010 12057 0.0 0.0 0.0 0.0 0.0
2 + intron True chr1 12057 12179 0.0 0.0 0.0 0.0 0.0
3 + exon True chr1 12179 12227 0.0 0.0 0.0 0.0 0.0
4 + intron True chr1 12227 12613 0.0 0.0 0.0 0.0 0.0

Finally, you can provide the function with a prefix, which indicates that the output DataFrame should also be saved in TSV form.

df = sg.get_edge_abundance(kind='tpm', prefix='test')
df.head()
strand edge_type annotation chrom start stop hepg2_1 hepg2_2 hffc6_1 hffc6_2 hffc6_3
0 + exon True chr1 11869 12227 0.0 0.0 0.0 0.0 0.0
1 + exon True chr1 12010 12057 0.0 0.0 0.0 0.0 0.0
2 + intron True chr1 12057 12179 0.0 0.0 0.0 0.0 0.0
3 + exon True chr1 12179 12227 0.0 0.0 0.0 0.0 0.0
4 + intron True chr1 12227 12613 0.0 0.0 0.0 0.0 0.0

The results will be saved in '{prefix}_edge_abundance.tsv'.
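For downstream analyses such as intron counting, the saved file can be read back with any TSV reader. The sketch below is illustrative only: it writes a small toy file that follows the '{prefix}_edge_abundance.tsv' naming pattern and assumes a header row with the column names shown in the output above; the exact layout of the file Swan writes (for example, whether an index column is included) is not confirmed here, and all values are made up.

```python
import csv
import os
import tempfile

# Toy stand-in for a saved edge abundance table (hypothetical values)
columns = ["strand", "edge_type", "annotation", "chrom", "start", "stop",
           "hepg2_1", "hepg2_2"]
toy_rows = [
    ["+", "exon", "True", "chr1", "100", "200", "5.0", "7.0"],
    ["+", "intron", "True", "chr1", "200", "300", "5.0", "6.0"],
    ["+", "exon", "True", "chr1", "300", "400", "5.0", "7.0"],
    ["+", "intron", "False", "chr1", "200", "350", "0.0", "1.0"],
]

prefix = os.path.join(tempfile.mkdtemp(), "toy")
path = f"{prefix}_edge_abundance.tsv"
with open(path, "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(columns)
    writer.writerows(toy_rows)

# Read the file back and keep only the intron edges
with open(path, newline="") as handle:
    reader = csv.DictReader(handle, delimiter="\t")
    introns = [row for row in reader if row["edge_type"] == "intron"]

# Index abundance in one dataset by intron coordinates
intron_counts = {
    (row["chrom"], int(row["start"]), int(row["stop"])): float(row["hepg2_2"])
    for row in introns
}
print(intron_counts)
```

Keying introns by their coordinates makes it straightforward to compare alternative introns that share a start or stop position across datasets.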

Obtaining TSS/TES abundance information

Similarly, if you wish to analyze your TSS or TES data, you can output it using the get_tss_abundance() and get_tes_abundance() functions, respectively. These take the same options as get_edge_abundance(), so they can output either counts or TPM and optionally save to an output file.

First, let's output the TSS TPM to a file:

df = sg.get_tss_abundance(kind='tpm', prefix='test')
df.head()
tss_id gid gname vertex_id tss_name chrom coord hepg2_1 hepg2_2 hffc6_1 hffc6_2 hffc6_3
0 ENSG00000000003.14_1 ENSG00000000003.14 TSPAN6 926111 TSPAN6_1 chrX 100636191 0.00000 0.000000 0.000000 0.000000 0.00000
1 ENSG00000000003.14_2 ENSG00000000003.14 TSPAN6 926112 TSPAN6_2 chrX 100636608 232.16391 276.976654 45.962341 53.301723 68.30146
2 ENSG00000000003.14_3 ENSG00000000003.14 TSPAN6 926114 TSPAN6_3 chrX 100636793 0.00000 0.000000 0.000000 0.000000 0.00000
3 ENSG00000000003.14_4 ENSG00000000003.14 TSPAN6 926117 TSPAN6_4 chrX 100639945 0.00000 0.000000 0.000000 0.000000 0.00000
4 ENSG00000000005.5_1 ENSG00000000005.5 TNMD 926077 TNMD_1 chrX 100585066 0.00000 0.000000 0.000000 0.000000 0.00000

Now we'll get the counts of each TES without saving to a file:

df = sg.get_tes_abundance(kind='counts')
df.head()
tes_id gid gname vertex_id tes_name chrom coord hepg2_1 hepg2_2 hffc6_1 hffc6_2 hffc6_3
0 ENSG00000000003.14_1 ENSG00000000003.14 TSPAN6 926092 TSPAN6_1 chrX 100627109 0.0 0.0 0.0 0.0 0.0
1 ENSG00000000003.14_2 ENSG00000000003.14 TSPAN6 926093 TSPAN6_2 chrX 100628670 116.0 235.0 35.0 42.0 42.0
2 ENSG00000000003.14_3 ENSG00000000003.14 TSPAN6 926097 TSPAN6_3 chrX 100632063 0.0 0.0 0.0 0.0 0.0
3 ENSG00000000003.14_4 ENSG00000000003.14 TSPAN6 926100 TSPAN6_4 chrX 100632541 0.0 0.0 0.0 0.0 0.0
4 ENSG00000000003.14_5 ENSG00000000003.14 TSPAN6 926103 TSPAN6_5 chrX 100633442 0.0 0.0 0.0 0.0 0.0