Spliceogen is an integrative, scalable tool for the discovery of splice-altering variants. Variants are assessed for their potential to create or disrupt any of the cis motifs which guide splice site definition: donors, acceptors, branchpoints, enhancers and silencers. Spliceogen integrates scores from MaxEntScan [1], GeneSplicer [2], ESRseq [3] and Branchpointer [4], and provides predictions based on logistic regression models trained on reported splice-altering variants [5]. Spliceogen accepts VCF/TSV inputs and handles both SNVs and indels.
Publication: https://doi.org/10.1093/bioinformatics/btz263
Maintainer: Steve Monger - [email protected]
Navigate to your desired installation directory and clone this repository:
> git clone https://github.com/VCCRI/Spliceogen.git Spliceogen
-Any whole genome FASTA (.fa)
-Any GTF genome annotation (.gtf)
FASTA/GTF files can be downloaded from GENCODE.
Alternatively, some recent (as of 2019) hg38 releases can be retrieved using:
> wget ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.alt.fa.gz
> wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.basic.annotation.gtf.gz
-Bedtools
-Java
> cd path/to/Spliceogen
> ./RUN.sh -input path/to/variant/file(s) -fasta path/to/hgXX.fa -gtf path/to/annotation.gtf
Example VCF, TSV, GTF and FASTA files are provided to demonstrate input and output formats. Run this small example using the following command:
> ./RUN.sh -input toy/toy.vcf -gtf toy/toy.gtf -fasta toy/toy.fa
As an alternative to VCF, a custom tab-separated format is allowed (chr start ref alt). Gzipped GTF/VCF/TSV files are accepted.
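As an illustration of the tab-separated format (chr start ref alt), the following minimal sketch converts VCF data lines into such rows. It assumes that the TSV carries no header line and that multi-allelic ALT fields should be split into one row per allele; neither detail is stated above, and the function name `vcf_line_to_tsv_rows` is hypothetical.

```python
def vcf_line_to_tsv_rows(line):
    """Convert one VCF line into 'chr<TAB>start<TAB>ref<TAB>alt' rows.

    Header and blank lines yield an empty list. Splitting multi-allelic
    ALT values into separate rows is an assumption of this sketch.
    """
    if not line.strip() or line.startswith("#"):
        return []
    fields = line.rstrip("\n").split("\t")
    # Standard VCF column order: CHROM, POS, ID, REF, ALT, ...
    chrom, pos, ref, alts = fields[0], fields[1], fields[3], fields[4]
    return ["\t".join((chrom, pos, ref, alt)) for alt in alts.split(",")]


if __name__ == "__main__":
    example = "chr1\t100\t.\tA\tG,T\t.\t.\t.\n"
    for row in vcf_line_to_tsv_rows(example):
        print(row)
```

Converting is optional, since VCF input is accepted directly; the sketch only makes the four-column layout concrete.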
Spliceogen is highly scalable. Predictions are generated at a rate of 2.3 million variants per compute hour, with peak memory usage below 500 MB. Benchmarking was performed using a single compute node with 1 CPU allocated.
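For planning purposes, the benchmark figure can be turned into a rough runtime estimate. The sketch below assumes that throughput scales linearly with the number of variants and that your hardware is comparable to the benchmark node, which may not hold in practice; the function name is hypothetical.

```python
# Benchmark rate quoted above: 2.3 million variants per compute hour (1 CPU).
VARIANTS_PER_COMPUTE_HOUR = 2_300_000


def estimated_compute_hours(n_variants, rate=VARIANTS_PER_COMPUTE_HOUR):
    """Rough compute-hour estimate, assuming linear scaling with variant count."""
    if n_variants < 0:
        raise ValueError("n_variants must be non-negative")
    return n_variants / rate


if __name__ == "__main__":
    # 4.6 million variants at the benchmark rate
    print(estimated_compute_hours(4_600_000))
```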
We developed logistic regression models for each of the following classes of splice-altering variants: donor loss, acceptor loss, donor gain and acceptor gain. Using these models, we derive probability values which are used to rank variants by the likelihood that they will cause each kind of splicing alteration. Variants outside of splice sites are assigned donor and acceptor gain scores only, while variants within donor/acceptor splice sites are assigned only donor/acceptor loss scores.
Note that these probability values are used for ranking only; they should not be interpreted as the actual probability of splice alteration. Similarly, the scores should not be compared between different classes (gain/loss etc).
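To make the ranking idea concrete, here is a minimal sketch of how a logistic regression model maps a weighted sum of features to a value between 0 and 1, which is then used only to order variants within one class. The feature names, intercept and weights are invented for illustration; they are not Spliceogen's fitted coefficients, which are not listed in this README.

```python
import math


def logistic(z):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, for illustration only.
INTERCEPT = -2.0
WEIGHTS = {"mes_delta": 0.8, "gs_delta": 0.3}


def gain_probability(features):
    """Logistic regression value for one variant in a single class."""
    z = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return logistic(z)


# Hypothetical variants with made-up feature values.
variants = {
    "varA": {"mes_delta": 4.0, "gs_delta": 1.0},
    "varB": {"mes_delta": 0.5, "gs_delta": 0.0},
    "varC": {"mes_delta": 2.0, "gs_delta": 3.0},
}

# Rank within one class only; values from different classes are not comparable.
ranked = sorted(variants, key=lambda v: gain_probability(variants[v]), reverse=True)

if __name__ == "__main__":
    for name in ranked:
        print(name, round(gain_probability(variants[name]), 3))
```

Because the logistic function is monotonic, the ordering depends only on the weighted sum, which is consistent with using the values for ranking rather than as calibrated probabilities.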
For each input VCF/TSV, multiple output files are created in the Spliceogen/output directory. A master "_out" file contains all scores for all variants, in a format suitable for ANNOVAR [6] variant annotation. Several additional files show predictions for the variants identified as most likely to be disruptive, ranked in descending order. The specific files generated are as follows:
-"$file"_out.txt: Contains all scores generated for every variant, sorted in standard ascending chromosomal/start order.
-"$file"_withinSS.txt: Contains all variants that overlap annotated splice sites. The overlapping splice sites are denoted by their exonID and "_donor" or "_acceptor". Variants are sorted by the maximum of donLossP and accLossP, such that variants most likely to disrupt acceptor/donor splice sites appear at the top of this file.
-"$file"_ssGain.txt: Contains variants outside of existing splice sites that are predicted to create donor or acceptor motifs. Variants are sorted by the maximum of donGainP and accGainP, such that variants most likely to create acceptor/donor splice sites appear at the top of this file.
-"$file"_bpOutput.txt: Contains Branchpointer prediction scores, including whether the variant is predicted to create or remove a branchpoint, based on the recommended Branchpointer thresholds.
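The sort order described for the withinSS and ssGain files can be sketched as follows. This is an illustration of ranking by the maximum of two columns, not Spliceogen's own code; it assumes each row has already been parsed into a dictionary keyed by column name, that both columns hold numeric values, and the `id` field and `rank_by_max` helper are hypothetical.

```python
def rank_by_max(rows, col_a, col_b):
    """Sort rows in descending order of max(row[col_a], row[col_b])."""
    def key(row):
        return max(float(row[col_a]), float(row[col_b]))
    return sorted(rows, key=key, reverse=True)


# Made-up rows for illustration; real files contain many more columns.
rows = [
    {"id": "v1", "donLossP": "0.10", "accLossP": "0.70"},
    {"id": "v2", "donLossP": "0.95", "accLossP": "0.05"},
    {"id": "v3", "donLossP": "0.30", "accLossP": "0.20"},
]

ranked_ids = [r["id"] for r in rank_by_max(rows, "donLossP", "accLossP")]

if __name__ == "__main__":
    print(ranked_ids)
```

The same helper applies to gain scores by passing "donGainP" and "accGainP" instead.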
The following abbreviations are used in the output headers:
donGainP = donor creation logistic regression probability value
accGainP = acceptor creation logistic regression probability value
donLossP = donor disruption logistic regression probability value
accLossP = acceptor disruption logistic regression probability value
withinSS = within splice site
don = Donor
acc = Acceptor
ref = Reference allele
alt = Alternative allele
mes = MaxEntScan
gs = GeneSplicer
ESS = exonic splicing silencer (ESRseq score)
ESE = exonic splicing enhancer (ESRseq score)
So for example, the column "gsDonRef" contains GeneSplicer scores representing donor motif strength for the reference sequence, whereas "mesDonAlt" contains MaxEntScan scores representing donor motif strength for the alternative sequence.
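The naming scheme can be decoded mechanically. The sketch below assumes that score columns are built as tool prefix + site + allele (for example "gsDonRef"), using the abbreviations listed above; it deliberately does not handle other headers such as donGainP, withinSS, ESS or ESE, and the `decode_column` helper is hypothetical.

```python
TOOLS = {"mes": "MaxEntScan", "gs": "GeneSplicer"}
SITES = {"Don": "donor", "Acc": "acceptor"}
ALLELES = {"Ref": "reference", "Alt": "alternative"}


def decode_column(name):
    """Split a header such as 'gsDonRef' into (tool, site, allele)."""
    for tool_key, tool in TOOLS.items():
        if name.startswith(tool_key):
            rest = name[len(tool_key):]
            # Site and allele abbreviations are three characters each.
            site_key, allele_key = rest[:3], rest[3:]
            if site_key in SITES and allele_key in ALLELES:
                return tool, SITES[site_key], ALLELES[allele_key]
    raise ValueError(f"unrecognised column name: {name!r}")


if __name__ == "__main__":
    print(decode_column("gsDonRef"))
```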
To include Branchpointer predictions, include the -branchpointer flag and specify the genome build:
*basic usage command* -branchpointer hgXX
Or, for branchpointer_dev, which handles both SNVs and indels, use the flag -branchpointerIndels hgXX instead.
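When Spliceogen is launched from a script, the command line can be assembled programmatically. The sketch below only builds the argument list from the flags documented above (-input, -fasta, -gtf, -branchpointer, -branchpointerIndels); the helper name `build_run_command` and its parameters are hypothetical, and it assumes that flag order does not matter.

```python
def build_run_command(input_path, fasta_path, gtf_path,
                      branchpointer_build=None, indels=False):
    """Return the RUN.sh argument list; nothing is executed here."""
    cmd = ["./RUN.sh", "-input", input_path,
           "-fasta", fasta_path, "-gtf", gtf_path]
    if branchpointer_build is not None:
        # -branchpointerIndels selects the branchpointer_dev behaviour.
        flag = "-branchpointerIndels" if indels else "-branchpointer"
        cmd += [flag, branchpointer_build]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_run_command("toy/toy.vcf", "toy/toy.fa",
                                     "toy/toy.gtf", "hg38")))
```

The resulting list could then be passed to `subprocess.run(cmd, cwd="path/to/Spliceogen", check=True)` from the standard library, with the working directory set to the cloned repository.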
-R (tested on v3.4.3)
-Branchpointer
-BSgenome
The current Bioconductor release of Branchpointer supports SNV predictions. To install it from an R prompt:
> source("https://bioconductor.org/biocLite.R")
> biocLite("branchpointer")
The development version of Branchpointer also supports indels. To install this version instead:
> library(devtools)
> install_github("betsig/branchpointer_dev")
From an R prompt, install the hg38 BSgenome package using the commands below. For hg19, replace "hg38" with "hg19" in the second line.
> source("https://bioconductor.org/biocLite.R")
> biocLite("BSgenome.Hsapiens.UCSC.hg38")
1. Yeo, G., Burge, C., "Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals", J Comput Biol. 2004; 11(2-3):377-94
2. Pertea, M., Lin, X., Salzberg, S., "GeneSplicer: a new computational method for splice site prediction", Nucleic Acids Res. 2001; 29(5):1185-90
3. Ke, S., et al., "Quantitative evaluation of all hexamers as exonic splicing elements", Genome Res. 2011; 21(8):1360-1374
4. Signal, B., et al., "Machine learning annotation of human branchpoints", Bioinformatics. 2018; 34(6):920-927
5. Shiraishi, Y., et al., "A comprehensive characterization of cis-acting splicing-associated variants in human cancer", Genome Res. 2018; 28(8):1111-1125