GitHub - dstorch/srGASV: Structural variant detection using split reads, combined with Geometric Analysis of Structural Variants

dstorch / srGASV Public

Notifications You must be signed in to change notification settings
Fork 0
Star 1

Structural variant detection using split reads, combined with Geometric Analysis of Structural Variants

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
lib		lib
src		src
README		README
build.xml		build.xml
srGASV		srGASV

Repository files navigation

##################################################
# srGASV - Split Read GASV
#
# Split read analysis using the output of
# geometric analysis for structural variants.
#
# Author:
# David Storch
# [email protected]
#
# Contents:
# 1. Usage
# 2. Building
# 3. Parameters
# 4. Output format
# 5. Dependencies
# 6. High-throughput usage
#
##################################################

##################################################
1. Usage
##################################################

srGASV <GASV.out> <mapped.bam> <unmapped.bam> <reference.fasta>

There are three required arguments, as shown above.
The first is a tab-delimited plaintext file with
the following columns:
Example
1. Cluster Name c622_-22.4248_-696.71
2. Number of Supporting Fragments: 11
3. Localization: 99.2
4. Variant Type: D
5. Supporting Fragments: IL11_266:7:129:793:665, IL22_307:8:274:923:595, SRR010933.5337657, etc.
6. LeftChromosome: 1
7. Right Chromosome: 1
8. Coordinates in Polygon: 34874107, 34884613, 34874008, 34884613, 34873958, 34884563, etc.

The second input file is a BAM file containing
the well-mapped reads. srGASV will scan this file looking for
the mates of candidate split reads. This file must be indexed,
i.e. it must have a corresponding .bam.bai file
in the same directory.

The third input file is a BAM file containing all
of the reads that are unmapped or of low mapping quality.
Once the anchor reads have been identified from the
file of mapped reads, srGASV will search this file
for the mates that could be split reads.

The final input file is the reference sequence to
which the reads were aligned, in fasta format. The
fasta files should first be indexed using "samtools faidx".
srGASV will extract windows of sequence from this file,
and then attempt to align candidate split reads to
this window.

##################################################
2. Building
##################################################

Before running srGASV as described above, the Java
code must be compiled using the Apache Ant build tool.
If Ant is installed, then type "ant" when working
in the srGASV root directory.

See the dependencies section for third party
software required for srGASV.

##################################################
3. Parameters
##################################################

Parameters are set by editing the srGASV bash script
before running.

LMIN - minimum fragment length
Along with LMAX, this determines the region where
we look for anchored reads whose mate might be
a split read.

LMAX - maximum fragment length
Along with LMIN, this determines the region where
we look for anchored reads whose mate might be
a split read.

MIN_MAPQ - minimum mapping quality
Reads are only taken as anchors if their mapping
quality is greater than or equal to this threshold.

DELTA_WINDOW - increase in GASV prediction window size
If a predicted GASV breakpoint region is [a, b],
then srGASV will align potential split reads to the
window [a - DELTA_WINDOW, b + DELTA_WINDOW]. Set to
zero to use the GASV regions.

MAX_ALIGNMENT_DIST - only output reads with alignment
distances lower than this threshold

CHR_PREFIX - naming convention for chromosomes
If chromosome 17 is named as "17", set this
parameter to "default". If the convention is
"chr17", then set this parameter to "chr".

OUTPUT_FORMAT - set to either "verbose", "tabular", or "concise"
The tabular setting gives the output format described
in the section below. The verbose format gives more
information, but is not organized into a column-based
format. The concise format outputs a single line per
cluster instead of a line per read. This line gives the
majority vote breakpoints for the cluster, as well as
the number of split reads supporting each breakpoint.

##################################################
4. Output format
##################################################

The tabular output format consists of the following
tab-delimited fields. There is one line per candidate
split-read.

1. GASV cluster name
2. Read name extracted from the BAM file
3. GASV regions for the cluster
The regions are in format <chr1>:a-b, <chr2>:x-y
That is, region 1 is the window [a, b] on <chr1>,
and region 2 is the window [x, y] on <chr2>.
4. Breakpoints implied by the alignment
The format for this field is <chr1>:b1, <chr2>:b2.
The breakpoints b1 and b2 are on chromosomes <chr1>
and <chr2> respectively.
5. Breakpoint polygon
This boolean field is true if the detected breakpoints
fall within the breakpoint polygon, and is false
otherwise.
6. Alignment score
The default scoring is a penalty of 1 for both mismatches
and gaps.
7. Sequence of the split read
A pipe ("|") is used to denote the split.

The concise output format has one line per GASV input
cluster. Each line contains the following tab-delimited fields:

1. GASV cluster name
2. GASV regions
3. consensus breakpoint 1
4. consensus breakpoint 2
5. number of reads supporting bp1
6. number of reads supporting bp2

##################################################
5. Dependencies
##################################################

Samtools
srGASV expects to find samtools in lib/samtools.
Please provide a sym link in this directory to
your build of samtools.

Picard
The ant build expects to find lib/sam-1.56.jar.
The jar file should be checked out with the
rest of the source code.

SQLite JDBC driver
This is necessary for running srGASV in a high-
throughput way. The data from an unmapped BAM
file is stored in an indexed sqlite database table.
Then, unmapped reads are retrieved via JDBC during
the split-read alignment process. More on running
srGASV using a SQLite database can be found in
the next section. The jar file for the driver
is packaged with the srGASV source code.

##################################################
6. High-throughput usage
##################################################

Given as input the unmapped reads file unmapped.bam,
srGASV checks to see if the file unmapped.bam.db exists.
If so, it assumes that it is a SQLite database containing
the following table:

CREATE TABLE bamfile (name varchar(50), sequence varchar(100) );

The table should be indexed by read name using the
following SQL command:

CREATE INDEX bamindex on bamfile(name);

In order to populate the database table, the utility
script src/script/makedb.py is provided. See the
comments in this script for usage.