Skip to content
/ srGASV Public

Structural variant detection using split reads, combined with Geometric Analysis of Structural Variants

Notifications You must be signed in to change notification settings

dstorch/srGASV

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

##################################################
#  srGASV - Split Read GASV
#
# Split read analysis using the output of
# geometric analysis for structural variants.
#
# Author:
#   David Storch
#   [email protected]
#
# Contents:
#   1. Usage
#   2. Building
#   3. Parameters
#   4. Output format
#   5. Dependencies
#   6. High-throughput usage
#
##################################################

##################################################
1. Usage
##################################################

srGASV <GASV.out> <mapped.bam> <unmapped.bam> <reference.fasta>

There are three required arguments, as shown above.
The first is a tab-delimited plaintext file with
the following columns:
                                        Example
    1. Cluster Name                     c622_-22.4248_-696.71	
	2. Number of Supporting Fragments:  11	
	3. Localization:                    99.2	
	4. Variant Type:                    D	
	5. Supporting Fragments:            IL11_266:7:129:793:665, IL22_307:8:274:923:595, SRR010933.5337657, etc.
	6. LeftChromosome:                  1
	7. Right Chromosome:                1	
	8. Coordinates in Polygon:          34874107, 34884613, 34874008, 34884613, 34873958, 34884563, etc.

The second input file is a BAM file containing
the well-mapped reads. srGASV will scan this file looking for
the mates of candidate split reads. This file must be indexed,
i.e. it must have a corresponding .bam.bai file
in the same directory.

The third input file is a BAM file containing all
of the reads that are unmapped or of low mapping quality.
Once the anchor reads have been identified from the
file of mapped reads, srGASV will search this file
for the mates that could be split reads.

The final input file is the reference sequence to
which the reads were aligned, in fasta format. The
fasta files should first be indexed using "samtools faidx".
srGASV will extract windows of sequence from this file,
and then attempt to align candidate split reads to
this window.   

##################################################
2. Building
##################################################

Before running srGASV as described above, the Java
code must be compiled using the Apache Ant build tool.
If Ant is installed, then type "ant" when working
in the srGASV root directory.

See the dependencies section for third party
software required for srGASV.

##################################################
3. Parameters
##################################################

Parameters are set by editing the srGASV bash script
before running.

LMIN - minimum fragment length
  Along with LMAX, this determines the region where
  we look for anchored reads whose mate might be
  a split read.

LMAX - maximum fragment length
  Along with LMIN, this determines the region where
  we look for anchored reads whose mate might be
  a split read.

MIN_MAPQ - minimum mapping quality
  Reads are only taken as anchors if their mapping
  quality is greater than or equal to this threshold.

DELTA_WINDOW - increase in GASV prediction window size
  If a predicted GASV breakpoint region is [a, b],
  then srGASV will align potential split reads to the
  window [a - DELTA_WINDOW, b + DELTA_WINDOW]. Set to
  zero to use the GASV regions.

MAX_ALIGNMENT_DIST - only output reads with alignment
  distances lower than this threshold

CHR_PREFIX - naming convention for chromosomes
  If chromosome 17 is named as "17", set this
  parameter to "default". If the convention is
  "chr17", then set this parameter to "chr".

OUTPUT_FORMAT - set to either "verbose", "tabular", or "concise"
  The tabular setting gives the output format described
  in the section below. The verbose format gives more
  information, but is not organized into a column-based
  format. The concise format outputs a single line per
  cluster instead of a line per read. This line gives the
  majority vote breakpoints for the cluster, as well as
  the number of split reads supporting each breakpoint.

##################################################
4. Output format
##################################################

The tabular output format consists of the following
tab-delimited fields. There is one line per candidate
split-read.

1. GASV cluster name
2. Read name extracted from the BAM file
3. GASV regions for the cluster
   The regions are in format <chr1>:a-b, <chr2>:x-y
   That is, region 1 is the window [a, b] on <chr1>,
   and region 2 is the window [x, y] on <chr2>.
4. Breakpoints implied by the alignment
   The format for this field is <chr1>:b1, <chr2>:b2.
   The breakpoints b1 and b2 are on chromosomes <chr1>
   and <chr2> respectively.
5. Breakpoint polygon
   This boolean field is true if the detected breakpoints
   fall within the breakpoint polygon, and is false
   otherwise.
6. Alignment score
   The default scoring is a penalty of 1 for both mismatches
   and gaps.
7. Sequence of the split read
   A pipe ("|") is used to denote the split.

The concise output format has one line per GASV input
cluster. Each line contains the following tab-delimited fields:

1. GASV cluster name
2. GASV regions
3. consensus breakpoint 1
4. consensus breakpoint 2
5. number of reads supporting bp1
6. number of reads supporting bp2

##################################################
5. Dependencies
##################################################

Samtools
  srGASV expects to find samtools in lib/samtools.
  Please provide a sym link in this directory to
  your build of samtools.

Picard
  The ant build expects to find lib/sam-1.56.jar.
  The jar file should be checked out with the
  rest of the source code.
 
SQLite JDBC driver
  This is necessary for running srGASV in a high-
  throughput way. The data from an unmapped BAM
  file is stored in an indexed sqlite database table.
  Then, unmapped reads are retrieved via JDBC during
  the split-read alignment process. More on running
  srGASV using a SQLite database can be found in
  the next section. The jar file for the driver
  is packaged with the srGASV source code.

##################################################
6. High-throughput usage
##################################################

Given as input the unmapped reads file unmapped.bam,
srGASV checks to see if the file unmapped.bam.db exists.
If so, it assumes that it is a SQLite database containing
the following table:

CREATE TABLE bamfile (name varchar(50), sequence varchar(100) );

The table should be indexed by read name using the
following SQL command:

CREATE INDEX bamindex on bamfile(name);

In order to populate the database table, the utility
script src/script/makedb.py is provided. See the
comments in this script for usage.


About

Structural variant detection using split reads, combined with Geometric Analysis of Structural Variants

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published