-
Notifications
You must be signed in to change notification settings - Fork 0
dstorch/srGASV
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
################################################## # srGASV - Split Read GASV # # Split read analysis using the output of # geometric analysis for structural variants. # # Author: # David Storch # [email protected] # # Contents: # 1. Usage # 2. Building # 3. Parameters # 4. Output format # 5. Dependencies # 6. High-throughput usage # ################################################## ################################################## 1. Usage ################################################## srGASV <GASV.out> <mapped.bam> <unmapped.bam> <reference.fasta> There are three required arguments, as shown above. The first is a tab-delimited plaintext file with the following columns: Example 1. Cluster Name c622_-22.4248_-696.71 2. Number of Supporting Fragments: 11 3. Localization: 99.2 4. Variant Type: D 5. Supporting Fragments: IL11_266:7:129:793:665, IL22_307:8:274:923:595, SRR010933.5337657, etc. 6. LeftChromosome: 1 7. Right Chromosome: 1 8. Coordinates in Polygon: 34874107, 34884613, 34874008, 34884613, 34873958, 34884563, etc. The second input file is a BAM file containing the well-mapped reads. srGASV will scan this file looking for the mates of candidate split reads. This file must be indexed, i.e. it must have a corresponding .bam.bai file in the same directory. The third input file is a BAM file containing all of the reads that are unmapped or of low mapping quality. Once the anchor reads have been identified from the file of mapped reads, srGASV will search this file for the mates that could be split reads. The final input file is the reference sequence to which the reads were aligned, in fasta format. The fasta files should first be indexed using "samtools faidx". srGASV will extract windows of sequence from this file, and then attempt to align candidate split reads to this window. ################################################## 2. Building ################################################## Before running srGASV as described above, the Java code must be compiled using the Apache Ant build tool. If Ant is installed, then type "ant" when working in the srGASV root directory. See the dependencies section for third party software required for srGASV. ################################################## 3. Parameters ################################################## Parameters are set by editing the srGASV bash script before running. LMIN - minimum fragment length Along with LMAX, this determines the region where we look for anchored reads whose mate might be a split read. LMAX - maximum fragment length Along with LMIN, this determines the region where we look for anchored reads whose mate might be a split read. MIN_MAPQ - minimum mapping quality Reads are only taken as anchors if their mapping quality is greater than or equal to this threshold. DELTA_WINDOW - increase in GASV prediction window size If a predicted GASV breakpoint region is [a, b], then srGASV will align potential split reads to the window [a - DELTA_WINDOW, b + DELTA_WINDOW]. Set to zero to use the GASV regions. MAX_ALIGNMENT_DIST - only output reads with alignment distances lower than this threshold CHR_PREFIX - naming convention for chromosomes If chromosome 17 is named as "17", set this parameter to "default". If the convention is "chr17", then set this parameter to "chr". OUTPUT_FORMAT - set to either "verbose", "tabular", or "concise" The tabular setting gives the output format described in the section below. The verbose format gives more information, but is not organized into a column-based format. The concise format outputs a single line per cluster instead of a line per read. This line gives the majority vote breakpoints for the cluster, as well as the number of split reads supporting each breakpoint. ################################################## 4. Output format ################################################## The tabular output format consists of the following tab-delimited fields. There is one line per candidate split-read. 1. GASV cluster name 2. Read name extracted from the BAM file 3. GASV regions for the cluster The regions are in format <chr1>:a-b, <chr2>:x-y That is, region 1 is the window [a, b] on <chr1>, and region 2 is the window [x, y] on <chr2>. 4. Breakpoints implied by the alignment The format for this field is <chr1>:b1, <chr2>:b2. The breakpoints b1 and b2 are on chromosomes <chr1> and <chr2> respectively. 5. Breakpoint polygon This boolean field is true if the detected breakpoints fall within the breakpoint polygon, and is false otherwise. 6. Alignment score The default scoring is a penalty of 1 for both mismatches and gaps. 7. Sequence of the split read A pipe ("|") is used to denote the split. The concise output format has one line per GASV input cluster. Each line contains the following tab-delimited fields: 1. GASV cluster name 2. GASV regions 3. consensus breakpoint 1 4. consensus breakpoint 2 5. number of reads supporting bp1 6. number of reads supporting bp2 ################################################## 5. Dependencies ################################################## Samtools srGASV expects to find samtools in lib/samtools. Please provide a sym link in this directory to your build of samtools. Picard The ant build expects to find lib/sam-1.56.jar. The jar file should be checked out with the rest of the source code. SQLite JDBC driver This is necessary for running srGASV in a high- throughput way. The data from an unmapped BAM file is stored in an indexed sqlite database table. Then, unmapped reads are retrieved via JDBC during the split-read alignment process. More on running srGASV using a SQLite database can be found in the next section. The jar file for the driver is packaged with the srGASV source code. ################################################## 6. High-throughput usage ################################################## Given as input the unmapped reads file unmapped.bam, srGASV checks to see if the file unmapped.bam.db exists. If so, it assumes that it is a SQLite database containing the following table: CREATE TABLE bamfile (name varchar(50), sequence varchar(100) ); The table should be indexed by read name using the following SQL command: CREATE INDEX bamindex on bamfile(name); In order to populate the database table, the utility script src/script/makedb.py is provided. See the comments in this script for usage.
About
Structural variant detection using split reads, combined with Geometric Analysis of Structural Variants
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published