Skip to content

Scylla 5.2.5 Design Document

tamsen edited this page Mar 23, 2018 · 10 revisions

Overview

Scylla rapidly detects multiple nucleotide variants (MNVs) in a given sample.

Small variants already detected in the sample as SNPs and indels are input to Scylla as a gVCF. Scylla 5.2.5+ can also phase MNV variants in the input vcf. Scylla uses the original BAM to determine which of these small variants should be phased together into longer MNVs. Unlike Pisces, Scylla does not require that variants be on the same read to be phased. Once the phasing is complete, a new gVCF is generated.

At a high level, Scylla identifies variants that are candidates for phasing in the input gVCF, and arranges the variants into local neighborhoods. Scylla then mines the sample BAM file for any evidence that these small variants occur in the same clonal sub-populations with each other (ie, in phase with each other). This is done by clustering overlapping reads in the neighborhood into a minimal set of clusters which contain the same variants.

Variants are detected by examining the CIGAR strings in the BAM file and comparing read sequences to the reference sequence. Scylla does not perform any re-alignment as this was deemed unnecessary and a duplication of functionality that belongs in the aligner. Because of this approach, Scylla is dependent on the preferences of the upstream aligner and variant caller in representing complex variants.

Each candidate MNV that is detected is processed and output as an independent call. By default, similar to Pisces, each variant will be output on a separate gVCF line. In other words, multi-allelic variants, such as A->T/C, will be represented as two separate variants, A->T and A->C, on separate lines, but for the same position. They will be genotyped independently. If "crucshvcf" output format is selected, the results will be output in standard gvcf format, with one line per genomic loci. Scylla supports two ploidy choices: somatic or diploid. If diploid is selected, the user should choose the "crushvcf" formatted output. Output format details are in the Pisces vcf spec document.

Glossary

Pisces Glossary

Configuration

Scylla supports configuration of parameters so that its behavior can be fine tuned depending on the application context.

Format: dotnet Scylla.dll [-options]

Example: dotnet Scylla.dll -bam NA12877PG4L_S1.bam -vcf input.genome.vcf -out \myoutfolder

SDS ID Specification
SDS-1 Scylla shall accept command line arguments as a whitespace-separated list of name and value pairs.
SDS-2 If an invalid command is given, Scylla shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands.
SDS-3 Scylla command line shall be capitalization invariant.
SDS ID Specification
SDS-4 Scylla shall require the command line arguments listed below:
Argument Name Type Default value Description
bam string none File path for input bam
vcf string none File path for input vcf
SDS ID Specification
SDS-6 Scylla shall accept the optional variant-calling command line arguments listed below. These options should match the input used in the original gVCF creation, so the new MNVs are filtered in the same manner as the input variants.
Argument Name Type Default value Description
ploidy string somatic "somatic" or "diploid". Describes the mode for genotyping variants (use crushvcf with diploid mode)
crushvcf bool false mode for outputting vcf. (use crushvcf true with diploid mode)
dist integer 50 How close variants need to be to chain together into a neighborhood. The value should be less than read length. For somatic, we typically use 50, and 5 for diploid (because germline users prefer not to see long chains of MNVs)
b,minbasecallquality integer 20 Bases in a reads must have a basecall quality that is greater than or equal to this number, to be used in phasing.
filterduplicates bool true Filters duplicate reads
m,minmapquality integer 1 Reads must has a map quality that is greater than or equal to this number, to be used in phasing.
minvariantfrequencyfilter double 0.01 true
minimumfrequency double 0.01 Variants with a frequency lower than this value shall be marked as nocalls. "./." . (this will be changed to a true omit in the next release)
minvariantqscore integer 30 true
variantqualityfilter integer 30 Variants with a quality score lower than this value shall be marked as filtered, with the "LowQ" tag.

| SDS-7 |Scylla shall accept the optional command line arguments listed below, for development and trouble shooting use use only. These are un-validated parameters.

Argument Name Type Default value Description
passingvariantsonly bool true Only include passing variants in phasing.
allowclustermerging bool true Allow clusters of variants to merge, once a read has been discovered that can belong in both sets without contradicting either one.
allowworstfitremoval bool true Allow clusters to prune reads that are a poor fit for the cluster, if they are discovered to have become a better fit for another cluster.
debug bool false Adds more print statements to log and slows down performance.
maxnbhdstoprocess integer -1 Stop processing after a certain number of neighborhoods have been discovered.
t integer 20 Max number of threads to use
chr comma separated string list {empty} Debug option to limit processing to the given set of chromosomes.
passingvariantsonly bool true Only use passing variants for phasing
hetvariantsonly bool true Only use het variants for phasing
SDS ID Specification
SDS-8 If an unknown command line argument is encountered, Scylla shall exit with an error message describing the unknown argument.

Input

Scylla requires as input one BAM file and one gVCF file. The BAM file is assumed to be sorted, i.e. alignment reads should be sorted by mapped reference position. This assumption allows SVC to process BAMs in one pass without a large memory footprint. Most standard aligners produce sorted BAM files. Positions in BAM files are expected to use 0-based coordinate system. Similarly, the gVCF file should also be sorted. The gVCF file should be formatted such that each variant allele has its own line in the gVCF. file. Pisces output has this format by default.

For performance, Scylla requires each BAM file to have an accompanying BAI index file. The index file allows Scylla to efficiently jump to chromosomes it is configured to call.

SDS ID Specification
SDS-9 Scylla shall require one BAM file as input.
SDS-10 If no BAM file is specified, Scylla shall exit with an error message.
SDS-11 Scylla shall require an accompanying BAI file with the same file name in the same directory as the given BAM file.
SDS-12 Scylla shall require one gVCF file as input.
SDS-13 If no gVCF file is specified, Scylla shall exit with an error message.

Output

Scylla outputs allele calls in the form of a somatic-formatted gVCF file, with the same convention and structure as described in the Pisces SDS.

Design

The Scylla application is essentially a pipeline program, with the flow shown in the diagram below.

Neighborhood Creation

The input gVCF is parsed for sets of passing, heterozygous variants within 50 bases of each other (although these exact parameters are configurable). Each set is called a neighborhood. In the first box in the diagram above, we depict 6 variants found in the gVCF that fall into 3 distinct neighborhoods. Since neighborhoods by definition do not overlap, they are phased in parallel, and the program threads join back together at gVCF writing time.

It is OK if variants are co-located at the same site. These will naturally fall into separate clusters within a neighborhood, because the same read will never contain to different variants in the same place.

Read Extraction

For each neighborhood, all the reads that overlap its variant sites are extracted from the BAM. Each read is informatically compressed, so we keep only the bases at sites of interest.

We call theses variant-compressed reads 'veads'. After compression, many reads that used to be slightly different, for example at positions we are not interested in, now look identical across the positions of interest. This effectively reduces noise and bandwidth.

VeadGroup Creation

Since many veads are identical across the positions of interest, we collapse them into a singe vead with a multiplier. This is called a veadgroup. At this point, 1000 different reads originally in a neighborhood may have collapsed down to two distinct veadgroups, and the clustering problem has become trivial.

Clustering

When clustering is required, veadgroups are clustered with a greedy clustering algorithm. By default, all veadgroups that agree (have at least one call in common with each other and zero disagreeing calls) are placed in a unique cluster. (Note, just because veadgroups agree, they are not necessarily identical, because not all reads will have covered all variants in the neighborhood.) If it turns out that one veadgroup can fit in multiple clusters, that veadgroup may be used to merge to clusters. Merging may only proceed if the the clusters may be merged without inducing any disagreements. As clusters evolve, worst-fitting veadgroups are tested to see if they have a better fit elsewhere. If they do, they are either moved or used to join clusters.

When the clusters have stabilized, or the max number of iterations have been exceeded, clustering is complete. Clustering is generally rapidly convergent, and so the max number of iterations is currently set at 10.

Variant Phasing

Each cluster may be be mapped back to its supporting reads, with a common set of calls (reference or particular variant) at the sites of interest. For each cluster, these common calls can be converted into a consensus allele. In this way, every neighborhood has been translated into a set of clusters, and each cluster represents a unique MNV (or, if no variants were present, reference allele) within that neighborhood.

Variant Calling

Each cluster may be be mapped back to its supporting reads, with a common set of calls (reference or particular variant) at the sites of interest. For each cluster, these common calls can be converted into a consensus allele. In this way, every neighborhood has been translated into a set of clusters, and each cluster represents a unique MNV (or, if no variants were present, reference allele) within that neighborhood.

Variant Calling proceeds as follows:

(1) For each neighborhood, a list of candidate MNVs are extracted (one per cluster) and processed.

  • The processing calculates the variant quality score, genotype, and genotype quality score. These calculations are the same as described in the Pisces SDS, and use the same software library.
  • Scylla somatic genotyping is exactly as described in the Pisces SDS. For example, given two low-frequency variants both at position 10, A->T and A→C, Scylla will return the 0/1 genotype for both.
  • Scylla germline genotyping uses the same algorithms as Pisces germline, but there is a critical difference. In germline mode, Scylla genotypes an entire neighborhood together, even if the input variants are technically at different loci. (Pisces does germline genotpying, grouping by a single loci).

For example, given two 50% frequency variants both at position 10, A->T and A→C, Scylla will return the 1/2 genotype for both.

Furthermore, given a 50% frequency variant at position 10, A→T, and an additional 50% frequency variant at position 12, A→C, there are two possibilities:

If the variants are in phase, Scylla will return the 0/1 genotype, and return an MNV at position 10, ARRA→TRRC. (R being what ever the reference bases are)

If the variants are not in phase, Scylla will return the 1/2 genotype for both, and return two MNV's at position 10, A,RRRA→T,ARRC. (R being what ever the reference bases are). The VF reported for position 10 would be 50+50=100%.

This type of reporting is especially helpful when there are indels on one allele, which extend through different indels on the other allele.

(2) If a candidate MNV passes the omit thresholds (in terms of depth, variant quality and frequency) it becomes a called variant. If the candidate fails to meet the omit thresholds it becomes a no call with the gentoype "./."

  • Question: Why output no-calls for MNVs failing omit filters? Why not just return the original variants in the input gVCF?
  • Answer: The bases that support that original variant would have been sucked up into upstream MNVs; they are already used up. Ie, the original variant no longer exists. What remains at that loci is a call that doesn’t pass emit filters anymore. (it could be re-spun as a candidate ref call, but for the user I thought it might be nice to keep the traceability, so they can see what happened to the original variant, instead of it vanishing into a new ref call. Next sprint, I might revisit that gentleness.)

(3) Called variants are further filtered (in terms of depth filter, variant quality filter and frequency filter thresholds). This determines whether a variant will be marked PASS or filtered.

(4) The final results are written to gVCF.

VCF Generation

Scylla reads in the original gVCF, replacing all the phase-able variants that were in the neighborhoods with the newly minted MNVs. It may be the case that some MNVs in the original vcf are now reference or no calls. It may also be the case that new variants at new positions are inserted.

The output gvcf may be somatic mode (un-crushed) or germline mode (crushed). This follows the Pisces convention: Pisces VCF Specifications 5.1.6.x .,. Pisces and Scylla use the same vcf-writing library.

In uncrushed mode, each detected allele is written to its own line. There may be multiple vcf lines per genomic position. In crushed mode, there is one line per neighborhood.

Example somatic/uncrushed gVCF output:

chr6 31084942 . GCTT G 100 PASS DP=532 GT:GQ:AD:DP:VF:NL:SB 1/2:100:262:532:0.49:20:-100.0000
chr6 31084943 . CTT C 100 PASS DP=532 GT:GQ:AD:DP:VF:NL:SB 1/2:100:262:532:0.49:20:-100.0000
chr6 31084944 . T . 8 q30 DP=532 GT:GQ:AD:DP:VF:NL:SB ./.:0:8:532:0.98:20:-6.1795
chr6 31084945 . T . 8 q30 DP=532 GT:GQ:AD:DP:VF:NL:SB ./.:0:8:532:0.98:20:-6.1795
chr6 31084946 . C . 100 PASS DP=532 GT:GQ:AD:DP:VF:NL:SB 0/0:100:532:532:0.00:20:-100.0000

Example diploid/crushed gVCF output:

chr6 31084942 . GCTT G,GC 100 PASS DP=532 GT:GQ:AD:DP:VF:NL:SB 1/2:100:262,262:532:0.98:20:-100.0000
chr6 31084943 . C . 0 q30 DP=532 GT:GQ:AD:DP:VF:NL:SB ./.:0:8:532:0.98:20:-100.0000
chr6 31084944 . T . 8 q30 DP=532 GT:GQ:AD:DP:VF:NL:SB ./.:0:8:532:0.98:20:-6.1795
chr6 31084945 . T . 8 q30 DP=532 GT:GQ:AD:DP:VF:NL:SB ./.:0:8:532:0.98:20:-6.1795
chr6 31084946 . C . 100 PASS DP=532 GT:GQ:AD:DP:VF:NL:SB 0/0:100:532:532:0.00:20:-100.0000

Limitations

Variants are detected by examining the CIGAR strings in the BAM file and comparing read sequences to the reference sequence. Scylla does not perform any re-alignment as this was deemed unnecessary and a duplication of functionality that belongs in the aligner. Because of this approach, Scylla is dependent on the preferences of the upstream aligner and variant caller in representing complex variants.

Scylla cannot discover any variants that were not in the original gVCF

Scylla cannot do the job of an indel realigner. For example, a common error mode of aligners is that true indels at the end of a read get incorrectly converted to false positive SNPs. (For example, a 2 base insertion, A-> ACG, where the true (sequenced) allele is A[CG]AGAG and the reference might be AAGAG ) gets incorrectly classified by the aligner as C->A SNP.) This is an example of a case Scylla cannot rescue, because SNP fed into Scylla is itself in error.

General

5.2.10

5.2.9

5.2.7

5.2.5

5.2.0

5.1.6

5.1.3

Clone this wiki locally