Skip to content
tamsen edited this page Nov 23, 2016 · 33 revisions

Pisces is a small variant calling application, tuned for amplicons and enrichment panels. By default, it calls low frequency somatic variants, but it also supports germline calling. Pisces is a "software suite" that has several executables:

  • Stitcher - Stitches two paired-end reads together into a single read

  • Pisces - Calls small variants

  • Scylla - Detects multiple nucleotide variants (MNVs) in a given sample and phases the variants though complex regions

  • VariantQualityRecalibration - Recalibrates the variant quality scores (Q scores) if the particular variants are over represented

Introduction

Pisces is a variant calling application, tuned for amplicons and enrichment panels. By default, it calls low frequency somatic variants, but it also supports germline calling. Somatic variant calling is the effort to identify variants present at low frequency in the DNA sample. Somatic variant calling is important in the context of cancer treatment. Cancer is caused by an accumulation of mutations in DNA. A DNA sample from a tumor is generally heterogeneous, including some normal cells, some cells at an early stage of cancer progression (with fewer mutations), and some late-stage cells (with more mutations). Because of this heterogeneity, when sequencing a tumor (e.g. from an FFPE sample), somatic mutations will often appear at a low frequency. For example, a SNV might be seen in only 10% of the reads covering a given base.

Pisces, originally known as "the Illumina Somatic Variant Caller", is available on the MiSeq, and is specifically designed for variant calllng tumor-only (as opposed to tumor-normal paired samples). Example workflows that use Pisces are TSA (TruSeq Amplicon, or lately just called TruSeq Amplicon) and Amplicon DS. The assays target loci of interest to cancer researchers. MiSeq Reporter (MSR) processes the data and performs somatic variant calling instead of using a germline variant caller (such as GATK or starling).

Pisces identifies small variants, while filtering for mismatches due to sequencing or sample preparation errors. For each variant we note the reference and variant sequences (e.g. A to C SNV, or AG to A deletion), the frequency, and a q-score indicating our confidence that the variant is indeed present in the sample. Variants are reported in .vcf format (a standard tab-delimited format for storing variant calls). Variants can then be displayed in a variety of tools (including the MiSeq Reporter UI).

Note that patients will also have normal variants in their genome, which should be observed at 50% frequency (if heterozygous), or 100% (if homozygous). When paired tumor and normal DNA samples from the same individual are available, it is possible to do a tumor/normal analysis, filtering out the healthy variants to give a report of just the somatic variants. Illumina's tool Strelka is one tool for doing this analysis. Pisces is intended for cases where only a cancer sample is available; it does not handle Tumor/Normal paired samples. Although normal samples can also be run through the caller, and the differences inspected, this is not currently part of any workflow that uses Pisces.

The Amplicon-DS workflow uses paired samples with a forward/reverse paired probe pool design. Both samples are cancer samples. These samples have a complimentary library prep, such that FFPE artifacts occur in one sample and not the other. True variants show up in both pools. This consensus-building method screens out FFPE artifacts that often saturate old cancer samples.

Pisces processes every sample individually, and the paired information is handled by post processing.

Variant calling logic

Overview

For SNV calling, we consider each position in the reference genome separately. We start with the bases of the aligned reads (alignments from the MSR Amplicon Aligner, which is a banded Smith-Waterman aligner, and developed internally). We filter out any base calls below Q20. We don't bother calling any position with total coverage below 10.

These thresholds are configurable.

We consider any non-reference allele with frequency of 1% or greater. Based on the reference and mutant counts, we compute a q-score for the SNV based on a Poisson model, and filter the SNV if its q-score is below 20. If the q-score is between 20 and 30 we report the variant with a “q30” in the filter column of the .vcf. If the q-score is greater or equal to 30, then we give the variant a clean “PASS” in the filter column.

Indels are handled similarly: We look at how many of the alignments covering a given position include a particular indel (that is our variant count) versus the overall coverage of that position. Note that we do not perform some of the indel realignment or "indel cleaning" steps included in more sophisticated variant callers. Genotype calculation

When genotyping in the somatic context, Pisces uses "0/1" to denote that both reference and alternate alleles have been found at the minimum threshold of detection, be default 1%. Note this not the same a "0/1" genotype in a diploid context, where reference and alternate alleles have been found at roughly 50% each. For example, if a variant is found at 80% frequency and the reference is found at 5% frequency, and the Q scores are acceptable for both, this is considered a heterozygous somatic call, because both are present. Even though, intuitively, the user may consider this to be a homozygous variant, in the diploid context

Q-score calculation

Q-scores give the level of certainty that a variant is present. Higher q-scores represent more certainty. Our q-score calculation takes into account noise, coverage depth, and mutation count. Greater mutation count and coverage depth increases the q-score gets. Greater noise level decreases the q-score. (By default, we use a very simple noise model: Its always 1%, based on the base call quality threshold. The user can also opt to use a “Window” noise model. In that case, the noise level that goes into the q-score calculation will be based on the average q-score of the neighboring bases.) Technically our q-scores can range for 0 to infinity, but by default we cap them at 100.

Our q-scores are determined directly from the phred-scaled P-value: Q = -10*log10(P). Intuitively, Q/10 is how far behind the decimal point you need to go to describe your chance of type-1 error.

Q-value Chance False Positive "N" as in 1-in-N error
0 1 1
10 0.1 10
20 0.01 100
30 0.001 1000

P-value calculation

Our P-values are calculated by modeling variant observation as a Poisson process. The "P" corresponds to the chance the variant is called in error, ie the chance of false positive. This statistic is also known as the attained significance of the variant call.

In the plot below, we show the probabilities of making specific observations, according to our Poisson model. The y axis is the probability. The x axis is the observed number of mismatches. These particular plots were made assuming a coverage depth of 1000. (The peaks thin as coverage goes up.) Each colored line assumes a different underlying variant frequency. As shown below, for a 4% variant, the probability of observation peaks at 40 observations out of a depth of 1000. For a 10% variant, the probability of observation peaks at 100 observations out of a depth of 1000.

alt text

We assume a noise level of 1% (based on the Q20 base call filter). To make our null-hypothesis as conservative as possible, we assume all our noise is occurring in the variant channel.* If no variant is present and the signal is purely noise, we expect our observations to follow the distribution given by the blue line, peaking at 10 mismatches out of a depth of 1000. In this manner, the blue line represents our null hypothesis.

For simplicity, and to be conservative, the somatic variant caller only considers two alleles: reference and variant. Anyother calls are considered reference calls. For example, A/C/G/T counts of 100/10/1/1 for reference A are considered to be K = 10 variants out of N = 112 coverage. Under the null hypothesis, it is assumed that no variant is present and that any non-reference calls are due to noise. Given a Q20 base filter, the acceptable noise level is 1%. For simplicity, it is assumed that the expected number of nonreference calls due to noise should follow a Poisson distribution with a mean of λ = 0.01*N. The equation P = 1 - CDF(K -1, λ) represents the probability (P) of having K or more variant calls, where CDF is the cumulative distribution function of the Poisson distribution. P is the probability that no variant is present, given K or more observations. In this way, P is the theoretical false-positive rate, and this probability is converted to a Q-score with the maximum Q-score set to 100.

*It might seem more intuitive to evenly allocate expected noise the across the three non-reference channels; However, analysis has shown that noise is not evenly distributed across the channels, and is in fact heavily dependent on the reference and variant base types. Because of this, the more conservative null hypothesis is a better fit when using a flat noise model.

See the QScore recalibration section of this document for optional improvements to Q score calculations.

Frequency calculation

Simply put, the variant frequency we return is the total variant count divided by the coverage depth at the variant location. (Mathematically, the maximum likelihood estimator for lambda of a Poisson distribution is the observation count itself. Ie, the best estimate of our underlying variant frequency is what we observe. http://en.wikipedia.org/wiki/Poisson_distribution#Parameter_estimation).

While the math is uncontroversial, what we call "variant count" and "coverage count" deserves some discussion:

Coverage at a particular location is calculated as the sum of all called reads which pass filters (have sufficient mapping quality and base call quality). Coverage is given as DP in the vcf file. Deleted bases have counted towards coverage since version 3.5.5 . SNV count uses the same filters as the coverage count filters. Indel/MNV count is the sum of all reads passing filters where we were able to determine an indel/MNV. The estimate reference count is the coverage minus the variant count. (So, for an indel or MNV, the "reference count" is really the total number of passing reads that did not have the given variant). Variant frequency is calculated by dividing the variant count by the coverage at the given location*.

*The location used to determine the coverage is the exact SNV position.

** For MNVs and indels, the coverage calculation is more complicated.

Filtering

Any variants found by the algorithms described previously are reported to the output VCF files. These variants may be marked as PASS or filtered. Current filters available in Pisces are strand bias and qscore. The strand bias filter filters out suspicious variants found in one read direction and not another. If strand bias is found, sb[N] is written in the filter column of the .vcf, N being the filter cut-off. The qscore filter filters out suspicious variants found with a very low qscore. If the variant’s qscore is too low, a q[N] is written in the filter column of the .vcf.

Additional filtering is deliberately left to third-party software. When PISCES is used in conjunction with ISAS, this post-processing is applied by the Pisces wrapper, during the variant calling step. Such filters might be overlap, repeat, or additional qscore and strand bias filtering.

#Input Files

Pisces requires one or more BAM files as input or a parent BAM folder. These BAM files are assumed to be sorted, i.e. alignment reads should be sorted by mapped reference position. This assumption allows Pisces to process BAMs in one pass without a large memory footprint. Most standard aligners produce sorted BAM files. Positions in BAM files are expected to use 0-based coordinate system.

For performance, Pisces requires each BAM file to have an accompanying BAI index file. The index file allows Pisces to efficiently jump to chromosomes it is configured to call.

Pisces does not perform denovo variant calling. Therefore, it also requires a reference genome for each BAM file.

Optionally, the user can specify genomic regions of interest for which Pisces should produce allele calls. This is done by providing Picard-style interval files. This format lists intervals defined as a chromosome, and start and end positions, inclusive. Positions in interval files are expected to use 1-based coordinate system. Picard is a common, industry standard tool. An example of a line in an interval file is as follows, where the entries between brackets describe typical hidden characters.

chr1[tab]11299989[tab]11300989[end of line]

Pisces with Specific Aligners

Pisces has been validated for use with the Isas Amplicon Aligner, which is an implementation of the banded-smith waterman aligner.

It has been used internally with BWA-MEM and Isaac 2, without any known issues.

Pisces has no restrictions on indel length, and will call what ever is present in the cigar strings of the bam.

Pisces and QScore Recalibration

As discussed in the Qscore section, Pisces provides a variant call Qscore by calculating the chance false positive, given a Poisson noise model and the observations at hand. This provides a theoretical upper bound for the false positive rate. The true false positive rate is impacted by the sequencing platform, sample prep, and presence of FFPE degradation and oxidative damage. If the DNA is highly unstable, for example, being in advanced tumor state, it may also be difficult to align.

The following Isas workflows (TruSeqAmpliconWorker and Amplicon DS*) re-calibrate the variant call Q scores, by building a unique noise profile that is generated per sample. This re calibration is done by a separate stand-alone executable. It is not part of the Pisces executable. QScore recalibration has been found to significantly reduce the false positive rate (on average 73%, but it has been observed up to 98%) with no impact to the false negative rate. More information on Somatic QScore re-calibration can be found here. (forthcoming)

setting datatype default
variantcallerusesomaticqscorerecalibration bool off, most workflows
variantcallersomaticqscorerecalibrationthreshold int 3

Similar Software at Illumina

The TUNE workflow uses a version of starling for Tumor/Normal variant calling.

Validation

As part of the TSA release, Pisces was been internally validated against a legacy executable, detect.pl, and against a canonical dataset consisting of a series of cancer cell line dilutions at 100%, 10%, 5% 1% and 0%.

In follow on efforts, Pisces was been validated in 5% limit-of-detection studies for Amplicon DS, for specific panels of variants.

Simulations on a range of COSMIC variants, including SNVs, MNVs, insertions and deletions < 25 base pairs, suggest overall sensitivity and specificity in the high 90s.

General

5.2.10

5.2.9

5.2.7

5.2.5

5.2.0

5.1.6

5.1.3

Clone this wiki locally