A tumor-normal variant-calling workflow for HiFi data
See our tech note here on somatic variant detection with HiFi sequencing.
- Table of contents
- Installation and Dependencies
- Usage
- Important outputs from workflow
- Demo datasets
- References
- Tools versions
- Change logs
- DISCLAIMER
The workflow is written in Workflow Description Language (WDL) version 1.0. It depends on `miniwdl` and `singularity` (version 3 or later). `miniwdl` can be installed using Bioconda. The workflow can also be run using Cromwell (see here for instructions; tested on version 86).
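Before launching a run, it can help to confirm that the two dependencies named above are on `PATH`. The snippet below is a minimal sketch and not part of the workflow; the helper name `check_dependencies` is hypothetical, and it only checks that the executables can be found, not that singularity is version 3 or later.

```python
import shutil


def check_dependencies(tools=("miniwdl", "singularity")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


# Report which of the required executables were found.
for tool, path in check_dependencies().items():
    print(f"{tool}: {path if path else 'NOT FOUND'}")
```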
A step-by-step tutorial demonstrating usage and an FAQ can be found here.
The workflow generates the following results (non-exhaustive list) in the `$OUTDIR/_LAST/out` folder. Please refer to output for a more detailed description of the outputs. An example of the final HTML report from the COLO829 dataset can be found here (right-click to save the file, then double-click to open it in a web browser).
Folder | Types of results |
---|---|
AnnotatedSeverusSV | Severus structural variants annotated with AnnotSV (TSV, see AnnotSV README) |
Annotated*SV_intogen | Structural variants annotated with AnnotSV (TSV) that overlap the Compendium of Cancer Genes (IntOGen, May 2023) |
small_variant_vcf_annotated | DeepSomatic or ClairS SNV/INDEL annotated with VEP (VCF, single entry per variant with `--pick`, see VEP documentation) |
small_variant_tsv_annotated | VEP annotation for SNV/INDEL in TSV format |
small_variant_tsv_CCG | SNV/INDEL that are in the Compendium of Cancer Genes (IntOGen, May 2023) |
mutsig_SNV_profile | Mutational profile plot (MutationalPatterns) |
mutsig_SNV | Mutational signature in TSV format (MutationalPatterns) |
normal_germline_small_variant_vcf_annotated | ClairS/Clair3 germline SNV/INDEL (in the normal sample) annotated with VEP (optional, see input JSON parameters) |
DMR_annotated | Differentially methylated region annotated with genes/introns/promoters etc (TSV) |
DMR_results | Raw differentially methylated region from DSS (Unannotated, TSV) |
DMR_annotated_CCG | Annotated DMR (>50 CpG sites) overlapping the Compendium of Cancer Genes (IntOGen, May 2023) |
mosdepth_normal_summary | Depth of coverage of normal (TXT) |
mosdepth_tumor_summary | Depth of coverage of tumor (TXT) |
normal_bams_phased | Phased normal BAM file (HiPhase) |
tumor_bams_hiphase | Phased tumor BAM file (HiPhase) |
tumor_bams_longphase | Phased tumor BAM file (optional; uses Longphase for phasing, see input JSON parameters) |
overall_(tumor\|normal)_alignment_stats | Overall alignment statistics (mapped %) |
per_alignment_(tumor\|normal)_stats | Statistics (accuracy/n_mismatches/length) for each alignment |
aligned_RL_summary_(tumor\|normal) | Aligned read length N50, mean and median |
normal_germline_small_variant_vcf | Germline variants in normal (VCF) |
tumor_germline_small_variant_vcf | Germline variants in tumor (VCF) |
pileup_(normal/tumor)_bed | Summarized 5mC probability in normal and tumor (BED, see pb-CpG-tools for format description) |
cnvkit_cns_with_major_minor_CN | Copy number segments adjusted with purity and ploidy estimate, see cnvkit_output for raw CNVKit result (BED) |
Severus_filtered_vcf | Severus structural variants (filtered against the control VCF, with simple annotation from svpack) |
small_variant_vcf | DeepSomatic or ClairS SNV/INDEL (Unannotated VCF) |
Purple_outputs | Purity and ploidy estimate + allele-specific copy number calls from HMFtools suite |
chord_hrd_prediction | Homologous recombination deficiency (HRD) prediction using CHORD |
report | HTML report summarizing the results. It can be opened in any modern web browser. The report is only generated if all steps in the pipeline are carried out (e.g. small variant calling, SV annotation) |
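As a quick check after a run, the result folders can be listed from `$OUTDIR/_LAST/out`. This is a minimal sketch; the function name `list_output_folders` is hypothetical, and it assumes each result in the table above appears as a subfolder of that directory.

```python
from pathlib import Path


def list_output_folders(outdir):
    """Return the sorted names of result folders under <outdir>/_LAST/out."""
    out = Path(outdir) / "_LAST" / "out"
    if not out.is_dir():
        raise FileNotFoundError(f"No output folder at {out}")
    # Keep directories only; loose files in the folder are ignored.
    return sorted(p.name for p in out.iterdir() if p.is_dir())
```

Calling `list_output_folders(os.environ["OUTDIR"])` (after `import os`) and comparing the result against the table shows at a glance which optional steps produced output.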
There are two cancer cell lines sequenced on Revio systems, provided by PacBio:
- COLO829 (60X tumor, 60X normal): https://downloads.pacbcloud.com/public/revio/2023Q2/COLO829
- HCC1395 (60X tumor, 40X normal): https://downloads.pacbcloud.com/public/revio/2023Q2/HCC1395/
More datasets and benchmarking can be found on the Severus GitHub page and in the DeepSomatic preprint.
References of tools used
Following are the references for the tools used in the workflow, which should be cited if you use the workflow. The list may not be exhaustive; we welcome suggestions for additional references.
- Zheng, Z. et al. ClairS: a deep-learning method for long-read somatic small variant calling. 2023.08.17.553778 Preprint at https://doi.org/10.1101/2023.08.17.553778 (2023).
- English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biology 23, 271 (2022).
- Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).
- Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
- Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
- Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single molecule sequencing. Nat Methods 15, 461–468 (2018).
- Pedersen, B. S. & Quinlan, A. R. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 34, 867–868 (2018).
- McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biology 17, 122 (2016).
- Park, Y. & Wu, H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 32, 1446–1453 (2016).
- Talevich, E., Shain, A. H., Botton, T. & Bastian, B. C. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLoS Comput Biol 12, e1004873 (2016).
- Martínez-Jiménez, F. et al. A compendium of mutational cancer driver genes. Nat Rev Cancer 20, 555–572 (2020). https://www.intogen.org
- Manders, F. et al. MutationalPatterns: the one stop shop for the analysis of mutational processes. BMC Genomics 23, 134 (2022).
- Lin, J.-H., Chen, L.-C., Yu, S.-C. & Huang, Y.-T. LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants. Bioinformatics 38, 1816–1822 (2022).
- HMFtools suite (Amber, Cobalt and Purple): https://github.com/hartwigmedical/hmftools/tree/master.
- Park, J. et al. DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies. 2024.08.16.608331 Preprint at https://doi.org/10.1101/2024.08.16.608331 (2024).
- Elrick, H. et al. SAVANA: reliable analysis of somatic structural variants and copy number aberrations in clinical samples using long-read sequencing. 2024.07.25.604944 Preprint at https://doi.org/10.1101/2024.07.25.604944 (2024).
- Nguyen, L., Martens, J. W. M., Van Hoeck, A. & Cuppen, E. Pan-cancer landscape of homologous recombination deficiency. Nat Commun 11, 5584 (2020).
- Keskus, A. et al. Severus: accurate detection and characterization of somatic structural variation in tumor genomes using long reads. 2024.03.22.24304756 Preprint at https://doi.org/10.1101/2024.03.22.24304756 (2024).
Tools used in the workflow
Tool | Version | Purpose | Container |
---|---|---|---|
pbmm2 | 1.14.99 | Alignment of HiFi reads | quay.io |
pbtk | 3.1.0 | Merging HiFi reads | quay.io |
samtools | 1.17 | Various tasks manipulating BAM files | quay.io |
VEP | 110.1 | Annotation of small variants | docker |
AnnotSV | 3.4.12 | Annotation of structural variants | quay.io |
DSS | 2.48.0 | Differential methylation | self-hosted on quay.io |
annotatr | 1.26.0 | Annotation of differentially methylated region (DMR) | self-hosted on quay.io |
ClairS | 0.3.0 | Somatic SNV and INDEL caller | docker |
bcftools | 1.17 | Manipulation of VCF | quay.io |
CNVKit | 0.9.10 | Copy number segmentation | quay.io |
Truvari | 4.0.0 | Filtering of control structural variants (Deprecated, using svpack instead) | quay.io |
bedtools | 2.31.0 | Splitting genome intervals for parallelization | quay.io |
mosdepth | 0.3.4 | Calculating depth of coverage | quay.io |
pb-CpG-tools | 2.3.1 | Summarizing 5mC probability | quay.io |
HiPhase | 1.4.5 | Diploid phasing using germline variants | quay.io |
slivar | 0.3.0 | Selecting/filtering variants from VCF | quay.io |
Severus | 1.2 | Structural variants | quay.io |
seqkit | 2.5.1 | Aligned BAM statistics | quay.io |
csvtk | 0.27.2 | Aligned BAM statistics summary and other CSV/TSV operations | quay.io |
IntOGen | May 31 2023 | Compendium of Cancer Genes for annotation | self-hosted on quay.io |
MutationalPatterns | 3.10.0 | Mutational signatures based on SNV | quay.io |
Longphase | v1.5.2 | Optional phasing tool | quay.io |
Amber | v4.0 | BAF segmentation (HMFtools suite) | self-hosted on quay.io |
Cobalt | v1.16.0 | Log ratio segmentation (HMFtools suite) | self-hosted on quay.io |
Purple | v4.0 | Purity and ploidy estimate, somatic CNV (HMFtools suite) | self-hosted on quay.io |
DeepSomatic | v1.7.0 | Somatic SNV/INDEL caller | docker |
CHORD | v2.0.0 | HRD prediction | docker |
SAVANA | v1.2.3 | Structural variants and copy number variants caller | quay.io |
Change logs
- v0.8.1:
  - Moved the BND square-bracket annotation in the VCF to the INFO field to prevent AnnotSV from harmonizing the BND format (harmonization does not work well with long-read SVs).
  - Fixed a bug preventing `skip_align` from working properly.
  - Added an option to produce SAVANA output. This is experimental and can be enabled with `hifisomatic.run_savana` in the input JSON.
    - Note that the SAVANA output is currently not annotated or used in any downstream processing.
  - Updated Severus to 1.2.0.
  - `bcftools norm` is now run on small variants before annotation.
  - Updated HiPhase to 1.4.5.
  - Updated reference list.
- v0.8:
  - Updated DeepSomatic to v1.7.0. This resulted in a significant improvement in INDEL recall. See the benchmark in the DeepSomatic preprint for more comparisons.
  - Updated AnnotSV to 3.4.2. Please update the AnnotSV cache by following the instructions in the step-by-step tutorial here.
  - Updated pbmm2 to 1.14.99 (with the `-A2` option for better alignment of some complex SVs with short supplementary segments, e.g. truthset_41 in COLO829).
  - Updated Severus to version 1.1.
    - Note that in COLO829 `truthset_19` becomes a false negative (FN) with Severus 1.1. See issue here.
  - Updated the report format to make it easier to read.
  - Resource bundle now uses germline SVs called with Severus instead of the previous Sniffles2 SV set. Please update the resource bundle from Zenodo.
  - Simplified kinetics stripping by doing it directly in pbmm2.
  - Modified Amber to use the same pcf gamma as Cobalt. It was previously using a value of 100, while Cobalt was using 1000. This change makes the segmentation more consistent between Amber and Cobalt and should improve purity/ploidy estimates.
  - Better logic for merging BAMs in the workflow (no more redundant merging when n_bam=1).
  - Suppressed a warning that caused Cobalt to fail on a CIGAR error in the BAM file. This is a known issue.
  - Incorporated pull request from here for Cromwell on Azure (not tested).
- v0.7:
  - Updated DeepSomatic to v1.6.1.
  - Pipeline now calls DeepSomatic in chunks.
- v0.6.2:
  - Added experimental CHORD HRD (Homologous Recombination Deficiency) prediction. See here for details.
  - Renamed some variables (legacy Sniffles parameters).
  - Swapped CNVkit visualization for Purple in the report.
  - Made changes in WDL for compatibility with Cromwell.
  - Allowed specifying min/max purity/ploidy for Purple (in task WDL).
- v0.6.1:
  - Updated documentation and benchmark.
  - Small bugfix in tabix indexing of VCFs.
  - Added HTML report for summary metrics.
  - Fixed a bug where germline VCFs were not output when using DeepSomatic.
  - Fixed a bug introduced in svpack filtering where entries with `SVLEN=0` were filtered out (they should not be).
- v0.6:
  - Added DeepSomatic 1.6.0 (experimental and disabled by default; enable with `hifisomatic.use_deepsomatic` in the input JSON). Note that DeepSomatic is computationally expensive compared to ClairS, so we recommend disabling it if computational resources are limited. See benchmark here for a comparison between ClairS and DeepSomatic.
  - Use `svpack` to filter for control SVs (previously using truvari) and provide simple annotation in the filtered VCF.
  - Switched to using `samtools` to strip kinetics.
- v0.5:
  - Updated Cobalt to 4.0. It now counts read depth correctly. See here for details.
  - Containers are now on pacbio quay.io.
  - SV calling now only uses Severus.
  - Individual tasks now output the version number in stdout.
- v0.4:
  - Added purity, ploidy and somatic CNV with Amber, Cobalt and Purple.
    - Note that Cobalt doesn't count the read depth from long reads correctly, which affects the segmentation accuracy. However, purity and ploidy estimation appears to be robust.
  - CNVKit segmentation results recalled with purity and ploidy estimates from Purple.
  - Severus release now updated with Bioconda container.
  - Fixed an issue when call_smallvariants is set to false (Issue #1).
- v0.3:
  - Added IntOGen filtering of SV/SNV/INDEL/DMR.
  - Added mutational signature analysis.
  - Added germline small variants annotation with VEP (optional).
  - Added Longphase as an optional phasing tool.
  - Better documentation of output in output.
- v0.2:
  - Downgraded to WDL 1.0 for better compatibility.
  - Added runtime attributes to tasks for future support on cloud (not tested yet).
- v0.1: Initial release.
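The change logs mention two switches that are set in the input JSON, `hifisomatic.run_savana` and `hifisomatic.use_deepsomatic`. A minimal sketch of setting them programmatically follows. It assumes both are boolean inputs (the change logs only say they are enabled in the input JSON); the function name `enable_options` and any file names passed to it are hypothetical.

```python
import json


def enable_options(input_json_path, output_json_path,
                   keys=("hifisomatic.run_savana", "hifisomatic.use_deepsomatic")):
    """Copy an input JSON, setting each key in `keys` to true.

    Assumption: the named inputs are booleans. Other entries are left unchanged.
    """
    with open(input_json_path) as fh:
        inputs = json.load(fh)
    for key in keys:
        inputs[key] = True
    with open(output_json_path, "w") as fh:
        json.dump(inputs, fh, indent=2)
    return inputs
```

Writing to a separate output file keeps the original input JSON untouched, so the default configuration can still be rerun for comparison.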
TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.