Annotation of VCF file for import into VarFish (through Web UI).
- ExAC r1.0
- gnomAD exomes r2.1
- gnomAD genomes r2.1
- clinvar
- thousand genomes
- hgmd_public from ENSEMBL r75
VarFish annotator uses HTSJDK for reading variant call format (VCF) files. HTSJDK supports reading VCF v4.3 so the output of any tool that produces well-formed VCF can be read. VCF itself only specifies relatively few required fields and callers may use fields in a slightly different way. We thus document below what fields are used/interpreted by VarFish Annotator to prepare the files for VarFish.
The following fields are considered:
CHROM
Will be written out in a normalized way, depending on the genome build. That thechr
prefix will be presenet for GRCh38 and it will be absent for GRCh37.POS
The 1-based chromosomal position is written out.REF
The reference allele will be written out.ALT
For each alternative allele and each gene, a record is written out. Asterisk alleles will be ignored (see Broad's GATK documentation on this).FORMAT
and perSAMPLE
GT
The genotype is written out with phasing information.AD
The allelic depth of the alternative allele of the current record is written out.DP
The total depth is written out to compute the allelic balance of the alternative allele of the current record is written out.GQ
The genotype quality is written out.
Supported Callers and Caller Annotation
The following variant callers are explicitely supported.
- Delly 2 (SVs)
- Dragen CNV caller
- Dragen SV caller
- Manta
- GATK gCNV
- XHMM (deprecated)
In the other cases, VarFish annotator will fall back to a "generic" import where only the per-sample fields GT
, FT
, and GQ
are interpreted.
Your caller should also write out INFO/END
, INFO/SVTYPE
, and INFO/SVLEN
as defined by VCF4.2
VarFish Annotator will look at the field INFO/SVMETHOD
to annotate calls with the caller where the call originated from.
If this field is empty then you should define --default-sv-method
so you get appropriately labeled output.
If you have any problem with your data then please tell us by opening a GitHub issue.
Interpretation of top-level and INFO VCF fields
The following fields are considered:
CHROM
Will be written out in a normalized way, depending on the genome build. That thechr
prefix will be presenet for GRCh38 and it will be absent for GRCh37.POS
The 1-based chromosomal position is written out.REF
The reference allele will be written out.ALT
For each alternative allele and each gene, a record is written out. Asterisk alleles will be ignored (see Broad's GATK documentation on this).INFO/CHR2
Second chromosome of the SV, if not on the same chromosome.INFO/END
End position of the SV.INFO/CT
Paired-endsignature induced connection type.INFO/CIPOS
Confidence interval around the start point of the SV.INFO/CIEND
Confidence interval around the end point of the SV.INFO/SVMETHOD
The name of the caller that was used.
Interpretation of FORMAT
and per sample fields
- Common
GT
Genotype, written asgt
FT
Per-genotype filter values, written asft
GQ
Phred-scaled genotype quality, written asgq
- Delly2
DR
Reference pairs, written aspec = DR + DV
DV
Variant pairs, written aspev
RR
Reference junction count, written assrc = RR + RV
RV
Variant junction count, written assrv
RDCN
Copy number estimate, written ascn
- Dragen CNV
SM
Average normalized overage, written asanc
BC
Bucket count, written as point countpc
PE
Discordante read count at start/end, written aspev = PE[0] + PE[1]
- Dragen SV
PR
Paired read of reference and variant, written aspec = PR[0] + PR[1]
andpev = PR[1]
SR
Paired read of reference and variant, written assrc = SR[0] + SR[1]
andsrv = SR[1]
- For GATK gCNV
CN
Integer copy number, written ascn
NP
Number of points in segment, written asnp
- Manta (equivalent to Dragen SV)
- For XHMM
RD
Average normalized coveage, written asan
The following will create varfish-annotator-db-1906.h2.db
and fill it.
# DOWNLOAD=path/to/varfish-db-downloader
# ANNOTATOR_VERSION=0.9
# ANNOTATOR_DATA_RELEASE=1907
# java -jar varfish-annotator-cli-$ANNOTATOR_VERSION-SNAPSHOT.jar \
init-db \
--db-release-info "varfish-annotator:v$ANNOTATOR_VERSION" \
--db-release-info "varfish-annotator-db:r$ANNOTATOR_DATA_RELEASE" \
\
--ref-path /fast/projects/cubit/18.12/static_data/reference/GRCh37/hs37d5/hs37d5.fa \
\
--db-release-info "clinvar:2019-02-20" \
--clinvar-path $DOWNLOAD/GRCh37/clinvar/latest/clinvar_tsv_main/output/clinvar_allele_trait_pairs.single.b37.tsv \
--clinvar-path $DOWNLOAD/GRCh37/clinvar/latest/clinvar_tsv_main/output/clinvar_allele_trait_pairs.multi.b37.tsv \
\
--db-path ./varfish-annotator-db-$ANNOTATOR_DATA_RELEASE \
\
--db-release-info "exac:r1.0" \
--exac-path $DOWNLOAD/GRCh37/ExAC/r1/download/ExAC.r1.sites.vep.vcf.gz \
\
--db-release-info "gnomad_exomes:r2.1" \
$(for path in $DOWNLOAD/GRCh37/gnomAD_exomes/r2.1/download/gnomad.exomes.r2.1.sites.chr*.normalized.vcf.bgz; do \
echo --gnomad-exomes-path $path; \
done) \
\
--db-release-info "gnomad_genomes:r2.1" \
$(for path in $DOWNLOAD/GRCh37/gnomAD_genomes/r2.1/download/gnomad.genomes.r2.1.sites.chr*.normalized.vcf.bgz; do \
echo --gnomad-genomes-path $path; \
done) \
\
--db-release-info "thousand_genomes:v3.20101123"
$(for path in $DOWNLOAD/GRCh37/thousand_genomes/phase3/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf.gz; do \
echo --thousand-genomes-path $path; \
done) \
\
--db-release-info "hgmd_public:ensembl_r75" \
--hgmd-public $DOWNLOAD/GRCh37/hgmd_public/ensembl_r75/HgmdPublicLocus.tsv
# mvn com.coveo:fmt-maven-plugin:format -Dverbose=true
The folder /tests
contains some data sets that are appropriate for system (aka "end-to-end") tests of the software.
hg19-chr22
-- This folder contains examples for annotating GATK HC and Delly2 calls on the first 20MB of chr22. Only the variants overlapping withADA2
andGAB4
are used.
You can build the data sets with the build.sh
script that is available in each folder.
This script also serves for documenting the test data's provenance.
The Jannovar software must be available as jannovar
(e.g., through bioconda) on your PATH
and you will need samtools
.
The tests use junit5-system-exit for detecting System.exit()
calls.
In JDK 18 you have to use the -Djava.security.manager=allow
flag.
Issue tginsberg/junit5-system-exit#10 tracks this issue.
There is an issue with removing temporary directories on Windows.
Apparently, HTSJDK does not properly close files.
Set -Djunit.jupiter.tempdir.cleanup.mode.default=NEVER
to work around this issue.