-
Notifications
You must be signed in to change notification settings - Fork 7
Home
Welcome to the Cenote-Taker 2 wiki!
With any suggestions to improve Cenote-Taker 2 or if you believe you have verifiable viral contigs that the pipeline is not detecting, please email me at [email protected]
Hopefully, the instructions provided in the repo readme provide enough guidance on installing Cenote-Taker 2 on your linux HPC or cloud. I always suggest saving a log file of the install to check for errors that you may not notice until trying to run Cenote-Taker 2 later.
There are two test datasets provided in the repo. One contains contigs from DNA viruses, the other contains contigs from RNA viruses. Info below was generated from Cenote-Taker 2 v2.1.5
This file contains five contigs. Here's what they are and what to expect from Cenote-Taker 2: testcontig1 = a circular sequence of a novel microvirus-like sequence quite divergent from current GenBank entries. Should find genome map in the main sequin_and_genome_maps/
testcontig2 = a circular sequence of a known circovirus (porcine circovirus 1). Should find genome map in the main sequin_and_genome_maps/
testcontig3 = a linear fragment of a novel phycodnavirus quite divergent from current GenBank entries. Should find genome map in sequin_and_genome_maps/
testcontig4 = a linear fragment of a known crAss-like phage containing several viral hallmark genes. Should find genome map in sequin_and_genome_maps/
test_contig5 = a linear fragment of a known crAss-like phage with only "accessory" or non-hallmark genes. With default settings, this will not be annotated. If you use -am True
, a genome map will be found in sequin_and_genome_maps/
What the output genome maps should look like:
Command that I ran:
python run_cenote-taker2.py -c testcontigs_DNA_ct2.fasta -r test_DNA_ct -p True -m 32 -t 32
Note: this command cannot be copy/pasted into your terminal. You must specify the correct path to your Cenote-Taker 2 directory and script.
Output "CONTIG_SUMMARY.tsv" file:
ORIGINAL_NAME CENOTE_NAME ORGANISM_NAME END_FEATURE LENGTH ORF_CALLER NUM_HALLMARKS HALLMARK_NAMES BLASTN_INFO
testcontig1 test_DNA_ct1 Petitvirales sp. ctNWH1 DTR 6097 Prodigal 1 Varsani_microviridae_cluster.14/PFAM:PF02305.16-Major-capsid-protein BLASTN not conducted
testcontig2 test_DNA_ct2 Circovirus sp. ctsq52 DTR 1760 Prodigal 1 circoviridae_cluster.1/PFAM:PF02443.14-capsid-protein BLASTN not conducted
testcontig3 test_DNA_ct3_vs01 Phycodnaviridae sp. ctSNK3_01 None 9783 Prodigal 1 phycodna_cluster.4/PDB:5TIP_B-Major-capsid-protein BLASTN not conducted
testcontig4 test_DNA_ct4_vs01 crAss-like phage sp. ctyu04_01 None 18176 PHANOTATE 3 crass_cluster.2/PFAM:PF16510.4-portal-protein|crass_cluster.3/PDB:3EZK_B-Terminase-large-subunit|crass_cluster.1/PFAM:PF17236.1-Major-Capsid-Protein BLASTN not conducted
This file contains two contigs. Here's what they are and what to expect from Cenote-Taker 2:
testcontig6 = a linear sequence from a known human astrovirus. Should find genome map in the main sequin_and_genome_maps/
testcontig7 = a linear sequence from a novel levivirus-like sequence. Should find genome map in the main sequin_and_genome_maps/
What the output genome maps should look like:
Command that I ran:
python run_cenote-taker2.py -c testcontigs_RNA_ct2.fasta -r test_RNA_ct -p True -m 32 -t 32 -db rna_virus
Output "CONTIG_SUMMARY.tsv" file:
ORIGINAL_NAME CENOTE_NAME ORGANISM_NAME END_FEATURE LENGTH ORF_CALLER NUM_HALLMARKS HALLMARK_NAMES BLASTN_INFO
testcontig7 test_RNA_ct2_vs01 Norzivirales sp. ctFyT2_01 None 3422 PHANOTATE 1 levi_cluster.2/PFAM:PF03431.12-RDRP BLASTN not conducted
testcontig6 test_RNA_ct1_vs01 Mamastrovirus sp. ctevq1_01 None 6415 Prodigal 2 astro_cluster.1/PDB:5I61_A-RDRP|astro_cluster.2/PFAM:PF03115.13-capsid BLASTN not conducted
Note: this command cannot be copy/pasted into your terminal. You must specify the correct path to your Cenote-Taker 2 directory and script.
The .gbf files produced by successful runs can be opened with any genome/plasmid viewer. Free softwares include Ugene and SnapGene Viewer. Paid softwares include Geneious, MacVector, and SnapGene. Others probably exist. Note that changes to .gbf files will not be reflected on the .sqn files that are submittable to GenBank.
-p True
trims down input contigs that contain phage genomes to remove non-phage bacterial chromosomes
Original contig:
^^^^BACTERIAL CHROM^^^^^\/\/\/\/\/\/PHAGE GENOME\/\/\/\/\/\/\
^^^^BACTERIAL CHROM^^^^^
"Pruned" sequence after pruning module:
\/\/\/\/\/\/PHAGE GENOME\/\/\/\/\/\/\
For each provided set of contigs, ORFs are translated and compared to a Hidden Markov Model database of viral hallmark genes using hmmscan.
In this case, multiple sequence alignments of related amino acid sequences are prepared. Then a probabilistic model of the position and likelihood of conserved amino acid residues is created using hmmer (i.e. a Hidden Markov Model). Many models go into a database.
I define this as a gene only found in some viruses but never in non-viral sequences. I also ONLY consider genes with known function. The classic example of this is a virus capsid gene.
I considered all available virus proteins from GenBank. I also considered protein sequences from a variety of assembled DNA viromes. Pairwise alignment of sequences was conducted using EFI-EST. Clusters of related sequences were clustered with the MCL add-on in Cytoscape. Clusters were extracted, multiple sequence alignments (MSAs) were prepared with MAFFT. MSAs were used as queries for HHsearch. Best hits with probability > 85% were used to name HMMs, which were then prepared with hmmer. HMMs identified as viral replication, packaging or virion proteins were kept. HMMs were tested for specificity to to viral proteins by querying known NON-VIRAL proteins with a permissive E value. HMMs with hits to non-viral proteins were removed. HMMs with coil-coil and Ig-like domains were also removed as these can generate non-specific hits. Some viral HMMs from CDD and Pfam were also used.
Plasmids and conjugative transposons can swap genes with viruses, most notably dsDNA phage. I've taken steps to identify these in the pipeline via gene composition analysis and BLASTX.
The adding more CPUs should improve the speed of processing datasets.
The best way to increase the speed for users mainly interested in virus discovery is to turn off HHsuite via -hh none
. I would recommend doing this by default if you want to run many large datasets as quickly as possible. Or, for those just interested in quick discovery without generation of genome maps, run python unlimited_breadsticks.py
.
This is really up to the user. Typically, assemblers generate lots of small contigs and few long contigs. Changing minimum consideration from 5000 nt to 1000 nt may increase the total amount of sequence to be analyzed by several fold. Changing '--minimum_length_linear' has the greatest effect as most sequences are not circular. Therefore, the decision is really just one of how many resources do you have to throw at the dataset, how long are you willing to wait, and how important are small contigs to you?
Hard limits are 1000 nt for detection of circularity and 4000nt for detection of ITRs. Smaller contigs will still be considered as 'linear' contigs if you have lower minimum settings for '--minimum_length_linear'.
There are many methods for enriching for viral nucleic acid, most of which involve enriching physical viral particles. Checking your purity with ViromeQC or other bioinformatic tools could inform which discovery thresholds are trustworthy.
Minimum hallmark genes for circular/ITR sequences (--circ_minimum_hallmark_genes
):
Circularity itself is a feature common to some viral families (but also to plasmids) that can clue you in to the nature of the element at hand (same goes for inverted terminal repeats, i.e. ITRs). Because some viruses are so divergent or independently derived from known viruses (example), analyzing all circular sequences in a virus enriched set can be fruitful. To do this, set --circ_minimum_hallmark_genes 0
. Inevitably, some "circles" without hallmark genes will be false positives, so be vigilant. Otherwise, set this to --circ_minimum_hallmark_genes 1
.
Minimum hallmark genes for linear sequences (--lin_minimum_hallmark_genes):
For enriched datasets, I'd set this at --lin_minimum_hallmark_genes 1
. Most viral contigs in your dataset will represent incomplete genome fragments. Further, large DNA viruses have long "accessory regions" that don't contain any genes I'd consider to be "hallmarks". These accessory regions will be missed, unfortunately. However, even in samples that appear to be highly enriched for virus particles, there will be non-viral elements. The nature of these elements is yet-to-be-determined, but include PICIs.
Such datasets have a lot more 'noise' than 'signal', and using a little more caution might be beneficial.
First of all, please always use -p True
with these samples. My experience suggests that around half of long viral contigs from unenriched WGS metagenome datasets are flanked by bacterial chromosome sequences on at least one side.
If your data contain linear large eukaryotic DNA viruses and you've elected to prune prophages, these sequences may have gotten over-chopped or over-pruned. Check the .pdf generated by the pruning script. You can rerun these specific contigs with -p False
to recover the entire contig.
I'd suggest using -db virion
for unenriched datasets/bacterial genomes, as replication genes can occasionally be exchanged with other mobile genetic elements, and a small false-positive rate might be observed.
Minimum hallmark genes for circular/ITR sequences (--circ_minimum_hallmark_genes
):
I'd suggest using --circ_minimum_hallmark_genes 1
Minimum hallmark genes for linear sequences (--lin_minimum_hallmark_genes
):
I'd suggest using --lin_minimum_hallmark_genes 2
, but 1 could be used.
Suppose you've used some other method to infer which of your contigs are viral. We can skip the discovery step and just do annotation with -am True
:
Note: other arguments (such as -hh
are up to the user, but each should be considered.
Detection of circularity in contigs assembled from Illumina data relies on the fact that most assemblers (e.g. SPAdes, MEGAhit, IDBA_UD) will append direct terminal repeat sequences (equal in length to the largest k-mer used in assembly) to the ends of a contig. Some assemblers will not do this (e.g. CLC Bio), and circularity cannot be determined. Bear in mind that circularity cannot be distinguished from presence of direct repeats on a linear DNA molecule if the repeat sequences is greater in length than the largest k-mer used for assembly.
SPAdes with "Vanilla" settings will occasionally concatenate two ssDNA viral genomes that share a highly conserved hairpin region. Using the '--meta' and '--plasmid' SPAdes options seem to eliminate this problem.
Cenote-Taker 2 does not try to bin separate segments of segmented viruses. I understand people might look for contigs with identical terminal sequences to group segments. I don't know how to prepare GenBank files for this type of sequence.
The basic premise for this module is that it quickly annotates each gene on a contig, scores each position based on the annotation (+10 for common virus genes, +5 for hypothetical genes, -3 for common chromosomal genes, 0 for intergenic), and marks the location of virus hallmark genes. Then, sliding 5kb windows are scored and smoothed. Positive "chunks" of windows with the user-determined minimum number of virus hallmark genes get kept as viral sequences. Everything else is discarded.
I've tested the prophage pruning module on most of the caudoviruses in RefSeq. These should NOT be cut or trimmed by the module, as they contain purely viral sequence. Encouragingly, ~97% of these genomes remain intact, whereas the other ~3% get cut or trimmed in regions where they've picked up genes from bacterial chromosomes that are not commonly found in other caudoviruses.
Error can occur the other way as well, when flanking chromosomal regions don't get fully trimmed. I've seen this occasionally when the virus is integrated adjacent to 'hypothetical' genes or genes that are common in both chromosomal and viral sequences (e.g. some topoisomerases). Ideas on 'rules' I could implement to improve the accuracy of the prophage pruning are most welcome. For example, I'm unaware of how I could accurately identify 'Att' sites. Relatedly, integrase genes are often found near one extreme of an integrated virus's genome. However, there are sometimes genes between the integrase and the integration site, so a hard-and-fast rule seems inappropriate.
Virus taxonomy is extremely challenging, particularly for divergent viruses. Cenote-Taker 2 looks at annotations to find the "best" gene to make a taxonomy call (e.g. RdRp for RNA viruses, terminase or portal for tailed phage). The amino acid sequence of said gene is compared to a database of amino acids from taxonomically defined viruses (mostly RefSeq). This is mostly correct, but finer resolution can be obtained using other, lower throughput methods. VContact2 is typically insightful for bacteriophage (this tool is also available with GUI on CyVerse).
Standard AA sequence approach thresholds:
Hallmark AAI to Reference | Taxonomic granularity from CT2 |
---|---|
>90% | Genus, e.g. "Ilzatvirus" |
>40% | Family, e.g. "Siphoviridae" |
>25% | Order, e.g. "Caudovirales" |
=<25% | Generic name, e.g. "phage" |
With BLASTN module, each contig in its entirety will be aligned to the reference database, and if a contig aligns to a reference genome with at least 95% ANI and 85% AF (alignment fraction), the contig will be considered to belong the same species as the reference.
I'm aware that ICTV has updated and changed phage taxonomy significantly and NCBI has not yet followed their lead. Unfortunately, I am unaware of how I could use ICTV data for taxonomy calls.
See this repo's release notes for more recent updates.
CHANGE LOG - March 19, 2020
Created 'patch1_2.0.1' branch, made changes, then merged back to master Fixed path to BLASTN database. Improved BLASTN results reporting in .gbf as well as table output. Even close matches to entry in BLASTN database will be annotated with same modules as divergent contigs. Fixed a bug where contigs with all ORFs being annotated with custom viral HMMs were misformatted. Fixed some minor things with taxonomy calls. This includes attempts to identify conjugative transposons that have similar replication modules to dsDNA phage. If no virion/packaging genes are present but at least one conjugative transposon-associated gene is present, contig is labeled as conjugative transposon. Added a wiki with some but not all relvant info on pipeline. Reformatted README with new schematic and updated info. Added DNA test contigs and RNA test contigs.
CHANGE LOG - April 21, 2020
Created 'patch4_2.0.1' branch, made changes, then merged back to master. Fixed formatting bug that cut off certain RPSBLAST result descriptions. Modified all RPSBLAST searches to '-seg yes' which filters out low-complexity hits Added ~600 pVOG HMMs from here, mostly to bolster viral hallmarks of bacteriophage. Added Quenyavirus RDRP model to rna_virus HMM database and some manually-trimmed helicase domains from small DNA viruses. Added the ORF calling info (PHANOTATE or Prodigal) to the summary .tsv file.
CHANGE LOG - May 7, 2020
Made minor adjustments to the prophage pruning module to allow pruning annotation of up to 99 prophage from a single parent contig. Added coordinates to pruned prophage .fasta header corresponding with parent contig.
CHANGE LOG - September 15, 2020
Made fairly major edits of the virus hallmark gene HMM database. One change was that models that contained a polyprotein or fusion protein were trimmed down to only contain the hallmark gene domain (e.g. models with major capsid protein fused with capsid maturation protease were trimmed to only have the major capsid protein domain). This effected many RNA virus models and some dsDNA models. Also, genes from GenBank identified by VirFam as phage virion genes but missed by the old Cenote-Taker 2 HMMs were identified (Thanks, Igor Tolstoy), and used to generate new HMMs. A small number of models were removed due to non-specific hits.
Also, RPS-BLAST was parallelized more efficiently in the module for linear virus contig annotation.
CHANGE LOG - October 2, 2020
An option "virion" for the "--virus_domain_db" argument was added. With this option, only hallmark genes encoding virion structural (e.g. capsid) genes, packaging genes, and virion maturation genes will be used in the hallmark count (in regards to "--lin_minimum_hallmark_genes" and "--circ_minimum_hallmark_genes"). A bug where Cenote-Taker wasn't finding read coverage information for linear contigs was fixed. Several changes were made that don't effect the output of the program, but the fixes prevent "ugly" bash errors from being spit out to the terminal. Edits tested on 'Patch7_2.0.1' branch.
CHANGE LOG - March 4, 2021
I've made some very large changes to the code. Mostly, these changes increase paralellization of BLAST, HMMER, HHblits/HHpred, and the pruning module. Further, all genome maps for the entire run can be found in 'sequin_and_genome_maps/'. I've added annotation mode: -am True
, discovery only mode: unlimited_breadsticks.py
, and improved the output tables. Also, about 100 hallmark gene HMMs were removed for being non specific to viruses, and about 50 new hallmark genes were added from divergent genomes identified with SEEKER. With these changes, I've made Release v2.1.1
Edits tested on 'Patch8_2.0.1' branch.