Home
If you include the `--color` flag in your `PhiSpy.py` command, we will color all the genes in the genome (whether or not they are predicted to be in prophages) so you can visualize them in Artemis (see the Artemis PDF manual for more information on Artemis).
The colours we currently use are:
Color | Meaning |
---|---|
Red | This is an integrase or recombinase that likely marks a prophage insertion |
Blue | This is another phage protein other than integrase |
Pink | This is a mobile genetic element |
Light Grey | This is a protein with unknown (hypothetical) function |
Light Blue | This protein has a hit to your hmm database (note we only colour proteins that are not already red/blue/pink) |
`PhiSpy` has the option of creating multiple output files with the prophage data:
- prophage_coordinates.tsv (code: 1)
These are the coordinates of each prophage identified in the genome, and their att sites (if found), in tab-separated text format.
The columns of the file are:

1. Prophage number
2. The contig upon which the prophage resides
3. The start location of the prophage
4. The stop location of the prophage

If we can detect the att sites, the additional columns are:

5. start of attL;
6. end of attL;
7. start of attR;
8. end of attR;
9. sequence of attL;
10. sequence of attR;
11. The explanation of why this att site was chosen for this prophage.
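
The column layout above can be read with Python's standard `csv` module. This is a minimal sketch, not part of PhiSpy: the names in `COORDINATE_COLUMNS` are illustrative labels for the columns listed above, and the code assumes the file has no header row and that the att columns are simply absent when no att sites were found.

```python
import csv
import io

# Illustrative labels for the columns described above (not PhiSpy identifiers).
COORDINATE_COLUMNS = [
    "prophage_number", "contig", "start", "stop",
    "attL_start", "attL_end", "attR_start", "attR_end",
    "attL_sequence", "attR_sequence", "att_explanation",
]
NUMERIC_COLUMNS = ("start", "stop", "attL_start", "attL_end", "attR_start", "attR_end")


def read_prophage_coordinates(handle):
    """Return one dict per prophage from an open prophage_coordinates.tsv handle."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # zip() stops at the shorter sequence, so rows without att columns still work
        record = dict(zip(COORDINATE_COLUMNS, row))
        for key in NUMERIC_COLUMNS:
            value = record.get(key, "")
            if value.isdigit():
                record[key] = int(value)
        records.append(record)
    return records


# Made-up example row: a prophage with coordinates but no att sites.
example = io.StringIO("pp1\tcontig_1\t100\t2000\n")
for prophage in read_prophage_coordinates(example):
    print(prophage["contig"], prophage["start"], prophage["stop"])
```

With a real file, pass `open("prophage_coordinates.tsv", newline="")` in place of the `io.StringIO` example.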
- GenBank format output (code: 2)
We provide a duplicate GenBank record that is the same as the input record, but with the prophage information, including att sites, inserted into the record.
If the original GenBank file was provided in `gzip` format, this file will also be created in `gzip` format.
- prophage and bacterial sequences (code: 4)
`PhiSpy` can automatically separate the DNA sequences into prophage and bacterial components. If this output is chosen, we generate both fasta and GenBank format outputs:
  - GenBank files: Two files are made, one for the bacteria and one for the phages. Each contains the appropriate fragments of the genome annotated as in the original.
  - fasta files: Two files are made. The first contains the entire genome, but the prophage regions have been masked with `N`s. We explicitly chose this format for a few reasons: (i) it is trivial to convert this format into separate contigs without the `N`s, but it is more complex to go from separate contigs back to a single joined contig; (ii) when read mapping against the genome, understanding that reads map to either side of a prophage may be important; (iii) when looking at insertion points, this allows you to visualize where the prophage was lying.
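
As an illustration of point (i), a masked sequence can be split into separate fragments with a regular expression. This is a sketch under stated assumptions, not PhiSpy code: `split_masked_sequence` and its fragment naming scheme are hypothetical, and any `N`s already present in the original genome will also act as split points.

```python
import re


def split_masked_sequence(name, sequence, min_length=1):
    """Split a sequence on runs of N and return (fragment_id, fragment) tuples.

    The fragment id records the 1-based start and end of the fragment in the
    masked sequence, so the original coordinates are not lost.
    """
    fragments = []
    for match in re.finditer(r"[^Nn]+", sequence):
        if len(match.group()) >= min_length:
            fragment_id = f"{name}_{match.start() + 1}_{match.end()}"
            fragments.append((fragment_id, match.group()))
    return fragments


# Made-up example: a short contig with one masked region in the middle.
for fragment_id, fragment in split_masked_sequence("contig_1", "ACGTNNNNGGCC"):
    print(f">{fragment_id}\n{fragment}")
```

Keeping the coordinates in the fragment id is one way to preserve the information that makes the masked format useful in the first place.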
- prophage_information.tsv (code: 8)
This is a tab-separated file, and is the key file to assess prophages in genomes (see assessing predictions, below). The file contains all the genes of the genome, one per line. The tenth column represents the status of a gene. If this column is 0 then we consider this a bacterial gene. If it is non-zero it is probably a phage gene, and the higher the score the more likely we believe it is a phage gene. This is the raw data that we use to identify the prophages in your genome.
This file has 16 columns:

1. The id of each gene;
2. function: function of the gene (or `product` from a GenBank file);
3. contig;
4. start: start location of the gene;
5. stop: end location of the gene;
6. position: a sequential number of the gene (starting at 1);
7. rank: rank of each gene provided by random forest;
8. my_status: status of each gene based on random forest;
9. pp: classification of each gene based on their function;
10. Final_status: the status of each gene. For prophages, this column has the number of the prophage as listed in prophage.tbl below. If the column contains a 0 we believe that it is a bacterial gene. Otherwise we believe that it is possibly a phage gene.

If we can detect the att sites, the additional columns are:

11. start of attL;
12. end of attL;
13. start of attR;
14. end of attR;
15. sequence of attL;
16. sequence of attR;
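
The tenth column can be used to collect the genes assigned to each prophage. The following is a minimal sketch rather than anything PhiSpy provides: `genes_by_prophage` is a hypothetical helper, and it assumes the tenth column holds a non-negative integer (rows where it does not, such as a possible header line, are skipped).

```python
import csv
import io
from collections import defaultdict

FINAL_STATUS_INDEX = 9  # the tenth column, counted from zero


def genes_by_prophage(handle):
    """Map each non-zero final status to the ids of the genes that carry it."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) <= FINAL_STATUS_INDEX:
            continue  # too few columns to hold a final status
        status = row[FINAL_STATUS_INDEX].strip()
        if not status.isdigit():
            continue  # assumption: non-numeric values (e.g. a header) are skipped
        if int(status) != 0:
            groups[int(status)].append(row[0])
    return dict(groups)


# Made-up rows: one gene assigned to prophage 1 and one bacterial gene.
example = io.StringIO(
    "gene1\tintegrase\tcontig_1\t1\t900\t1\t0.9\t1\t1\t1\n"
    "gene2\tDNA polymerase\tcontig_1\t1000\t2000\t2\t0.1\t0\t0\t0\n"
)
print(genes_by_prophage(example))
```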
- prophage.tsv (code: 16)
This is a simpler version of the prophage_coordinates.tsv file that only has prophage number, contig, start, and stop.
- GFF3 format (code: 32)
This is the prophage information suitable for insertion into a GFF3 file. This is a legacy file format; because GFF3 is no longer widely supported, this file only has the prophage coordinates. Please post an issue on GitHub if more complete GFF3 files are required.
- prophage.tbl (code: 64)
This file has two columns separated by tabs [prophage_number, location]. This is also a legacy file that is not generated by default. The prophage number is a sequential number of the prophage (starting at 1), and the location, in the format contig_start_stop, encompasses the prophage.
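
Because contig names can themselves contain underscores, the location is safest to split from the right. A minimal sketch, assuming the last two underscore-separated fields are always the start and stop; `parse_location` is a hypothetical helper, not a PhiSpy function.

```python
def parse_location(location):
    """Split a contig_start_stop string into (contig, start, stop)."""
    # rsplit with maxsplit=2 keeps any underscores inside the contig name intact
    contig, start, stop = location.rsplit("_", 2)
    return contig, int(start), int(stop)


# Made-up example using an accession-style contig name that contains an underscore.
print(parse_location("NC_000913_100_2000"))
```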
- test data (code: 128)
This file has the data used in the random forest. The columns are:
- Identifier
- Median ORF length
- Shannon slope
- Adjusted AT skew
- Adjusted GC skew
- The maximum number of ORFs in the same direction
- PHMM matches
- Status
The numbers are averaged across a window of the size specified by `--window_size`.
We have provided the option (`--output_choice`) to choose which output files are created. Each file above has a code associated with it, and to include that file add up the codes:
Code | File |
---|---|
1 | prophage_coordinates.tsv |
2 | GenBank format output |
4 | prophage and bacterial sequences |
8 | prophage_information.tsv |
16 | prophage.tsv |
32 | GFF3 format output of just the prophages |
64 | prophage.tbl |
128 | test data used in the random forest |
256 | GFF3 format output for the annotated genomic contigs |
So for example, if you want to get `GenBank format output` (2) and `prophage_information.tsv` (8), then enter an `--output_choice` of 10.
The default is 3: you will get both the `prophage_coordinates.tsv` and `GenBank format output` files.
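
Since each code is a distinct power of two, adding codes is the same as setting bits, and a given total can be decoded unambiguously. The sketch below only illustrates that arithmetic for the codes in the table above; `encode_choice` and `decode_choice` are hypothetical helpers, not part of PhiSpy, and they do not model any value outside the table.

```python
# Codes and file descriptions copied from the table above.
OUTPUT_CODES = {
    1: "prophage_coordinates.tsv",
    2: "GenBank format output",
    4: "prophage and bacterial sequences",
    8: "prophage_information.tsv",
    16: "prophage.tsv",
    32: "GFF3 format output of just the prophages",
    64: "prophage.tbl",
    128: "test data used in the random forest",
    256: "GFF3 format output for the annotated genomic contigs",
}


def encode_choice(*codes):
    """Combine individual codes into a single --output_choice value."""
    total = 0
    for code in codes:
        if code not in OUTPUT_CODES:
            raise ValueError(f"unknown output code: {code}")
        total |= code  # same as adding, but repeating a code has no effect
    return total


def decode_choice(choice):
    """List the outputs from the table that a given value selects."""
    return [name for code, name in sorted(OUTPUT_CODES.items()) if choice & code]


print(encode_choice(2, 8))
print(decode_choice(3))
```

Here `encode_choice(2, 8)` gives 10 and `decode_choice(3)` lists the two default outputs, matching the examples in the text.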
Note: Choice `32` will only output the prophages themselves in GFF3 format. In contrast, choice `256` outputs annotated genomes. This is probably the best choice to bring the genome into Artemis as it will handle multiple contigs correctly.
If you want all files output, use `--output_choice 512`.