Merge pull request #201 from gbouras13/v1.0.0
V1.0.0
gbouras13 authored Sep 15, 2022
2 parents 94b67c1 + 959430e commit 0fe0540
Showing 34 changed files with 2,557 additions and 354,874 deletions.
19 changes: 19 additions & 0 deletions HISTORY.md
@@ -1,6 +1,25 @@
History
=======

1.0.0 (2022-09-16)
------------------

* Removes errors (with post_processing functions not being parsed as strings) to improve robustness.
* Codebase more reliable and consistent
* Overhaul of install_databases.py
* Adds pre-existing Pharokka Database available at https://zenodo.org/record/7081772

0.1.11 (2022-09-13)
------------------

* Adds CARD and VFDB databases.

0.1.10 (2022-08-31)
------------------

* Fixes issues with Genbank output files.


0.1.9 (2022-08-04)
------------------

77 changes: 46 additions & 31 deletions README.md
@@ -9,8 +9,22 @@ pharokka is designed for rapid standardised annotation of bacteriophages.

If you are looking for rapid standardised annotation of prokaryotes, please use prokka (https://github.com/tseemann/prokka), which inspired the creation of pharokka.

Method
----
Table of Contents
-----------
- [pharokka](#pharokka)
- [Fast Phage Annotation Program](#fast-phage-annotation-program)
- [Table of Contents](#table-of-contents)
- [Method](#method)
- [Installation](#installation)
- [Beginner Conda Installation](#beginner-conda-installation)
- [Usage](#usage)
- [Version Log](#version-log)
- [System](#system)
- [Time](#time)
- [Bugs and Suggestions](#bugs-and-suggestions)
- [Citation](#citation)

# Method

![pharokka workflow](img/pharokka_workflow.png?raw=true "Pharokka Workflow")

@@ -22,13 +36,13 @@ The other important output is `cds_functions.tsv`, which includes counts of CDSs

For full documentation, please visit https://pharokka.readthedocs.io.

Usage
------
# Installation

**pharokka v0.1.11 is now available on bioconda**

* v0.1.11 adds VFDB and CARD databases for virulence factor and AMR gene identification.
* These should install using the install_databases.py script. If this does not work, the additional databases can be found in the databases directory in this repository. These can then be copied into your desired database directory. See the Installation Section for more details.
* v0.1.11 adds VFDB (current as of 15-09-22) and CARD (v3.2.4) databases for virulence factor and AMR gene identification.
* These should install using the install_databases.py script.
* If this does not work, you can alternatively download the databases from Zenodo at https://zenodo.org/record/7080544/files/pharokka_v0.1.11_databases.zip and unzip the directory in a location of your choice. Please see the Installation Section for more details.

The easiest way to install pharokka is via conda.

@@ -60,8 +74,7 @@ install_databases.py -h
pharokka.py -h
```

Beginner Conda Installation
--------
# Beginner Conda Installation

If you are new to using the command-line, please install conda using the following instructions.

@@ -97,8 +110,7 @@ mamba create -n pharokkaENV pharokka
conda activate pharokkaENV
```

Running pharokka
--------
# Usage

First the PHROGs databases need to be installed

@@ -108,18 +120,19 @@ If you would like to specify a different database directory (recommended), that

`install_databases.py -o <path/to/database_dir>`

If you have trouble downloading the databases using `install_databases.py`, they can be manually downloaded from the PHROGs website links, untared and placed in a directory of your choice:
* https://phrogs.lmge.uca.fr/downloads_from_website/phrogs_mmseqs_db.tar.gz
* https://phrogs.lmge.uca.fr/downloads_from_website/phrog_annot_v4.tsv.
Version 0.1.11 adds VFDB and CARD databases for virulence factor and AMR gene identification. These should install using the install_databases.py script as outlined above. You will need to run this before running pharokka v0.1.11.

Version 0.1.11 adds VFDB and CARD databases for virulence factor and AMR gene identification. These should install using the install_databases.py script as outlined above. If this does not work, the additional databases can be found in the databases directory in this github repository. These can then be copied into your desired database directory as follows:
If this does not work, you can alternatively download the databases from Zenodo at https://zenodo.org/record/7080544/files/pharokka_v0.1.11_databases.zip and unzip the directory in a location of your choice.

If you prefer to use the command line:

```
git clone "https://github.com/gbouras13/pharokka.git"
cd pharokka
cp -r databases/* <path/to/databse_dir>
wget "https://zenodo.org/record/7080544/files/pharokka_v0.1.11_databases.zip"
unzip pharokka_v0.1.11_databases.zip
```

which will create a directory called "pharokka_v0.1.11_databases" containing the databases.

Once the databases have finished downloading, to run pharokka

`pharokka.py -i <fasta file> -o <output folder> -t <threads>`
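
For scripted runs, the same invocation can be assembled from Python. This is a minimal sketch, assuming `pharokka.py` is on your `PATH`; the helper names `build_pharokka_command` and `run_pharokka` are hypothetical and not part of pharokka. Only the `-i`, `-o` and `-t` flags shown above are used.

```python
import subprocess
from typing import List


def build_pharokka_command(fasta: str, out_dir: str, threads: int = 1) -> List[str]:
    """Assemble the argument list for the pharokka.py call shown above.

    Hypothetical helper: threads defaults to 1, matching pharokka's default.
    """
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return ["pharokka.py", "-i", fasta, "-o", out_dir, "-t", str(threads)]


def run_pharokka(fasta: str, out_dir: str, threads: int = 1) -> None:
    """Run pharokka; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_pharokka_command(fasta, out_dir, threads), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.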
@@ -150,29 +163,31 @@ In v0.1.7, the ability to specify an E-value threshold for CDS functional assign

pharokka defaults to 1 thread.

Version Log
--------
# Version Log

A brief description of what is new in each update of pharokka can be found in the HISTORY.md file.

System
------
# System

pharokka has been tested on Linux and MacOS (M1 and Intel).

Time
--------
# Time

On a standard 16GB RAM laptop specifying 8 threads, pharokka should take between 3 and 10 minutes to run for a single phage, depending on the genome size.

Bugs and Suggestions
--------
# Bugs and Suggestions

If you come across bugs with pharokka, or would like to make any suggestions to improve the program, please open an issue or email [email protected]

Citation
--------
# Citation

If you use pharokka, please also cite:

* McNair K., Zhou C., Dinsdale E.A., Souza B., Edwards R.A. (2019) "PHANOTATE: a novel approach to gene identification in phage genomes", Bioinformatics, https://doi.org/10.1093/bioinformatics/btz26.
* Chan, P.P., Lin, B.Y., Mak, A.J. and Lowe, T.M. (2021) "tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes", Nucleic Acids Res., https://doi.org/10.1093/nar/gkab688.
* Steinegger M. and Soeding J. (2017), "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets", Nature Biotechnology (https://doi.org/10.1038/nbt.3988).
* Terzian P., Olo Ndela E., Galiez C., Lossouarn J., Pérez Bucio R.E., Mom R., Toussaint A., Petit M.A., Enault F., "PHROG : families of prokaryotic virus proteins clustered using remote homology", NAR Genomics and Bioinformatics, (2021), (https://doi.org/10.1093/nargab/lqab067).
* Bland C., Ramsey L., Sabree F., Lowe M., Brown K., Kyrpides N.C., Hugenholtz P. , "CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats", BMC Bioinformatics, (2007), (https://doi.org/10.1186/1471-2105-8-209).
* Laslett D., Canback B., "ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences.", Nucleic Acids Res, (2004), (https://doi.org/10.1093/nar/gkh152).
* Steinegger M. and Soeding J. (2017), "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets", Nature Biotechnology https://doi.org/10.1038/nbt.3988.
* Terzian P., Olo Ndela E., Galiez C., Lossouarn J., Pérez Bucio R.E., Mom R., Toussaint A., Petit M.A., Enault F., "PHROG : families of prokaryotic virus proteins clustered using remote homology", NAR Genomics and Bioinformatics, (2021), https://doi.org/10.1093/nargab/lqab067.
* Bland C., Ramsey L., Sabree F., Lowe M., Brown K., Kyrpides N.C., Hugenholtz P. , "CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats", BMC Bioinformatics, (2007), https://doi.org/10.1186/1471-2105-8-209.
* Laslett D., Canback B., "ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences.", Nucleic Acids Research (2004) https://doi.org/10.1093/nar/gkh152.
* Chen L., Yang J., Yao Z., Sun L., Shen Y., Jin Q., "VFDB: a reference database for bacterial virulence factors", Nucleic Acids Research (2005) https://doi.org/10.1093/nar/gki008.
* Alcock et al, "CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database." Nucleic Acids Research (2020) https://doi.org/10.1093/nar/gkz935.
5 changes: 0 additions & 5 deletions accessory/__init__.py

This file was deleted.

168 changes: 89 additions & 79 deletions bin/databases.py
@@ -3,92 +3,102 @@
import sys
import subprocess as sp

PHROG_DB_NAMES = ['phrogs_db','phrogs_db.dbtype',
'phrogs_db.index',
'phrogs_profile_db',
'phrogs_profile_db.dbtype',
'phrogs_profile_db.index',
'phrogs_profile_db_consensus',
'phrogs_profile_db_consensus.dbtype',
'phrogs_profile_db_consensus.index',
'phrogs_profile_db_h',
'phrogs_profile_db_h.index',
'phrogs_profile_db_seq',
'phrogs_profile_db_seq.dbtype',
'phrogs_profile_db_seq.index',
'phrogs_profile_db_seq_h',
'phrogs_profile_db_seq_h.index']

VFDB_DB_NAMES = ['VFDB_setB_pro.fas',
'vfdb',
'vfdb.dbtype',
'vfdb.index',
'vfdb.lookup',
'vfdb.source',
'vfdb_h',
'vfdb_h.dbtype',
'vfdb_h.index']

CARD_DB_NAMES = [
'CARD',
'CARD.dbtype',
'CARD.index',
'CARD.lookup',
'CARD.source',
'CARD_h',
'CARD_h.dbtype',
'CARD_h.index']

def instantiate_install(db_dir):
instantiate_dir(db_dir)
get_phrog_mmseqs(db_dir)
get_phrog_annot_table(db_dir)
get_vfdb(db_dir)
get_card(db_dir)

downloaded_flag = check_db_installation(db_dir)
if downloaded_flag == True:
print("All Databases have already been Downloaded and Checked")
else:
get_database_zenodo(db_dir)

def instantiate_dir(db_dir):
if os.path.isdir(db_dir) == False:
os.mkdir(db_dir)

def get_phrog_mmseqs(db_dir):
print("Getting PHROGs MMSeqs DB")
filepath = "https://phrogs.lmge.uca.fr/downloads_from_website/phrogs_mmseqs_db.tar.gz"
tarball = "phrogs_mmseqs_db.tar.gz"
folder = "phrogs_mmseqs_db"

# get tarball if not already present
if os.path.isfile(os.path.join(db_dir,tarball)) == True:
print("PHROGs Database already downloaded")
# download tarball and untar
else:
try:
sp.call(["curl", filepath, "-o", os.path.join(db_dir,tarball)])
except:
sys.stderr.write("Error: PHROGs MMSeqs Database not found - link likely broken\n")
return 0
def check_db_installation(db_dir):

# delete folder if it exists already
if os.path.isfile(os.path.join(db_dir,folder)) == True:
sp.call(["rm", os.path.join(db_dir,folder)])
downloaded_flag = True
# PHROGS files
for file_name in PHROG_DB_NAMES:
path = os.path.join(db_dir, file_name)
if os.path.isfile(path) == False:
print("PHROGs Databases are missing. Pharokka Database Will be Downloaded")
downloaded_flag = False
break
# VFDB
for file_name in VFDB_DB_NAMES:
path = os.path.join(db_dir, file_name)
if os.path.isfile(path) == False:
print("VFDB Databases are missing. Pharokka Database Will be Downloaded")
downloaded_flag = False
break
# CARD
for file_name in CARD_DB_NAMES:
path = os.path.join(db_dir, file_name)
if os.path.isfile(path) == False:
print("CARD Databases are missing. Pharokka Database Will be Downloaded")
downloaded_flag = False
break
# annot.tsv
path = os.path.join(db_dir,'phrog_annot_v4.tsv')
if os.path.isfile(path) == False:
print("PHROGs Annotation File Needs to be Downloaded")
downloaded_flag = False

return downloaded_flag
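
The check above can be read as "every expected file must exist in `db_dir`". A compact sketch of that idea follows. It assumes a shortened, hypothetical `REQUIRED_FILES` mapping (the real lists in this diff are longer), and the helpers `missing_databases` and `databases_installed` are illustrative names, not functions from pharokka.

```python
import os

# Hypothetical, shortened file lists for illustration only;
# the PHROG/VFDB/CARD lists in the diff contain more entries.
REQUIRED_FILES = {
    "PHROGs": ["phrogs_db", "phrogs_db.index"],
    "VFDB": ["vfdb", "vfdb.index"],
    "CARD": ["CARD", "CARD.index"],
    "PHROGs annotation": ["phrog_annot_v4.tsv"],
}


def missing_databases(db_dir, required=REQUIRED_FILES):
    """Return {group: [missing file names]} for every group with a gap."""
    missing = {}
    for group, names in required.items():
        absent = [n for n in names if not os.path.isfile(os.path.join(db_dir, n))]
        if absent:
            missing[group] = absent
    return missing


def databases_installed(db_dir, required=REQUIRED_FILES):
    """True only when no required file is missing."""
    return not missing_databases(db_dir, required)
```

Returning the missing names, rather than a single flag, makes it easier to report exactly which database group triggered a re-download.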

# download untar -C for specifying the directory
sp.call(["tar", "-xzf", os.path.join(db_dir, tarball), "-C", db_dir])


def get_phrog_annot_table(db_dir):
print("Getting PHROGs Annotation Table")
filepath = "https://phrogs.lmge.uca.fr/downloads_from_website/phrog_annot_v4.tsv"
file = "phrog_annot_v4.tsv"
#if the file already exists
if os.path.isfile(os.path.join(db_dir,file)) == True:
print("PHROGs annotation file already downloaded")
else:
try:
sp.call(["curl", filepath, "-o", os.path.join(db_dir,file)])
except:
sys.stderr.write("Error: PHROGs annotation file not found - link likely broken\n")
return 0

def get_vfdb(db_dir):
print("Getting VFDB Database")
filepath = "http://www.mgc.ac.cn/VFs/Down/VFDB_setB_pro.fas.gz"
file = "VFDB_setB_pro.fas.gz"
#if the file already exists
if os.path.isfile(os.path.join(db_dir,"vfdb", "vfdb")) == True:
print("VFDB already downloaded")
else:
try:
instantiate_dir(os.path.join(db_dir, "vfdb"))
sp.call(["curl", filepath, "-o", os.path.join(db_dir,"vfdb",file)])
sp.Popen(["gunzip", os.path.join(db_dir,"vfdb", file)], stdout=sp.PIPE)
sp.call(["mmseqs", "createdb", os.path.join(db_dir, "vfdb", "VFDB_setB_pro.fas"), os.path.join(db_dir, "vfdb", "vfdb")])
except:
sys.stderr.write("Error: VFDB not found - link likely broken\n")
return 0

def get_card(db_dir):
print("Getting CARD Database")
filepath = "https://card.mcmaster.ca/download/0/broadstreet-v3.2.4.tar.bz2"
file = "card.tar.bz2"
#if the file already exists
if os.path.isfile( os.path.join(db_dir, "CARD_mmseqs", "CARD")) == True:
print("CARD already downloaded")
else:
try:
# make the CARD dir
instantiate_dir(os.path.join(db_dir, "CARD"))
instantiate_dir(os.path.join(db_dir, "CARD_mmseqs"))
# download the database
sp.call(["curl", filepath, "-o", os.path.join(db_dir,"CARD",file)])
# untar
sp.call(["tar", "-xf", os.path.join(db_dir,"CARD",file), "-C",os.path.join(db_dir,"CARD") ])
# create mmseqs db
sp.call(["mmseqs", "createdb", os.path.join(db_dir, "CARD", "protein_fasta_protein_homolog_model.fasta"), os.path.join(db_dir, "CARD_mmseqs", "CARD")])
except:
sys.stderr.write("Error: CARD not found - link likely broken\n")
return 0
def get_database_zenodo(db_dir):
print("Downloading Pharokka Database")
tarball = 'pharokka_v_1.0.0_databases.tar.gz'
url = "https://zenodo.org/record/7081772/files/pharokka_database_v1.0.0.tar.gz"
try:
# remove the directory
sp.call(["rm", "-rf", os.path.join(db_dir)])
# make db dir
sp.call(["mkdir", "-p", os.path.join(db_dir)])
# download the tarball
sp.call(["curl", url, "-o", os.path.join(db_dir,tarball)])
# untar tarball into database directory
sp.call(["tar", "-xzf", os.path.join(db_dir, tarball), "-C", db_dir, "--strip-components=1"])
# remove tarball
sp.call(["rm","-f", os.path.join(db_dir,tarball)])
except:
sys.stderr.write("Error: Pharokka Database Install Failed. \n Please try again or use the manual option detailed at https://github.com/gbouras13/pharokka.git \n downloading from https://zenodo.org/record/7081772/files/pharokka_database_v1.0.0.tar.gz")
return 0
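
The `tar -xzf ... --strip-components=1` step can also be done with the Python standard library, which avoids depending on an external `tar` binary. The following is a sketch under stated assumptions, not the code in this commit: `extract_stripping_top_level` is a hypothetical helper, it handles only regular files and directories, and it rejects members that would land outside the destination.

```python
import os
import tarfile


def extract_stripping_top_level(tarball_path, dest_dir):
    """Extract a .tar.gz into dest_dir, dropping the first path component.

    Roughly mirrors `tar -xzf <tarball> -C <dest_dir> --strip-components=1`.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(tarball_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Skip links and special files; this sketch only unpacks files and dirs.
            if not (member.isfile() or member.isdir()):
                continue
            parts = member.name.split("/", 1)
            if len(parts) < 2 or not parts[1]:
                continue  # the top-level directory entry itself
            member.name = parts[1]
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
            tar.extract(member, dest_root)
```

Note that `sp.call` does not raise when the external command exits with an error, so a `try`/`except` around it only catches cases such as a missing executable; a pure-Python extraction raises on a corrupt archive.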
3 changes: 2 additions & 1 deletion bin/input_commands.py
@@ -23,7 +23,7 @@ def get_input():
parser.add_argument('-l', '--locustag', action="store", help='User specified locus tag for the gff/gbk files. This is not required. A random locus tag will be generated instead.', default='Default')
parser.add_argument('-g', '--gene_predictor', action="store", help='User specified gene predictor. Use "-g phanotate" or "-g prodigal". Defaults to phanotate (not required unless prodigal is desired).', default='phanotate' )
parser.add_argument('-m', '--meta', help='Metagenomic option for Prodigal', action="store_true")
parser.add_argument('-c', '--coding_table', help='translation table for prodigal', action="store", default = "11")
parser.add_argument('-c', '--coding_table', help='translation table for prodigal. Defaults to 11. Experimental only.', action="store", default = "11")
parser.add_argument('-e', '--evalue', help='E-value threshold for mmseqs2. Defaults to 1E-05', action="store", default = "1E-05")
parser.add_argument('-V', '--version', help='Version', action='version', version=v)
args = parser.parse_args()
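
As an illustration of how these options behave, here is a small self-contained parser covering only `-g`, `-c` and `-e`. It is a sketch, not pharokka's parser: `build_parser` is a hypothetical name, and the `choices=` restriction and `type=float` conversion are assumptions added here (the code above stores the E-value as a string).

```python
import argparse


def build_parser():
    """Build a reduced parser with three of the options shown above."""
    parser = argparse.ArgumentParser(description="sketch of a few pharokka-style options")
    # choices= is an addition in this sketch; it rejects unknown predictors early.
    parser.add_argument("-g", "--gene_predictor", action="store",
                        choices=["phanotate", "prodigal"], default="phanotate",
                        help="gene predictor, defaults to phanotate")
    parser.add_argument("-c", "--coding_table", action="store", default="11",
                        help="translation table for prodigal, defaults to 11")
    # type=float is an addition in this sketch; it fails fast on non-numeric input.
    parser.add_argument("-e", "--evalue", action="store", type=float, default=1e-05,
                        help="E-value threshold for mmseqs2, defaults to 1E-05")
    return parser
```

For example, `build_parser().parse_args(["-g", "prodigal", "-e", "1E-10"])` yields `gene_predictor="prodigal"` and a float `evalue`, while the coding table keeps its default of `"11"`.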
@@ -60,6 +60,7 @@ def instantiate_dirs(output_dir, force):
def validate_fasta(filename):
with open(filename, "r") as handle:
fasta = SeqIO.parse(handle, "fasta")
print("Checking Input FASTA")
if any(fasta):
print("FASTA checked")
else:
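
The idea behind the FASTA check, "the file must contain at least one record", can be sketched without Biopython. This is a swapped-in, stdlib-only approximation and not equivalent to `SeqIO.parse`: the hypothetical `looks_like_fasta` below is stricter, requiring a `>` header as the first non-blank line and at least one sequence line after it.

```python
def looks_like_fasta(path):
    """Return True if the file starts with a '>' header followed by sequence."""
    saw_header = False
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                saw_header = True
            elif saw_header:
                return True  # first sequence line after a header
            else:
                return False  # text before any header
    return False  # empty file, or headers with no sequence
```

Because the function returns at the first sequence line, it does not read the whole genome just to confirm the format.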