Merge pull request #201 from gbouras13/v1.0.0

V1.0.0
gbouras13 · Sep 15, 2022 · 0fe0540
2 parents 94b67c1 + 959430e
commit 0fe0540

Showing 34 changed files with 2,557 additions and 354,874 deletions.
diff --git a/HISTORY.md b/HISTORY.md
@@ -1,6 +1,25 @@
 History
 =======
 
+1.0.0 (2022-09-16)
+------------------
+
+* Removes errors (with post_processing functions not being parsed as strings) to improve robustness.
+* Codebase more reliable and consistent
+* Overhaul of install_databases.py
+* Adds pre-existing Pharokka Database available at https://zenodo.org/record/7081772
+
+0.1.11 (2022-09-13)
+------------------
+
+*  Adds CARD and VFDB databases.
+
+0.1.10 (2022-08-31)
+------------------
+
+* Fixes issues with Genbank output files.
+
+
 0.1.9 (2022-08-04)
 ------------------
 

diff --git a/README.md b/README.md
@@ -9,8 +9,22 @@ pharokka is designed for rapid standardised annotation of bacteriophages.
 
 If you are looking for rapid standardised annotation of prokaryotes, please use prokka (https://github.com/tseemann/prokka), which inspired the creation of pharokka.
 
-Method
-----
+Table of Contents
+-----------
+- [pharokka](#pharokka)
+  - [Fast Phage Annotation Program](#fast-phage-annotation-program)
+  - [Table of Contents](#table-of-contents)
+- [Method](#method)
+- [Installation](#installation)
+- [Beginner Conda Installation](#beginner-conda-installation)
+- [Usage](#usage)
+- [Version Log](#version-log)
+- [System](#system)
+- [Time](#time)
+- [Bugs and Suggestions](#bugs-and-suggestions)
+- [Citation](#citation)
+
+# Method
 
 ![pharokka workflow](img/pharokka_workflow.png?raw=true "Pharokka Workflow")
 
@@ -22,13 +36,13 @@ The other important output is `cds_functions.tsv`, which includes counts of CDSs
 
 For full documentation, please visit https://pharokka.readthedocs.io.
 
-Usage
-------
+# Installation
 
 **pharokka v0.1.11 is now available on bioconda**
 
-* v0.1.11 adds VFDB and CARD databases for virulence factor and AMR gene identification. 
-* These should install using the install_databases.py script. If this does not work, the additional databases can be found in the databases directory in this repository. These can then be copied into your desired database directory. See the Installation Section for more details.
+* v0.1.11 adds VFDB (current as of 15-09-22) and CARD (v3.2.4) databases for virulence factor and AMR gene identification.
+* These should install using the install_databases.py script.
+* If this does not work, you can alternatively download the databases from Zenodo at https://zenodo.org/record/7080544/files/pharokka_v0.1.11_databases.zip and unzip the directory in a location of your choice. Please see the Installation Section for more details.
 
 The easiest way to install pharokka is via conda.
 
@@ -60,8 +74,7 @@ install_databases.py -h
 pharokka.py -h
 ```
 
-Beginner Conda Installation
---------
+# Beginner Conda Installation
 
 If you are new to using the command-line, please install conda using the following instructions.
 
@@ -97,8 +110,7 @@ mamba create -n pharokkaENV pharokka
 conda activate pharokkaENV
 ```
 
-Running pharokka
---------
+# Usage
 
 First the PHROGs databases need to be installed
 
@@ -108,18 +120,19 @@ If you would like to specify a different database directory (recommended), that
 
 `install_databases.py -o <path/to/database_dir>`
 
-If you have trouble downloading the databases using `install_databases.py`, they can be manually downloaded from the PHROGs website links, untared and placed in a directory of your choice:
-* https://phrogs.lmge.uca.fr/downloads_from_website/phrogs_mmseqs_db.tar.gz
-* https://phrogs.lmge.uca.fr/downloads_from_website/phrog_annot_v4.tsv.
+Version 0.1.11 adds VFDB and CARD databases for virulence factor and AMR gene identification. These should install using the install_databases.py script as outlined above. You will need to run this before running pharokka v0.1.11.
 
-Version 0.1.11 adds VFDB and CARD databases for virulence factor and AMR gene identification. These should install using the install_databases.py script as outlined above. If this does not work, the additional databases can be found in the databases directory in this github repository. These can then be copied into your desired database directory as follows:
+If this does not work, you can alternatively download the databases from Zenodo at https://zenodo.org/record/7080544/files/pharokka_v0.1.11_databases.zip and unzip the directory in a location of your choice.
+
+If you prefer to use the command line:
 
 ```
-git clone "https://github.com/gbouras13/pharokka.git"
-cd pharokka
-cp -r databases/* <path/to/databse_dir>
+wget "https://zenodo.org/record/7080544/files/pharokka_v0.1.11_databases.zip"
+unzip pharokka_v0.1.11_databases.zip
 ```
 
+which will create a directory called "pharokka_v0.1.11_databases" containing the databases.
+
 Once the databases have finished downloading, to run pharokka
 
 `pharokka.py -i <fasta file> -o <output folder> -t <threads>`
@@ -150,29 +163,31 @@ In v0.1.7, the ability to specify an E-value threshold for CDS functional assign
 
 pharokka defaults to 1 thread.
 
-Version Log
---------
+# Version Log
+
 A brief description of what is new in each update of pharokka can be found in the HISTORY.md file.
 
-System
-------
+# System
+
 pharokka has been tested on Linux and MacOS (M1 and Intel).
 
-Time
---------
+# Time
+
 On a standard 16GB RAM laptop specifying 8 threads, pharokka should take between 3-10 minutes to run for a single phage, depending on the genome size.
 
-Bugs and Suggestions
---------
+# Bugs and Suggestions
+
 If you come across bugs with pharokka, or would like to make any suggestions to improve the program, please open an issue or email [email protected]
 
-Citation
---------
+# Citation
+
 If you use pharokka, please also cite:
 
 * McNair K., Zhou C., Dinsdale E.A., Souza B., Edwards R.A. (2019) "PHANOTATE: a novel approach to gene identification in phage genomes", Bioinformatics, https://doi.org/10.1093/bioinformatics/btz265.
 * Chan, P.P., Lin, B.Y., Mak, A.J. and Lowe, T.M. (2021) "tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes", Nucleic Acids Res., https://doi.org/10.1093/nar/gkab688.
-* Steinegger M. and Soeding J. (2017), "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets", Nature Biotechnology (https://doi.org/10.1038/nbt.3988).
-* Terzian P., Olo Ndela E., Galiez C., Lossouarn J., Pérez Bucio R.E., Mom R., Toussaint A., Petit M.A., Enault F., "PHROG : families of prokaryotic virus proteins clustered using remote homology", NAR Genomics and Bioinformatics, (2021), (https://doi.org/10.1093/nargab/lqab067).
-* Bland C., Ramsey L., Sabree F., Lowe M., Brown K., Kyrpides N.C., Hugenholtz P. , "CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats", BMC Bioinformatics, (2007), (https://doi.org/10.1186/1471-2105-8-209).
-* Laslett D., Canback B., "ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences.", Nucleic Acids Res, (2004), (https://doi.org/10.1093/nar/gkh152).
+* Steinegger M. and Soeding J. (2017), "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets", Nature Biotechnology https://doi.org/10.1038/nbt.3988.
+* Terzian P., Olo Ndela E., Galiez C., Lossouarn J., Pérez Bucio R.E., Mom R., Toussaint A., Petit M.A., Enault F., "PHROG : families of prokaryotic virus proteins clustered using remote homology", NAR Genomics and Bioinformatics, (2021), https://doi.org/10.1093/nargab/lqab067.
+* Bland C., Ramsey L., Sabree F., Lowe M., Brown K., Kyrpides N.C., Hugenholtz P. , "CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats", BMC Bioinformatics, (2007), https://doi.org/10.1186/1471-2105-8-209.
+* Laslett D., Canback B., "ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences.", Nucleic Acids Research (2004) https://doi.org/10.1093/nar/gkh152.
+* Chen L., Yang J., Yao Z., Sun L., Shen Y., Jin Q., "VFDB: a reference database for bacterial virulence factors", Nucleic Acids Research (2005) https://doi.org/10.1093/nar/gki008.
+* Alcock et al, "CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database." Nucleic Acids Research (2020) https://doi.org/10.1093/nar/gkz935.
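
Review note on the README's manual fallback: the `wget` and `unzip` steps shown in the Usage hunk can also be done from Python using only the standard library. The sketch below is illustrative and not part of pharokka. `unpack_databases` and `fetch_and_unpack` are hypothetical helper names, and the URL is the Zenodo link quoted in the README.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List

# URL quoted in the README for the v0.1.11 databases.
ZENODO_ZIP = "https://zenodo.org/record/7080544/files/pharokka_v0.1.11_databases.zip"


def unpack_databases(archive: Path, dest: Path) -> List[str]:
    """Extract a database zip into dest and return the sorted member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())


def fetch_and_unpack(url: str, dest: Path) -> List[str]:
    """Download the zip at url into dest, then extract it there (hypothetical helper)."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive))
    return unpack_databases(archive, dest)
```

Calling `fetch_and_unpack(ZENODO_ZIP, Path("databases"))` would mirror the two shell commands. The extracted directory can then be passed to pharokka as the database directory.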
diff --git a/accessory/__init__.py b/accessory/__init__.py
diff --git a/bin/databases.py b/bin/databases.py
@@ -3,92 +3,102 @@
 import sys
 import subprocess as sp
 
+PHROG_DB_NAMES = ['phrogs_db','phrogs_db.dbtype',
+'phrogs_db.index',
+'phrogs_profile_db',
+'phrogs_profile_db.dbtype',
+'phrogs_profile_db.index',
+'phrogs_profile_db_consensus',
+'phrogs_profile_db_consensus.dbtype',
+'phrogs_profile_db_consensus.index',
+'phrogs_profile_db_h',
+'phrogs_profile_db_h.index',
+'phrogs_profile_db_seq',
+'phrogs_profile_db_seq.dbtype',
+'phrogs_profile_db_seq.index',
+'phrogs_profile_db_seq_h',
+'phrogs_profile_db_seq_h.index']
+
+VFDB_DB_NAMES = ['VFDB_setB_pro.fas',
+'vfdb',
+'vfdb.dbtype',
+'vfdb.index',
+'vfdb.lookup',
+'vfdb.source',
+'vfdb_h',
+'vfdb_h.dbtype',
+'vfdb_h.index']
+
+CARD_DB_NAMES = [
+'CARD',
+'CARD.dbtype',
+'CARD.index',
+'CARD.lookup',
+'CARD.source',
+'CARD_h',
+'CARD_h.dbtype',
+'CARD_h.index']
+
 def instantiate_install(db_dir):
     instantiate_dir(db_dir)
-    get_phrog_mmseqs(db_dir)
-    get_phrog_annot_table(db_dir)
-    get_vfdb(db_dir)
-    get_card(db_dir)
-
+    downloaded_flag = check_db_installation(db_dir)
+    if downloaded_flag == True:
+        print("All Databases have already been Downloaded and Checked")
+    else:
+        get_database_zenodo(db_dir)
 
 def instantiate_dir(db_dir):
     if os.path.isdir(db_dir) == False:
         os.mkdir(db_dir)
 
-def get_phrog_mmseqs(db_dir):
-    print("Getting PHROGs MMSeqs DB")
-    filepath = "https://phrogs.lmge.uca.fr/downloads_from_website/phrogs_mmseqs_db.tar.gz"
-    tarball = "phrogs_mmseqs_db.tar.gz"
-    folder = "phrogs_mmseqs_db"
-
-    # get tarball if not already present
-    if os.path.isfile(os.path.join(db_dir,tarball)) == True: 
-         print("PHROGs Database already downloaded")
-        # download tarball and untar
-    else:
-        try:
-            sp.call(["curl", filepath, "-o", os.path.join(db_dir,tarball)])
-        except:
-            sys.stderr.write("Error: PHROGs MMSeqs Database not found - link likely broken\n")  
-            return 0
+def check_db_installation(db_dir):
 
-    # delete folder if it exists already
-    if os.path.isfile(os.path.join(db_dir,folder)) == True:
-        sp.call(["rm", os.path.join(db_dir,folder)])
+    downloaded_flag = True
+    # PHROGS files
+    for file_name in PHROG_DB_NAMES:
+        path = os.path.join(db_dir, file_name)
+        if os.path.isfile(path) == False:
+            print("PHROGs Databases are missing. Pharokka Database Will be Downloaded")
+            downloaded_flag = False
+            break
+    # VFDB
+    for file_name in VFDB_DB_NAMES:
+        path = os.path.join(db_dir, file_name)
+        if os.path.isfile(path) == False:
+            print("VFDB Databases are missing. Pharokka Database Will be Downloaded")
+            downloaded_flag = False
+            break
+    # CARD
+    for file_name in CARD_DB_NAMES:
+        path = os.path.join(db_dir, file_name)
+        if os.path.isfile(path) == False:
+            print("CARD Databases are missing. Pharokka Database Will be Downloaded")
+            downloaded_flag = False
+            break
+    # annot.tsv
+    path = os.path.join(db_dir,'phrog_annot_v4.tsv')
+    if os.path.isfile(path) == False:
+        print("PHROGs Annotation File Needs to be Downloaded")
+        downloaded_flag = False
+
+    return downloaded_flag
 
-    # download untar -C for specifying the directory
-    sp.call(["tar", "-xzf", os.path.join(db_dir, tarball), "-C", db_dir])
-
-
-def get_phrog_annot_table(db_dir):
-    print("Getting PHROGs Annotation Table")
-    filepath = "https://phrogs.lmge.uca.fr/downloads_from_website/phrog_annot_v4.tsv"
-    file = "phrog_annot_v4.tsv"
-    #if the file already exists
-    if os.path.isfile(os.path.join(db_dir,file)) == True:
-        print("PHROGs annotation file already downloaded")
-    else:
-        try:
-            sp.call(["curl", filepath, "-o", os.path.join(db_dir,file)])
-        except:
-            sys.stderr.write("Error: PHROGs annotation file not found - link likely broken\n")  
-            return 0
-
-def get_vfdb(db_dir):
-    print("Getting VFDB Database")
-    filepath = "http://www.mgc.ac.cn/VFs/Down/VFDB_setB_pro.fas.gz"
-    file = "VFDB_setB_pro.fas.gz"
-    #if the file already exists
-    if os.path.isfile(os.path.join(db_dir,"vfdb", "vfdb")) == True:
-        print("VFDB already downloaded")
-    else:
-        try:
-            instantiate_dir(os.path.join(db_dir, "vfdb"))
-            sp.call(["curl", filepath, "-o", os.path.join(db_dir,"vfdb",file)])
-            sp.Popen(["gunzip",  os.path.join(db_dir,"vfdb", file)], stdout=sp.PIPE)
-            sp.call(["mmseqs", "createdb", os.path.join(db_dir, "vfdb", "VFDB_setB_pro.fas"), os.path.join(db_dir, "vfdb", "vfdb")])
-        except:
-            sys.stderr.write("Error: VFDB  not found - link likely broken\n")  
-            return 0
 
-def get_card(db_dir):
-    print("Getting CARD Database")
-    filepath = "https://card.mcmaster.ca/download/0/broadstreet-v3.2.4.tar.bz2"
-    file = "card.tar.bz2"
-    #if the file already exists
-    if os.path.isfile( os.path.join(db_dir, "CARD_mmseqs", "CARD")) == True:
-        print("CARD already downloaded")
-    else:
-        try:
-            # make the CARD dir
-            instantiate_dir(os.path.join(db_dir, "CARD"))
-            instantiate_dir(os.path.join(db_dir, "CARD_mmseqs"))
-            # download the database 
-            sp.call(["curl", filepath, "-o", os.path.join(db_dir,"CARD",file)])
-            # untar 
-            sp.call(["tar", "-xf", os.path.join(db_dir,"CARD",file), "-C",os.path.join(db_dir,"CARD") ])
-            # create mmseqs db
-            sp.call(["mmseqs", "createdb", os.path.join(db_dir, "CARD", "protein_fasta_protein_homolog_model.fasta"), os.path.join(db_dir, "CARD_mmseqs", "CARD")])
-        except:
-            sys.stderr.write("Error: CARD  not found - link likely broken\n")  
-            return 0
+def get_database_zenodo(db_dir):
+    print("Downloading Pharokka Database")
+    tarball = 'pharokka_v_1.0.0_databases.tar.gz'
+    url = "https://zenodo.org/record/7081772/files/pharokka_database_v1.0.0.tar.gz"
+    try:
+        # remove the directory
+        sp.call(["rm", "-rf", os.path.join(db_dir)])
+        # make db dir
+        sp.call(["mkdir", "-p", os.path.join(db_dir)])
+        # download the tarball
+        sp.call(["curl", url, "-o", os.path.join(db_dir,tarball)])
+        # untar tarball into database directory
+        sp.call(["tar", "-xzf", os.path.join(db_dir, tarball), "-C", db_dir, "--strip-components=1"])
+        # remove tarball
+        sp.call(["rm","-f", os.path.join(db_dir,tarball)])
+    except:
+        sys.stderr.write("Error: Pharokka Database Install Failed.\nPlease try again or use the manual option detailed at https://github.com/gbouras13/pharokka.git\nby downloading from https://zenodo.org/record/7081772/files/pharokka_database_v1.0.0.tar.gz\n")
+        return 0
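
Review note on `bin/databases.py`: `subprocess.call` returns the exit status instead of raising, so the `try`/`except` in `get_database_zenodo` would not fire when `curl` or `tar` exits non-zero. Separately, the three near-identical loops in `check_db_installation` could share one helper. The sketch below shows one possible shape for both ideas. `missing_db_files` and `run_checked` are hypothetical names and are not in the PR.

```python
import os
import subprocess as sp
import sys


def missing_db_files(db_dir, expected_names):
    """Return the expected file names that are not present in db_dir."""
    return [name for name in expected_names
            if not os.path.isfile(os.path.join(db_dir, name))]


def run_checked(cmd):
    """Run cmd; return True on exit status 0, False on failure or a missing binary."""
    try:
        # check=True makes a non-zero exit status raise CalledProcessError.
        sp.run(cmd, check=True)
        return True
    except (sp.CalledProcessError, FileNotFoundError) as err:
        sys.stderr.write(f"Command failed: {err}\n")
        return False
```

With helpers like these, the install step could stop at the first failed command, for example `run_checked(["curl", "--fail", "-L", url, "-o", target])`. Without `--fail`, `curl` exits with status 0 even when the server returns an HTTP error page.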
diff --git a/bin/input_commands.py b/bin/input_commands.py
@@ -23,7 +23,7 @@ def get_input():
 	parser.add_argument('-l', '--locustag', action="store", help='User specified locus tag for the gff/gbk files. This is not required. A random locus tag will be generated instead.',  default='Default')
 	parser.add_argument('-g', '--gene_predictor', action="store", help='User specified gene predictor. Use "-g phanotate" or "-g prodigal". Defaults to phanotate (not required unless prodigal is desired).',  default='phanotate' )
 	parser.add_argument('-m', '--meta', help='Metagenomic option for Prodigal', action="store_true")
-	parser.add_argument('-c', '--coding_table', help='translation table for prodigal', action="store", default = "11")
+	parser.add_argument('-c', '--coding_table', help='translation table for prodigal. Defaults to 11. Experimental only.', action="store", default = "11")
 	parser.add_argument('-e', '--evalue', help='E-value threshold for mmseqs2. Defaults to 1E-05', action="store", default = "1E-05")
 	parser.add_argument('-V', '--version', help='Version', action='version', version=v)
 	args = parser.parse_args()
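
Review note on the argument parser: `--coding_table` and `--evalue` are stored as strings (`"11"`, `"1E-05"`), so a malformed value would only surface later in the pipeline. One possible tightening is to let `argparse` convert and validate them up front. This is a sketch only: it assumes downstream code would accept numeric values, which this diff does not confirm, and `build_parser` and `positive_float` are hypothetical names.

```python
import argparse


def positive_float(text):
    """argparse type: parse text as a float and require it to be > 0."""
    value = float(text)  # a ValueError here is reported by argparse as a usage error
    if value <= 0:
        raise argparse.ArgumentTypeError("E-value must be greater than 0")
    return value


def build_parser():
    """Build a parser holding only the two options discussed (illustrative subset)."""
    parser = argparse.ArgumentParser(description="illustrative subset of options")
    parser.add_argument('-c', '--coding_table', type=int, default=11,
                        help='translation table for prodigal. Defaults to 11.')
    parser.add_argument('-e', '--evalue', type=positive_float, default=1e-05,
                        help='E-value threshold for mmseqs2. Defaults to 1E-05')
    return parser
```

With this shape, `build_parser().parse_args(["-e", "abc"])` exits with a usage message instead of passing the bad string on to mmseqs2.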
@@ -60,6 +60,7 @@ def instantiate_dirs(output_dir, force):
 def validate_fasta(filename):
 	with open(filename, "r") as handle:
 		fasta = SeqIO.parse(handle, "fasta")
+		print("Checking Input FASTA")
 		if any(fasta):
 			print("FASTA checked")
 		else: