New Repo

New Repo location
GW-HIVE · Jul 22, 2019 · 0035341 · 0035341
commit 0035341
Showing 15 changed files with 1,021 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,119 @@
+# Filtered NCBI-nt in FASTA format
+
+Filtered NT dataset is generated by excluding sequences from the whole
+nt file provided by NCBI, based on whether they have unwanted taxonomy 
+names or any child taxonomy name of these unwanted ones. These unwanted 
+taxonomy names are listed in the black list generated by two steps: 
+(1) Getting all taxonomy names which contain the strings listed 
+below (Step 3); (2) Getting all possible child taxonomy names of each 
+of the taxonomy names from (1). For example, "other sequences" 
+(taxId: 28384) is excluded with all its child taxonomy names including 
+"artificial sequence", "vector", "synthetic", and so on.
+
+We have chosen to apply the Creative Commons Attribution 3.0
+Unported License to this version of the software.
+
+
+
+|Version | Downloadable Files | File Size | Release Notes|NCBI Download Date|
+|--------|--------------------|-----------|--------------|------------------|
+|Version 6.0| [Filtered NT v6.0](https://hive.biochemistry.gwu.edu/prd/filterednt//content/filtered_nt_July_2018.fasta)| 168G|[Release Notes v6](https://hive.biochemistry.gwu.edu/filterednt/releasenotesv6)|July 2018|
+|Version 5.0|[Filtered_NT v5.0](https://hive.biochemistry.gwu.edu/prd//filterednt/content/Filtered_NTv5.0.fasta)|131G|[Release Notes v5.0](https://hive.biochemistry.gwu.edu/filterednt/releasenotesv5)|May 2017|
+|Version 4.0| [Filtered NT v4.0](https://hive.biochemistry.gwu.edu/prd//filterednt/content/Filtered_NTv4.0.fasta)|110G|[Release Notes v4.0](https://hive.biochemistry.gwu.edu/filterednt/releasenotesv4)|July 2016|
+
+
+
+
+# Summary of the protocol
+
+************************************************************************
+## Step 1. Download the whole nt file
+************************************************************************
+downloaded from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
+version: 5/21/2017
+command:
+    wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
+    gunzip nt.gz    # 42,439,338 sequences
+
+************************************************************************
+## Step 2. Download the taxonomy list 
+************************************************************************
+downloaded from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/
+version: 5/21/2017; 5/30/2017
+command:
+	wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/*.gz
+	gunzip *.gz
+	wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
+	tar -xzvf taxdump.tar.gz
+location: /data/projects/targetdbs/downloads/
+
+
+************************************************************************
+## Step 3. Generate black list
+************************************************************************
+protocol: unwanted taxonomy names (scientific names) from names.dmp, plus
+		all of their child taxonomy names. A name is unwanted if it
+		contains any of these strings:
+		['unclassified','unidentified','uncultured', \
+			 'unspecified','unknown','phage','vector']
+		['environmental sample','artificial sequence','other sequence']
+
+	  The black list is generated in two steps: first collect all
+		taxonomy names containing the strings above, then collect
+		all child taxonomy names of those.
+
+script: /projects/targetdbs/scripts/get-parent-taxid-of-blacklist.py
+	/projects/targetdbs/scripts/get-child-taxid-of-blacklist.py
+
+output: /data/projects/targetdbs/generated/blacklist-taxId.1.csv
+	/data/projects/targetdbs/generated/blacklist-taxId.2.csv
+
+	After generating blacklist-taxId.2.csv, run "sort -u" on it to
+	remove duplicate records, and store the result in:
+	/data/projects/targetdbs/generated/blacklist-taxId.unique.csv
+
+QC script: /projects/targetdbs/scripts/compare-old-new-blacklist.py
+		Compares the newly generated black list with the older version.
+
+
+************************************************************************
+## Step 4. Check the completion of taxonomy list (QC)
+************************************************************************
+protocol: First check whether every seqAc in the nt file has a taxId in
+	the nucl_gb.accession2taxid file; seqAcs without a taxId there
+	are then looked up in all other ac2taxid files.
+script: /projects/targetdbs/scripts/check-ac2taxid-completion-step1.py
+	/projects/targetdbs/scripts/check-ac2taxid-completion-step2.py
+	/projects/targetdbs/scripts/check-ac2taxid-completion-step3.py
+output: /data/projects/targetdbs/generated/logfile.step1.txt
+	/data/projects/targetdbs/generated/logfile.step2.txt
+	/data/projects/targetdbs/generated/logfile.step3.txt
+
+This step needs a lot of memory; running it on a large-memory machine
+	is recommended.
+	123 PDB accession records have extra characters; this is handled
+	in step3.py.
+	28 records are not in any of the files; their taxIds were looked
+	up manually (/data/projects/targetdbs/generated/ \
+	logfile.step3.manually.added.txt).
+
+
+************************************************************************
+## Step 5. Get the seqAc-taxonomy list
+************************************************************************
+protocol: First collect all seqAc-taxId pairs from nucl_gb.accession2taxid
+	and from all other ac2taxid files of both versions (05/21/2017
+	and 05/30/2017), then exclude the taxIds in the black list.
+script: /projects/targetdbs/scripts/get-seqac2taxid.py
+output: /data/projects/targetdbs/generated/logfile.ac2taxid.list.txt
+QC step: All seqAcs in nt files are mapped to at least one taxId. The
+	number of seqAcs in the list matches the one in nt file.
+	SeqAcs with multiple taxIds are listed in:
+	/data/projects/targetdbs/generated/seqAc-with-multiple-taxids.txt
+
+
+************************************************************************
+## Step 6. Filtering nt file
+************************************************************************
+protocol: Remember to add those manually added ac2taxids.
+script: /projects/targetdbs/scripts/filter-nt.py
+output: /data/projects/targetdbs/generated/filtered_nt_Jun06-2017.fasta
+QC script: /projects/targetdbs/scripts/check-removed-seqacs-count.py
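The two-step black list generation described in Step 3 of the README can be sketched as follows. This is a minimal Python 3 illustration, not the repository's `get-parent-taxid-of-blacklist.py` / `get-child-taxid-of-blacklist.py` scripts, which are not shown here. It assumes that parent/child relations come from a child-to-parent map (as `nodes.dmp` in the NCBI taxdump provides) and that name matching is case-insensitive; both are assumptions. Apart from taxId 28384 ("other sequences"), the taxIds and the tree below are made-up toy data.

```python
from collections import defaultdict, deque

# Substrings listed in Step 3 of the README.
KEYWORDS = [
    "unclassified", "unidentified", "uncultured", "unspecified",
    "unknown", "phage", "vector",
    "environmental sample", "artificial sequence", "other sequence",
]


def seed_taxids(names):
    """Step (1): taxIds whose scientific name contains any keyword.

    `names` is an iterable of (taxid, scientific_name) pairs.
    Case-insensitive matching is an assumption of this sketch.
    """
    return {taxid for taxid, name in names
            if any(key in name.lower() for key in KEYWORDS)}


def add_descendants(seeds, parent_of):
    """Step (2): extend the seed taxIds with all of their descendants.

    `parent_of` maps child taxId -> parent taxId.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if child != parent:  # the root is its own parent
            children[parent].append(child)

    seen = set(seeds)
    queue = deque(seeds)
    while queue:  # breadth-first walk down the taxonomy tree
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy data: 28384 is from the README; the other IDs and links are hypothetical.
names = [
    (28384, "other sequences"),
    (900001, "artificial sequences"),
    (900002, "synthetic construct"),
    (9606, "Homo sapiens"),
]
parent_of = {1: 1, 28384: 1, 900001: 28384, 900002: 900001, 9606: 1}

seeds = seed_taxids(names)
blacklist = add_descendants(seeds, parent_of)
print(sorted(blacklist))
```

In this toy tree "synthetic construct" matches no keyword itself but ends up on the black list because it sits below a matching name, which is the behaviour the README describes for the children of "other sequences".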
diff --git a/Release_note_May31_2017 b/Release_note_May31_2017
@@ -0,0 +1,61 @@
+
+************************************************************************
+* Downloaded Files
+************************************************************************
+1. nt file downloaded on 5/21/2017 
+   ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
+
+	42,439,338 	sequences
+
+
+2. names.dmp downloaded on 5/21/2017 
+   ftp://ftp.ncbi.nih.gov/pub/taxonomy/
+
+	2,383,434 	names
+	1,601,859 	scientific names
+
+
+3. ac2taxid files 
+   ftp://ftp.ncbi.nih.gov/pub/taxonomy/
+
+	#records	file name
+	39,775,235	nucl_gss.accession2taxid.2017-05-21
+	122,045,527	nucl_gb.accession2taxid.2017-05-21
+	76,436,508	nucl_est.accession2taxid.2017-05-21
+	361,700,039	nucl_wgs.accession2taxid.2017-05-21
+	12,406,761	ac2taxid.2017-05-30/dead_nucl.accession2taxid
+	39,775,235	ac2taxid.2017-05-30/nucl_gss.accession2taxid
+	122,236,860	ac2taxid.2017-05-30/nucl_gb.accession2taxid
+	76,436,632	ac2taxid.2017-05-30/nucl_est.accession2taxid
+	66,696,868	ac2taxid.2017-05-30/dead_wgs.accession2taxid
+	381,019		ac2taxid.2017-05-30/pdb.accession2taxid
+	362,474,815	ac2taxid.2017-05-30/nucl_wgs.accession2taxid
+
+
+************************************************************************
+* Filter statistics
+************************************************************************
+Number of taxonomy ids that are in black list is 378,341.
+
+Sequences from a given black list of sources were removed. The sources,
+the number of associated taxonomy IDs, and the number of removed
+sequences are given below.
+
+
+	blackListTaxonomyName	#taxids	#removed sequences
+	=====================	=======	====================
+	unidentified		49	97
+	uncultured		1	2
+	unknown			342	1026
+	unspecified		68	11435
+	unclassified		182192	847187
+	other sequence		12666	233354
+	phage			4594	8445
+	environmental sample	50697	6398042
+	unknown-manually	1	4
+	====================	=====	=======
+	total			250610	7499592
+
+
+The number of sequences in this filtered-nt release is 
+	34,939,806
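The figures in the release note can be cross-checked with a few lines of arithmetic. The sketch below copies the table as given; the variable names are illustrative. The column sums reproduce the reported totals, while subtracting the removed sequences from the 42,439,338 downloaded ones gives 34,939,746, which is 60 fewer than the reported 34,939,806. The release note does not say where that difference comes from.

```python
# (number of taxIds, number of removed sequences) per black list source,
# copied from the "Filter statistics" table.
removed = {
    "unidentified": (49, 97),
    "uncultured": (1, 2),
    "unknown": (342, 1026),
    "unspecified": (68, 11435),
    "unclassified": (182192, 847187),
    "other sequence": (12666, 233354),
    "phage": (4594, 8445),
    "environmental sample": (50697, 6398042),
    "unknown-manually": (1, 4),
}

total_taxids = sum(taxids for taxids, _ in removed.values())
total_removed = sum(seqs for _, seqs in removed.values())

nt_sequences = 42439338        # sequences in the downloaded nt file
reported_remaining = 34939806  # sequences reported for the filtered release

expected_remaining = nt_sequences - total_removed
difference = reported_remaining - expected_remaining

print(total_taxids, total_removed, expected_remaining, difference)
```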
diff --git a/check-ac2taxid-completion-step1.py b/check-ac2taxid-completion-step1.py
@@ -0,0 +1,53 @@
+import csv
+import glob
+
+from Bio import SeqIO
+
+
+__version__="1.0"
+__status__ = "Dev"
+
+
+
+###############################
+def main():
+    # Load every accession listed in nucl_gb.accession2taxid, then report each
+    # nt sequence accession that has no entry there.
+    patList = "/data/projects/targetdbs/filtered-nt/downloads/nucl_*.accession2taxid.2017-05-21"
+    fileList = glob.glob(patList)
+    ntFile = "/data/projects/targetdbs/filtered-nt/downloads/nt.2017-05-21"
+    logFile = "/data/projects/targetdbs/filtered-nt/generated/logfile.step1.txt"
+
+    # Only membership is tested, so a set of accessions is enough.
+    ac2taxid = set()
+    for fileName in fileList:
+        if fileName.find("nucl_gb.accession2taxid") >= 0:
+            with open(fileName, 'r') as csvfile:
+                csvreader = csv.reader(csvfile, delimiter='\t', quotechar='|')
+                for i, row in enumerate(csvreader):
+                    ac2taxid.add(row[0].strip())
+                    if i % 10000000 == 0:
+                        print("Done loading %s %d" % (fileName, i))
+
+    with open(logFile, "w") as FW:
+        for i, record in enumerate(SeqIO.parse(ntFile, "fasta")):
+            # Drop the version suffix, e.g. "X12345.1" -> "X12345".
+            seqAc = record.id.split('.')[0]
+            if seqAc not in ac2taxid:
+                FW.write("No taxid found for: %s\n" % (seqAc))
+            if i % 10000000 == 0:
+                print("Done parsing %d" % i)
+
+
+if __name__ == '__main__':
+    main()
+
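The core of the step 1 check is small enough to sketch with the standard library alone. This Python 3 illustration is not the script above: it reads FASTA headers directly instead of using Biopython, and it assumes headers of the form `>ACCESSION.VERSION description` and an `accession2taxid` file whose first tab-separated column is the unversioned accession under a header row starting with `accession`. The accessions and file contents are made-up toy data.

```python
import io


def load_accessions(handle):
    """Collect the first column of an accession2taxid-style file."""
    accessions = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and the header row.
        if fields[0] and fields[0] != "accession":
            accessions.add(fields[0].strip())
    return accessions


def fasta_accessions(handle):
    """Yield the unversioned accession of every FASTA record."""
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            if parts:
                # "AB000001.1" -> "AB000001"
                yield parts[0].split(".")[0]


# Toy inputs standing in for nucl_gb.accession2taxid and the nt file.
a2t = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "AB000001\tAB000001.1\t9606\t0\n"
)
nt = io.StringIO(
    ">AB000001.1 toy record\nACGT\n"
    ">ZZ999999.2 toy record\nGGCC\n"
)

known = load_accessions(a2t)
missing = [acc for acc in fasta_accessions(nt) if acc not in known]
print(missing)
```

Because only header lines are inspected, the sequence data never has to be held in memory; the accession set is what dominates memory use.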
diff --git a/check-ac2taxid-completion-step2.py b/check-ac2taxid-completion-step2.py
@@ -0,0 +1,54 @@
+import csv
+import glob
+
+
+__version__="1.0"
+__status__ = "Dev"
+
+
+
+###############################
+def main():
+    # Re-check the accessions that step 1 could not find in nucl_gb against
+    # the remaining nucl_* accession2taxid files.
+    patList = "/data/projects/targetdbs/filtered-nt/downloads/nucl_*.accession2taxid.2017-05-21"
+    fileList = glob.glob(patList)
+    # Despite its name, ntFile is the step 1 log, not the nt FASTA file.
+    ntFile = "/data/projects/targetdbs/filtered-nt/generated/logfile.step1.txt"
+
+    # 1 = still unresolved, 2 = found in one of the other files.
+    seqAcDic = {}
+    with open(ntFile, 'r') as csvfile:
+        # Log lines look like "No taxid found for: <seqAc>".
+        csvreader = csv.reader(csvfile, delimiter=':', quotechar='|')
+        for row in csvreader:
+            seqAc = row[1].strip()
+            seqAcDic[seqAc] = 1
+
+    for fileName in fileList:
+        if fileName.find("nucl_gb.accession2taxid") == -1:
+            with open(fileName, 'r') as csvfile:
+                csvreader = csv.reader(csvfile, delimiter='\t', quotechar='|')
+                for i, row in enumerate(csvreader):
+                    seqAc = row[0].strip()
+                    if seqAc in seqAcDic:
+                        seqAcDic[seqAc] = 2
+                    if i % 10000000 == 0:
+                        print("Data reading %s %d" % (fileName, i))
+
+    with open("/data/projects/targetdbs/filtered-nt/generated/logfile.step2.txt", "w") as FW:
+        for key, val in seqAcDic.items():
+            if val == 1:
+                FW.write("No taxid found for: %s\n" % (key))
+
+
+if __name__ == '__main__':
+    main()
+
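Step 2 marks accessions in a dictionary with the values 1 and 2. The same bookkeeping can be expressed as a set from which resolved accessions are discarded, as in this minimal Python 3 sketch. It is an alternative formulation rather than the script's own code; the function names and the toy accessions are hypothetical, and the log format `No taxid found for: <seqAc>` is taken from the scripts in this commit.

```python
def read_unresolved(lines):
    """Parse 'No taxid found for: <seqAc>' log lines into a set."""
    unresolved = set()
    for line in lines:
        _, sep, acc = line.partition(":")
        acc = acc.strip()
        if sep and acc:  # ignore blank or malformed lines
            unresolved.add(acc)
    return unresolved


def still_unresolved(unresolved, rows):
    """Drop every accession that appears in column 0 of `rows`."""
    remaining = set(unresolved)
    for row in rows:
        if row:
            remaining.discard(row[0].strip())
    return remaining


# Toy stand-ins for logfile.step1.txt and one accession2taxid file.
log_lines = [
    "No taxid found for: AAA00001\n",
    "No taxid found for: BBB00002\n",
    "\n",
]
rows = [
    ["accession", "accession.version", "taxid", "gi"],
    ["AAA00001", "AAA00001.1", "12345", "0"],
]

unresolved = read_unresolved(log_lines)
remaining = still_unresolved(unresolved, rows)
print(sorted(remaining))
```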
diff --git a/check-ac2taxid-completion-step3.py b/check-ac2taxid-completion-step3.py
@@ -0,0 +1,56 @@
+import csv
+import glob
+
+
+__version__="1.0"
+__status__ = "Dev"
+
+
+
+###############################
+def main():
+    # Re-check the accessions left over from step 2 against the files of the
+    # 2017-05-30 download that were not covered by steps 1 and 2.
+    patList = "/data/projects/targetdbs/filtered-nt/downloads/ac2taxid.2017-05-30/*"
+    fileList = glob.glob(patList)
+    # Step 2 writes its log to generated/, so read it from there.
+    ntFile = "/data/projects/targetdbs/filtered-nt/generated/logfile.step2.txt"
+    passFile = ['nucl_wgs.accession2taxid', 'nucl_est.accession2taxid', 'nucl_gb.accession2taxid','nucl_gss.accession2taxid']
+
+    # 1 = still unresolved, 2 = found in one of the files.
+    seqAcDic = {}
+    with open(ntFile, 'r') as csvfile:
+        csvreader = csv.reader(csvfile, delimiter=':', quotechar='|')
+        for row in csvreader:
+            seqAc = row[1].strip()
+            seqAcDic[seqAc] = 1
+
+    for fileName in fileList:
+        if fileName.split('/')[-1] not in passFile:
+            isPdb = fileName.find("pdb.accession2taxid") >= 0
+            with open(fileName, 'r') as csvfile:
+                csvreader = csv.reader(csvfile, delimiter='\t', quotechar='|')
+                for i, row in enumerate(csvreader):
+                    seqAc = row[0].strip()
+                    if seqAc in seqAcDic:
+                        seqAcDic[seqAc] = 2
+                    # PDB "extra characters" case: also match the accession
+                    # with its last character repeated.
+                    if isPdb and seqAc and seqAc + seqAc[-1] in seqAcDic:
+                        seqAcDic[seqAc + seqAc[-1]] = 2
+                    if i % 10000000 == 0:
+                        print("Data reading %s %d" % (fileName, i))
+
+    with open("/data/projects/targetdbs/filtered-nt/generated/logfile.step3.txt", "w") as FW:
+        for key, val in seqAcDic.items():
+            if val == 1:
+                FW.write("No taxid found for: %s\n" % (key))
+
+
+if __name__ == '__main__':
+    main()
+
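The PDB special case in step 3 looks up each accession from `pdb.accession2taxid` twice: once as written and once with its last character appended again. A minimal Python 3 sketch of just that matching rule follows; the function names and the accessions are hypothetical toy values, and the reading that the unresolved PDB accessions carry a doubled final character is inferred from the script, not stated in the README.

```python
def pdb_candidates(accession):
    """Forms under which a PDB accession may appear in the unresolved list."""
    if not accession:
        return []
    # As written, and with the last character repeated ("1ABC_A" -> "1ABC_AA").
    return [accession, accession + accession[-1]]


def resolve_pdb(unresolved, pdb_accessions):
    """Return the unresolved accessions that no PDB entry accounts for."""
    remaining = set(unresolved)
    for accession in pdb_accessions:
        for candidate in pdb_candidates(accession):
            remaining.discard(candidate)
    return remaining


# Toy data: one doubled-suffix match, one exact match, one miss.
unresolved = {"1ABC_AA", "2XYZ_B", "9QQQ_C"}
pdb_accessions = ["1ABC_A", "2XYZ_B"]

remaining = resolve_pdb(unresolved, pdb_accessions)
print(sorted(remaining))
```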