Commit 0035341: New Repo location
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
0 parents
Showing 15 changed files with 1,021 additions and 0 deletions.
@@ -0,0 +1,119 @@
# Filtered NCBI-nt in FASTA format

The Filtered NT dataset is generated by excluding sequences from the whole
nt file provided by NCBI when they are assigned an unwanted taxonomy
name or any child taxonomy name of an unwanted one. The unwanted
taxonomy names are listed in a black list that is generated in two steps:
(1) collect all taxonomy names that contain any of the strings listed
below (Step 3); (2) collect all child taxonomy names of each
taxonomy name from (1). For example, "other sequences"
(taxId: 28384) is excluded together with all its child taxonomy names,
including "artificial sequence", "vector", "synthetic", and so on.

We have chosen to apply the Creative Commons Attribution 3.0
Unported License to this version of the software.

| Version | Downloadable Files | File Size | Release Notes | NCBI Download Date |
|---------|--------------------|-----------|---------------|--------------------|
| Version 6.0 | [Filtered NT v6.0](https://hive.biochemistry.gwu.edu/prd/filterednt//content/filtered_nt_July_2018.fasta) | 168G | [Release Notes v6](https://hive.biochemistry.gwu.edu/filterednt/releasenotesv6) | July 2018 |
| Version 5.0 | [Filtered_NT v5.0](https://hive.biochemistry.gwu.edu/prd//filterednt/content/Filtered_NTv5.0.fasta) | 131G | [Release Notes v5.0](https://hive.biochemistry.gwu.edu/filterednt/releasenotesv5) | May 2017 |
| Version 4.0 | [Filtered NT v4.0](https://hive.biochemistry.gwu.edu/prd//filterednt/content/Filtered_NTv4.0.fasta) | 110G | [Release Notes v4.0](https://hive.biochemistry.gwu.edu/filterednt/releasenotesv4) | July 2016 |

# Summary of the protocol

************************************************************************
## Step 1. Download the whole nt file
************************************************************************
downloaded from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
version: 5/21/2017
command:
    wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
    gunzip nt.gz    # 42,439,338 sequences

************************************************************************
## Step 2. Download the taxonomy list
************************************************************************
downloaded from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/
version: 5/21/2017; 5/30/2017
command:
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/*.gz
    gunzip *.gz
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
    tar -xzvf taxdump.tar.gz
location: /data/projects/targetdbs/downloads/

************************************************************************
## Step 3. Generate black list
************************************************************************
protocol: The black list contains the unwanted taxonomy names (scientific
    names) from names.dmp and all of their child taxonomy names. A name
    is unwanted if it contains any of these strings:
    ['unclassified','unidentified','uncultured', \
     'unspecified','unknown','phage','vector']
    ['environmental sample','artificial sequence','other sequence']

    The black list is generated in two steps: first collect all
    taxonomy names that contain the strings above, then
    collect all child taxonomy names of them.

script: /projects/targetdbs/scripts/get-parent-taxid-of-blacklist.py
    /projects/targetdbs/scripts/get-child-taxid-of-blacklist.py

output: /data/projects/targetdbs/generated/blacklist-taxId.1.csv
    /data/projects/targetdbs/generated/blacklist-taxId.2.csv

    After generating blacklist-taxId.2.csv, run "sort -u" on the
    command line to delete duplicated records, and store the result in:
    /data/projects/targetdbs/generated/blacklist-taxId.unique.csv

QC script: /projects/targetdbs/scripts/compare-old-new-blacklist.py
    Compares the newly generated black list with the older version.

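The two-step black list generation described above can be sketched as follows. This is a minimal illustration, not the content of the `get-parent-taxid-of-blacklist.py` or `get-child-taxid-of-blacklist.py` scripts, which are not shown here. It assumes the standard taxdump layout (fields separated by a tab, a pipe, and a tab; `names.dmp` holding taxId, name, unique name, and name class; `nodes.dmp` holding taxId and parent taxId in its first two columns). The function names and the case-insensitive substring match are assumptions made for the example.

```python
from collections import defaultdict, deque

# Substrings from Step 3 of the protocol.
UNWANTED = ["unclassified", "unidentified", "uncultured", "unspecified",
            "unknown", "phage", "vector",
            "environmental sample", "artificial sequence", "other sequence"]

SEPARATOR = "\t|\t"


def parse_dmp_line(line):
    """Split one taxdump line into its fields."""
    stripped = line.rstrip("\n")
    if stripped.endswith("\t|"):
        stripped = stripped[:-2]  # drop the end-of-row marker
    return [field.strip() for field in stripped.split(SEPARATOR)]


def seed_taxids(names_lines, patterns=UNWANTED):
    """Step (1): taxIds whose scientific name contains an unwanted string."""
    seeds = set()
    for line in names_lines:
        tax_id, name, _unique, name_class = parse_dmp_line(line)[:4]
        if name_class != "scientific name":
            continue
        lowered = name.lower()  # assumption: match case-insensitively
        if any(pattern in lowered for pattern in patterns):
            seeds.add(int(tax_id))
    return seeds


def expand_children(nodes_lines, seeds):
    """Step (2): add every descendant of the seed taxIds."""
    children = defaultdict(list)
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        tax_id, parent = int(fields[0]), int(fields[1])
        if tax_id != parent:  # the root node lists itself as parent
            children[parent].append(tax_id)
    blacklist = set(seeds)
    queue = deque(seeds)
    while queue:  # breadth-first walk down the taxonomy tree
        for child in children.get(queue.popleft(), ()):
            if child not in blacklist:
                blacklist.add(child)
                queue.append(child)
    return blacklist
```

Both functions accept any iterable of lines, so they can be called with open file handles for `names.dmp` and `nodes.dmp`. Collecting the result in a set also removes duplicates, which is the role "sort -u" plays in the protocol.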
************************************************************************
## Step 4. Check the completion of taxonomy list (QC)
************************************************************************
protocol: First check whether all seqAcs in the nt file have taxIds in the
    nucl_gb.accession2taxid file; the ones that do not
    are then checked against all other ac2taxid files.
script: /projects/targetdbs/scripts/check-ac2taxid-completion-step1.py
    /projects/targetdbs/scripts/check-ac2taxid-completion-step2.py
    /projects/targetdbs/scripts/check-ac2taxid-completion-step3.py
output: /data/projects/targetdbs/generated/logfile.step1.txt
    /data/projects/targetdbs/generated/logfile.step2.txt
    /data/projects/targetdbs/generated/logfile.step3.txt

    This step needs a lot of memory; running it on a large machine is
    suggested. 123 PDB accession records have extra characters; this
    is handled in step3.py.
    However, 28 records are not in any of the files; taxIds for them
    were looked up manually (/data/projects/targetdbs/generated/ \
    logfile.step3.manually.added.txt).

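Because this step is memory-hungry, one possible way to reduce the footprint is to keep only the nt accessions in memory and stream the accession2taxid tables past them. The sketch below illustrates that idea under stated assumptions: it is not the logic of the three `check-ac2taxid-completion` scripts, the function names are hypothetical, and it assumes the first whitespace-delimited token of each FASTA header is the versioned accession and the first tab-separated column of each table is the version-less accession.

```python
def fasta_accessions(fasta_lines):
    """Yield the version-less accession of every FASTA header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id = line[1:].split()[0]   # first token after '>'
            yield seq_id.split(".")[0]     # drop the ".version" suffix


def missing_accessions(fasta_lines, ac2taxid_tables):
    """Return nt accessions that appear in none of the given tables.

    Each table is an iterable of tab-separated lines whose first column
    is the version-less accession.
    """
    missing = set(fasta_accessions(fasta_lines))
    for table in ac2taxid_tables:
        for line in table:
            missing.discard(line.split("\t", 1)[0].strip())
        if not missing:
            break  # every accession has been accounted for
    return missing
```

With this layout the set holds one entry per nt sequence (42,439,338 in this release) rather than one entry per row of `nucl_gb.accession2taxid` (122,045,527 rows), and it shrinks as tables are scanned.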
************************************************************************
## Step 5. Get the seqAc-taxonomy list
************************************************************************
protocol: Get all seqAc-taxId pairs from nucl_gb.accession2taxid and from
    all other ac2taxid files, using both versions (05/21/2017 and
    05/30/2017). Exclude the taxIds that are in the black list.
script: /projects/targetdbs/scripts/get-seqac2taxid.py
output: /data/projects/targetdbs/generated/logfile.ac2taxid.list.txt
QC step: All seqAcs in the nt file are mapped to at least one taxId. The
    number of seqAcs in the list matches the number in the nt file.
    SeqAcs with multiple taxIds are listed in:
    /data/projects/targetdbs/generated/seqAc-with-multiple-taxids.txt

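A rough sketch of how a seqAc-to-taxId list could be assembled from several accession2taxid tables, including the QC listing of accessions with more than one taxId, is given below. It is an illustration only: `get-seqac2taxid.py` is not shown in this commit view, the function names are hypothetical, and the column layout (accession, accession.version, taxid, gi) is assumed from the standard NCBI accession2taxid files.

```python
from collections import defaultdict


def build_ac2taxids(wanted, ac2taxid_tables):
    """Map each wanted accession to the set of taxIds seen for it.

    `wanted` is the set of version-less accessions found in the nt file;
    each table is an iterable of tab-separated lines.
    """
    mapping = defaultdict(set)
    for table in ac2taxid_tables:
        for line in table:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3 or not fields[2].isdigit():
                continue  # skip the header row and malformed lines
            if fields[0] in wanted:
                mapping[fields[0]].add(int(fields[2]))
    return mapping


def multiple_taxid_accessions(mapping):
    """QC helper: accessions that were assigned more than one taxId."""
    return {ac: ids for ac, ids in mapping.items() if len(ids) > 1}
```

Using a set per accession means that the same pair appearing in both the 05/21/2017 and 05/30/2017 downloads is stored once, while a changed assignment between the two versions shows up in `multiple_taxid_accessions`.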
************************************************************************
## Step 6. Filtering nt file
************************************************************************
protocol: Remember to include the manually added ac2taxids.
script: /projects/targetdbs/scripts/filter-nt.py
output: /data/projects/targetdbs/generated/filtered_nt_Jun06-2017.fasta
QC script: /projects/targetdbs/scripts/check-removed-seqacs-count.py
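The filtering step itself can be pictured as a single streaming pass over the FASTA file. The sketch below is not `filter-nt.py` (which is not shown here); the function name is hypothetical, and two policies are assumptions made for the example: a record is dropped if any of its taxIds is in the black list, and a record with no known taxId is dropped as well.

```python
def filter_fasta(fasta_lines, ac2taxids, blacklist):
    """Yield the lines of every record that passes the black list.

    ac2taxids maps a version-less accession to a collection of taxIds
    (manually added pairs should already be merged in); blacklist is a
    set of taxIds.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            accession = line[1:].split()[0].split(".")[0]
            taxids = ac2taxids.get(accession, ())
            # Assumed policy: drop unmapped records and any record with
            # at least one black-listed taxId.
            keep = bool(taxids) and not any(t in blacklist for t in taxids)
        if keep:
            yield line
```

Because the function is a generator, the output can be written line by line (for example with `out.writelines(filter_fasta(...))`) without holding the whole nt file in memory.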
@@ -0,0 +1,61 @@

************************************************************************
* Downloaded Files
************************************************************************
1. nt file downloaded on 5/21/2017
   (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/)

   42,439,338 sequences

2. names.dmp downloaded on 5/21/2017
   ftp://ftp.ncbi.nih.gov/pub/taxonomy/

   2,383,434 names
   1,601,859 scientific names

3. ac2taxid files
   ftp://ftp.ncbi.nih.gov/pub/taxonomy/

       #records  file name
     39,775,235  nucl_gss.accession2taxid.2017-05-21
    122,045,527  nucl_gb.accession2taxid.2017-05-21
     76,436,508  nucl_est.accession2taxid.2017-05-21
    361,700,039  nucl_wgs.accession2taxid.2017-05-21
     12,406,761  ac2taxid.2017-05-30/dead_nucl.accession2taxid
     39,775,235  ac2taxid.2017-05-30/nucl_gss.accession2taxid
    122,236,860  ac2taxid.2017-05-30/nucl_gb.accession2taxid
     76,436,632  ac2taxid.2017-05-30/nucl_est.accession2taxid
     66,696,868  ac2taxid.2017-05-30/dead_wgs.accession2taxid
        381,019  ac2taxid.2017-05-30/pdb.accession2taxid
    362,474,815  ac2taxid.2017-05-30/nucl_wgs.accession2taxid

************************************************************************
* Filter statistics
************************************************************************
The number of taxonomy IDs in the black list is 378,341.

Sequences from a given black list of sources were removed. The list
of sources, the number of associated taxonomy IDs, and the number
of removed sequences are given below.

blackListTaxonomyName    #taxids    #removed sequences
=====================    =======    ==================
unidentified                  49                    97
uncultured                     1                     2
unknown                      342                  1026
unspecified                   68                 11435
unclassified              182192                847187
other sequence             12666                233354
phage                       4594                  8445
environmental sample       50697               6398042
unknown-manually               1                     4
=====================    =======    ==================
total                     250610               7499592

The number of sequences in this filtered-nt release is
34,939,806
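As a quick sanity check, the totals in the table above can be recomputed from the per-source rows. The snippet below only re-adds numbers printed in these notes; the variable names are illustrative. The row sums reproduce both reported totals. Subtracting the removed sequences from the 42,439,338 downloaded sequences gives 34,939,746, which is 60 fewer than the 34,939,806 reported for the release; these notes do not say what accounts for that difference.

```python
# (taxIds, removed sequences) per black list source, copied from the table.
removed = {
    "unidentified": (49, 97),
    "uncultured": (1, 2),
    "unknown": (342, 1026),
    "unspecified": (68, 11435),
    "unclassified": (182192, 847187),
    "other sequence": (12666, 233354),
    "phage": (4594, 8445),
    "environmental sample": (50697, 6398042),
    "unknown-manually": (1, 4),
}

total_taxids = sum(taxids for taxids, _ in removed.values())
total_removed = sum(seqs for _, seqs in removed.values())
downloaded = 42439338          # sequences in the nt download
reported_release = 34939806    # release size stated in these notes

print(total_taxids)                # 250610
print(total_removed)               # 7499592
print(downloaded - total_removed)  # 34939746
print(reported_release - (downloaded - total_removed))  # 60
```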
@@ -0,0 +1,53 @@
from __future__ import print_function

import csv
import glob

from Bio import SeqIO

__version__ = "1.0"
__status__ = "Dev"


###############################
def main():

    patList = "/data/projects/targetdbs/filtered-nt/downloads/nucl_*.accession2taxid.2017-05-21"
    fileList = glob.glob(patList)
    ntFile = "/data/projects/targetdbs/filtered-nt/downloads/nt.2017-05-21"

    FW = open("/data/projects/targetdbs/filtered-nt/generated/logfile.step1.txt", "w")

    # Load every accession listed in nucl_gb.accession2taxid.
    ac2taxid = {}
    for fileName in fileList:
        if fileName.find("nucl_gb.accession2taxid") >= 0:
            i = 0
            with open(fileName, 'r') as csvfile:
                csvreader = csv.reader(csvfile, delimiter='\t', quotechar='|')
                for row in csvreader:
                    seqAc = row[0].strip()
                    ac2taxid[seqAc] = 1
                    if i%10000000 == 0:
                        print("Done loading ", fileName, i)
                    i += 1

    # Log every nt accession (version suffix removed) that was not loaded.
    i = 0
    for record in SeqIO.parse(ntFile, "fasta"):
        seqAc = record.id
        seqAc = seqAc.split('.')[0]
        if seqAc not in ac2taxid:
            FW.write("No taxid found for: %s\n" % (seqAc))
        if i%10000000 == 0:
            print("Done parsing", i)
        i += 1

    FW.close()


if __name__ == '__main__':
    main()

@@ -0,0 +1,54 @@
from __future__ import print_function

import csv
import glob

__version__ = "1.0"
__status__ = "Dev"


###############################
def main():

    patList = "/data/projects/targetdbs/filtered-nt/downloads/nucl_*.accession2taxid.2017-05-21"
    fileList = glob.glob(patList)
    # Despite its name, ntFile is the step 1 log of unresolved accessions.
    ntFile = "/data/projects/targetdbs/filtered-nt/generated/logfile.step1.txt"

    # Each log line reads "No taxid found for: <seqAc>"; mark all as unresolved (1).
    seqAcDic = {}
    with open(ntFile, 'r') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=':', quotechar='|')
        for row in csvreader:
            seqAc = row[1].strip()
            seqAcDic[seqAc] = 1

    FW = open("/data/projects/targetdbs/filtered-nt/generated/logfile.step2.txt", "w")
    ac2taxid = {}
    for fileName in fileList:
        # nucl_gb was already checked in step 1; scan the other nucl_* files.
        if fileName.find("nucl_gb.accession2taxid") == -1:
            i = 0
            with open(fileName, 'r') as csvfile:
                csvreader = csv.reader(csvfile, delimiter='\t', quotechar='|')
                for row in csvreader:
                    seqAc = row[0].strip()
                    if seqAc in seqAcDic:
                        seqAcDic[seqAc] = 2   # resolved
                    if i%10000000 == 0:
                        print("Data reading ", fileName, i)
                    i += 1

    for key,val in seqAcDic.items():
        if val == 1:
            FW.write("No taxid found for: %s\n" % (key))

    FW.close()

if __name__ == '__main__':
    main()

@@ -0,0 +1,56 @@
from __future__ import print_function

import csv
import glob

__version__ = "1.0"
__status__ = "Dev"


###############################
def main():

    patList = "/data/projects/targetdbs/filtered-nt/downloads/ac2taxid.2017-05-30/*"
    fileList = glob.glob(patList)
    # Step 2 writes its log under generated/, so read it from there.
    ntFile = "/data/projects/targetdbs/filtered-nt/generated/logfile.step2.txt"
    # Files already covered by steps 1 and 2.
    passFile = ['nucl_wgs.accession2taxid', 'nucl_est.accession2taxid', 'nucl_gb.accession2taxid','nucl_gss.accession2taxid']

    # Each log line reads "No taxid found for: <seqAc>"; mark all as unresolved (1).
    seqAcDic = {}
    with open(ntFile, 'r') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=':', quotechar='|')
        for row in csvreader:
            seqAc = row[1].strip()
            seqAcDic[seqAc] = 1

    FW = open("/data/projects/targetdbs/filtered-nt/generated/logfile.step3.txt", "w")
    ac2taxid = {}
    for fileName in fileList:
        i = 0
        if fileName.split('/')[-1] not in passFile:
            with open(fileName, 'r') as csvfile:
                csvreader = csv.reader(csvfile, delimiter='\t', quotechar='|')
                for row in csvreader:
                    seqAc = row[0].strip()
                    if seqAc in seqAcDic:
                        seqAcDic[seqAc] = 2   # resolved
                    # PDB fix: the nt accession repeats the final character.
                    if seqAc+seqAc[-1] in seqAcDic and fileName.find("pdb.accession2taxid") >= 0:
                        seqAcDic[seqAc+seqAc[-1]] = 2
                    if i%10000000 == 0:
                        print("Data reading ", fileName, i)
                    i += 1

    for key,val in seqAcDic.items():
        if val == 1:
            FW.write("No taxid found for: %s\n" % (key))

    FW.close()

if __name__ == '__main__':
    main()

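The PDB handling in this script rests on one observation: an unresolved nt accession may equal a `pdb.accession2taxid` accession with its last character repeated. The helper below restates that rule in isolation as a sketch. The function name and the example identifiers in the usage note are hypothetical and are not taken from the data.

```python
def resolve_pdb_accession(nt_accession, pdb_accessions):
    """Return the matching pdb.accession2taxid key, or None.

    pdb_accessions is the set of accessions read from the PDB table.
    """
    if nt_accession in pdb_accessions:
        return nt_accession
    # Mirror of the script's `seqAc + seqAc[-1]` test: the nt accession
    # carries the final character of the PDB accession twice.
    if len(nt_accession) >= 2 and nt_accession[-1] == nt_accession[-2]:
        trimmed = nt_accession[:-1]
        if trimmed in pdb_accessions:
            return trimmed
    return None
```

For instance, with a made-up table entry `1ABC_A`, the made-up nt accession `1ABC_AA` would resolve to `1ABC_A`, while `1ABC_B` would stay unresolved.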