New Repo
New Repo location
HadleyKing committed Jul 22, 2019
0 parents commit 0035341
Showing 15 changed files with 1,021 additions and 0 deletions.
119 changes: 119 additions & 0 deletions README.md
@@ -0,0 +1,119 @@
# Filtered NCBI-nt in FASTA format

The filtered NT dataset is generated by excluding sequences from the
full nt file provided by NCBI when they carry an unwanted taxonomy
name or any child taxonomy name of an unwanted one. The unwanted
taxonomy names are collected in a black list that is built in two steps:
(1) get all taxonomy names that contain any of the strings listed
below (Step 3); (2) get all child taxonomy names of each
taxonomy name from (1). For example, "other sequences"
(taxId: 28384) is excluded together with all its child taxonomy names,
including "artificial sequence", "vector", "synthetic", and so on.

We have chosen to apply the Creative Commons Attribution 3.0
Unported License to this version of the software.



|Version | Downloadable Files | File Size | Release Notes|NCBI Download Date|
|--------|--------------------|-----------|--------------|------------------|
|Version 6.0| [Filtered NT v6.0](https://hive.biochemistry.gwu.edu/prd/filterednt//content/filtered_nt_July_2018.fasta)| 168G|[Release Notes v6](https://hive.biochemistry.gwu.edu/filterednt/releasenotesv6)|July 2018|
|Version 5.0|[Filtered_NT v5.0](https://hive.biochemistry.gwu.edu/prd//filterednt/content/Filtered_NTv5.0.fasta)|131G|[Release Notes v5.0](https://hive.biochemistry.gwu.edu/filterednt/releasenotesv5)|May 2017|
|Version 4.0| [Filtered NT v4.0](https://hive.biochemistry.gwu.edu/prd//filterednt/content/Filtered_NTv4.0.fasta)|110G|[Release Notes v4.0](https://hive.biochemistry.gwu.edu/filterednt/releasenotesv4)|July 2016|




# Summary of the protocol

************************************************************************
## Step 1. Download the whole nt file
************************************************************************
downloaded from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
version: 5/21/2017
command:
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz (42,439,338 rows)
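
The count quoted above can be cross-checked after decompression. The following is a minimal sketch, assuming the figure refers to FASTA records (header lines starting with `>`), which is how the release note reports it. The function name `count_fasta_records` is hypothetical and not part of the original scripts.

```python
def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    count = 0
    with open(path, "r") as handle:
        for line in handle:  # stream line by line; the nt file is very large
            if line.startswith(">"):
                count += 1
    return count
```

Calling `count_fasta_records("nt")` on the decompressed file should then return the number of sequences in that download.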

************************************************************************
## Step 2. Download the taxonomy list
************************************************************************
downloaded from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/
version: 5/21/2017; 5/30/2017
command:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/*.gz
gunzip *.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
gunzip -c taxdump.tar.gz | tar -xvf -
location: /data/projects/targetdbs/downloads/
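
Once decompressed, each accession2taxid file is a tab-separated table. The sketch below shows one way to read it, assuming the usual NCBI column order (`accession`, `accession.version`, `taxid`, `gi`) with a header row. The helper name `iter_accession2taxid` is hypothetical.

```python
import csv


def iter_accession2taxid(path):
    """Yield (accession, taxid) pairs from one decompressed accession2taxid file."""
    with open(path, "r", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for row in reader:
            # Skip blank or short lines and the header row (assumed to start with "accession").
            if len(row) < 3 or row[0] == "accession":
                continue
            yield row[0], int(row[2])
```

A generator keeps memory use low, which matters because some of these files hold hundreds of millions of records.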


************************************************************************
## Step 3. Generate black list
************************************************************************
protocol: unwanted taxonomy names (scientific names) from names.dmp,
together with all of their child taxonomy names, are those containing:
['unclassified','unidentified','uncultured', \
'unspecified','unknown','phage','vector']
['environmental sample','artificial sequence','other sequence']

The black list is generated in two steps: first, get all
taxonomy names that contain the strings above; then
get all child taxonomy names of them.

script: /projects/targetdbs/scripts/get-parent-taxid-of-blacklist.py
/projects/targetdbs/scripts/get-child-taxid-of-blacklist.py

output: /data/projects/targetdbs/generated/blacklist-taxId.1.csv
/data/projects/targetdbs/generated/blacklist-taxId.2.csv

After generating blacklist-taxId.2.csv, use the command
"sort -u" to remove duplicate records, and store the result in:
/data/projects/targetdbs/generated/blacklist-taxId.unique.csv

QC script: /projects/targetdbs/scripts/compare-old-new-blacklist.py
Compare the newly generated black list with the older version.
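
The two-step black list construction can be sketched as follows. This is an illustration, not the content of the scripts named above, which is not shown here. It assumes the standard taxdump layout (fields separated by tab, pipe, tab; `names.dmp` with tax_id, name, unique name, name class; `nodes.dmp` with tax_id and parent tax_id first) and assumes case-insensitive substring matching. All function names are hypothetical.

```python
from collections import defaultdict, deque

UNWANTED = ["unclassified", "unidentified", "uncultured", "unspecified",
            "unknown", "phage", "vector",
            "environmental sample", "artificial sequence", "other sequence"]


def split_dmp(line):
    # Taxdump lines end with tab+pipe and separate fields with tab+pipe+tab.
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def seed_taxids(names_lines, patterns=UNWANTED):
    """Step 1: taxIds whose scientific name contains an unwanted string."""
    seeds = set()
    for line in names_lines:
        fields = split_dmp(line)
        if len(fields) >= 4 and fields[3] == "scientific name":
            name = fields[1].lower()  # assumption: matching ignores case
            if any(p in name for p in patterns):
                seeds.add(int(fields[0]))
    return seeds


def expand_children(seeds, nodes_lines):
    """Step 2: add every descendant of the seed taxIds (breadth-first)."""
    children = defaultdict(list)
    for line in nodes_lines:
        fields = split_dmp(line)
        tax_id, parent_id = int(fields[0]), int(fields[1])
        if tax_id != parent_id:  # the root node lists itself as parent
            children[parent_id].append(tax_id)
    blacklist = set(seeds)
    queue = deque(seeds)
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in blacklist:
                blacklist.add(child)
                queue.append(child)
    return blacklist
```

Collecting the result in a set removes duplicates in the same way the `sort -u` pass does for the CSV output.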


************************************************************************
## Step 4. Check the completion of taxonomy list (QC)
************************************************************************
protocol: First check whether all seqAcs in the nt file have taxIds in the
nucl_gb.accession2taxid file; those without taxIds
are then checked against all other ac2taxid files.
script: /projects/targetdbs/scripts/check-ac2taxid-completion-step1.py
/projects/targetdbs/scripts/check-ac2taxid-completion-step2.py
/projects/targetdbs/scripts/check-ac2taxid-completion-step3.py
output: /data/projects/targetdbs/generated/logfile.step1.txt
/data/projects/targetdbs/generated/logfile.step2.txt
/data/projects/targetdbs/generated/logfile.step3.txt

This step needs a lot of memory; running it on a large machine is suggested.
123 PDB accession records have extra characters; this is handled
in step3.py.
However, 28 records are not in any of the files; their taxIds were
searched manually (/data/projects/targetdbs/generated/ \
logfile.step3.manually.added.txt).
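
The cascade described in this step can be summarized in a few lines. This is a simplified sketch with in-memory sets, not the three scripts themselves, which stream the files and write a log between stages. The names `strip_version` and `find_unmapped` are hypothetical.

```python
def strip_version(seq_id):
    """Drop the version suffix, e.g. 'X17276.1' becomes 'X17276'."""
    return seq_id.split(".")[0]


def find_unmapped(nt_ids, primary, fallbacks):
    """Return accessions found neither in the primary set nor in any fallback set."""
    # Stage 1: check against the primary table (nucl_gb in the protocol).
    missing = [strip_version(s) for s in nt_ids]
    missing = [ac for ac in missing if ac not in primary]
    # Later stages: check what is still missing against each remaining table.
    for fallback in fallbacks:
        missing = [ac for ac in missing if ac not in fallback]
    return missing
```

Whatever remains after the last stage corresponds to the records that had to be looked up manually.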


************************************************************************
## Step 5. Get the seqAc-taxonomy list
************************************************************************
protocol: First get all seqAc-taxId pairs from nucl_gb.accession2taxid
and from all other ac2taxid files of both versions (05/21/2017 and
05/30/2017), then exclude the taxIds that are in the blacklist.
script: /projects/targetdbs/scripts/get-seqac2taxid.py
output: /data/projects/targetdbs/generated/logfile.ac2taxid.list.txt
QC step: All seqAcs in the nt file are mapped to at least one taxId, and the
number of seqAcs in the list matches the number in the nt file.
SeqAcs with multiple taxIds are listed in:
/data/projects/targetdbs/generated/seqAc-with-multiple-taxids.txt
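
Grouping the pairs by accession makes the QC checks straightforward. The sketch below is an assumption about how such a list could be built, since get-seqac2taxid.py is not shown here; it covers only the grouping and the multiple-taxId report and leaves the blacklist handling out. Function names are hypothetical.

```python
from collections import defaultdict


def build_seqac2taxids(pairs):
    """Group taxIds by accession from an iterable of (seqAc, taxId) pairs."""
    mapping = defaultdict(set)
    for seq_ac, tax_id in pairs:
        mapping[seq_ac].add(tax_id)  # a set ignores repeats across file versions
    return mapping


def with_multiple_taxids(mapping):
    """Return only the accessions that map to more than one taxId."""
    return {ac: sorted(ids) for ac, ids in mapping.items() if len(ids) > 1}
```

The first QC check then amounts to comparing `len(mapping)` with the number of accessions in the nt file.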


************************************************************************
## Step 6. Filtering nt file
************************************************************************
protocol: Remember to include the manually added ac2taxids from Step 4.
script: /projects/targetdbs/scripts/filter-nt.py
output: /data/projects/targetdbs/generated/filtered_nt_Jun06-2017.fasta
QC script: /projects/targetdbs/scripts/check-removed-seqacs-count.py
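
The filtering pass itself can be sketched as a single streaming loop. This is an illustration under stated assumptions, not the content of filter-nt.py: only the first identifier on each header line is used, its version suffix is dropped, and accessions without a known taxId are kept. The function name `filter_fasta` is hypothetical.

```python
def filter_fasta(in_handle, out_handle, ac2taxids, blacklist):
    """Copy records whose accession maps to no blacklisted taxId.

    Returns (kept, removed) so the counts can be checked against the input.
    """
    kept = removed = 0
    keep = False
    for line in in_handle:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_ac = parts[0].split(".")[0] if parts else ""
            taxids = ac2taxids.get(seq_ac, ())
            # Drop the record if any of its taxIds is blacklisted.
            keep = not any(t in blacklist for t in taxids)
            if keep:
                kept += 1
            else:
                removed += 1
        if keep:
            out_handle.write(line)  # header and sequence lines of kept records
    return kept, removed
```

Returning both counts supports a QC check of the kind named above: kept plus removed should equal the number of records in the input file.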
61 changes: 61 additions & 0 deletions Release_note_May31_2017
@@ -0,0 +1,61 @@

************************************************************************
* Downloaded Files
************************************************************************
1. nt file downloaded on 5/21/2017
(ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/)

42,439,338 sequences


2. names.dmp downloaded on 5/21/2017
ftp://ftp.ncbi.nih.gov/pub/taxonomy/

2,383,434 names
1,601,859 scientific names


3. ac2taxid files
ftp://ftp.ncbi.nih.gov/pub/taxonomy/

#records file name
39,775,235 nucl_gss.accession2taxid.2017-05-21
122,045,527 nucl_gb.accession2taxid.2017-05-21
76,436,508 nucl_est.accession2taxid.2017-05-21
361,700,039 nucl_wgs.accession2taxid.2017-05-21
12,406,761 ac2taxid.2017-05-30/dead_nucl.accession2taxid
39,775,235 ac2taxid.2017-05-30/nucl_gss.accession2taxid
122,236,860 ac2taxid.2017-05-30/nucl_gb.accession2taxid
76,436,632 ac2taxid.2017-05-30/nucl_est.accession2taxid
66,696,868 ac2taxid.2017-05-30/dead_wgs.accession2taxid
381,019 ac2taxid.2017-05-30/pdb.accession2taxid
362,474,815 ac2taxid.2017-05-30/nucl_wgs.accession2taxid


************************************************************************
* Filter statistics
************************************************************************
The number of taxonomy ids in the black list is 378,341.

Sequences from a given black list of sources were removed. The list
of sources, with the number of associated taxonomic IDs and the number
of removed sequences for each, is given below.


blackListTaxonomyName    #taxids    #removed sequences
=====================    =======    ==================
unidentified                  49                    97
uncultured                     1                     2
unknown                      342                  1026
unspecified                   68                 11435
unclassified              182192                847187
other sequence             12666                233354
phage                       4594                  8445
environmental sample       50697               6398042
unknown-manually               1                     4
=====================    =======    ==================
total                     250610               7499592


The number of sequences in this filtered-nt release is
34,939,806
53 changes: 53 additions & 0 deletions check-ac2taxid-completion-step1.py
@@ -0,0 +1,53 @@
import csv
import glob

from Bio import SeqIO


__version__ = "1.0"
__status__ = "Dev"


###############################
def main():

    # Accession-to-taxid tables and the nt FASTA file from the 2017-05-21 download.
    patList = "/data/projects/targetdbs/filtered-nt/downloads/nucl_*.accession2taxid.2017-05-21"
    fileList = glob.glob(patList)
    ntFile = "/data/projects/targetdbs/filtered-nt/downloads/nt.2017-05-21"

    # Load every accession listed in nucl_gb.accession2taxid into memory.
    ac2taxid = {}
    for fileName in fileList:
        if fileName.find("nucl_gb.accession2taxid") >= 0:
            i = 0
            with open(fileName, 'r', newline='') as csvfile:
                csvreader = csv.reader(csvfile, delimiter='\t', quotechar='|')
                for row in csvreader:
                    seqAc = row[0].strip()
                    ac2taxid[seqAc] = 1
                    if i % 10000000 == 0:
                        print("Done loading", fileName, i)
                    i += 1

    # Log every nt accession (version suffix removed) that has no entry in the table.
    with open("/data/projects/targetdbs/filtered-nt/generated/logfile.step1.txt", "w") as FW:
        i = 0
        for record in SeqIO.parse(ntFile, "fasta"):
            seqAc = record.id
            seqAc = seqAc.split('.')[0]
            if seqAc not in ac2taxid:
                FW.write("No taxid found for: %s\n" % (seqAc))
            if i % 10000000 == 0:
                print("Done parsing", i)
            i += 1


if __name__ == '__main__':
    main()

54 changes: 54 additions & 0 deletions check-ac2taxid-completion-step2.py
@@ -0,0 +1,54 @@
import csv
import glob


__version__ = "1.0"
__status__ = "Dev"


###############################
def main():

    patList = "/data/projects/targetdbs/filtered-nt/downloads/nucl_*.accession2taxid.2017-05-21"
    fileList = glob.glob(patList)
    # Input is the log written by step 1 ("No taxid found for: <seqAc>" lines).
    ntFile = "/data/projects/targetdbs/filtered-nt/generated/logfile.step1.txt"

    # Collect the accessions still missing a taxid; value 1 means "not found yet".
    seqAcDic = {}
    with open(ntFile, 'r', newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=':', quotechar='|')
        for row in csvreader:
            seqAc = row[1].strip()
            seqAcDic[seqAc] = 1

    # Look the missing accessions up in every nucl_* table except nucl_gb.
    for fileName in fileList:
        if fileName.find("nucl_gb.accession2taxid") == -1:
            i = 0
            with open(fileName, 'r', newline='') as csvfile:
                csvreader = csv.reader(csvfile, delimiter='\t', quotechar='|')
                for row in csvreader:
                    seqAc = row[0].strip()
                    if seqAc in seqAcDic:
                        seqAcDic[seqAc] = 2  # found in this table
                    if i % 10000000 == 0:
                        print("Data reading", fileName, i)
                    i += 1

    # Log the accessions that are still unresolved after this pass.
    with open("/data/projects/targetdbs/filtered-nt/generated/logfile.step2.txt", "w") as FW:
        for key, val in seqAcDic.items():
            if val == 1:
                FW.write("No taxid found for: %s\n" % (key))


if __name__ == '__main__':
    main()

56 changes: 56 additions & 0 deletions check-ac2taxid-completion-step3.py
@@ -0,0 +1,56 @@
import csv
import glob
import os


__version__ = "1.0"
__status__ = "Dev"


###############################
def main():

    patList = "/data/projects/targetdbs/filtered-nt/downloads/ac2taxid.2017-05-30/*"
    fileList = glob.glob(patList)
    # Input is the log written by step 2, which step 2 stores under generated/.
    ntFile = "/data/projects/targetdbs/filtered-nt/generated/logfile.step2.txt"
    # Tables already checked in steps 1 and 2 are skipped here.
    passFile = ['nucl_wgs.accession2taxid', 'nucl_est.accession2taxid', 'nucl_gb.accession2taxid', 'nucl_gss.accession2taxid']

    # Collect the accessions still missing a taxid; value 1 means "not found yet".
    seqAcDic = {}
    with open(ntFile, 'r', newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=':', quotechar='|')
        for row in csvreader:
            seqAc = row[1].strip()
            seqAcDic[seqAc] = 1

    for fileName in fileList:
        i = 0
        if os.path.basename(fileName) not in passFile:
            isPdb = fileName.find("pdb.accession2taxid") >= 0
            with open(fileName, 'r', newline='') as csvfile:
                csvreader = csv.reader(csvfile, delimiter='\t', quotechar='|')
                for row in csvreader:
                    seqAc = row[0].strip()
                    if seqAc in seqAcDic:
                        seqAcDic[seqAc] = 2
                    # Some PDB accessions in nt carry a repeated final character;
                    # match them by duplicating the last character of the table entry.
                    if isPdb and seqAc and seqAc + seqAc[-1] in seqAcDic:
                        seqAcDic[seqAc + seqAc[-1]] = 2
                    if i % 10000000 == 0:
                        print("Data reading", fileName, i)
                    i += 1

    # Log the accessions that no table could resolve; these need a manual lookup.
    with open("/data/projects/targetdbs/filtered-nt/generated/logfile.step3.txt", "w") as FW:
        for key, val in seqAcDic.items():
            if val == 1:
                FW.write("No taxid found for: %s\n" % (key))


if __name__ == '__main__':
    main()
