kraken is a tool for rapid identification of species from short read data (Wood and Salzberg 2014).
It uses a kmer database, with kmers identified by their NCBI Taxon ID. It comes with a few
databases preconfigured, but sometimes they are not enough, and they are also significantly
out of date.
While kraken provides a limited API to build your own custom database, it leaves it
up to the user to find the sequences of interest and to rename them so
that the kraken-build tool can interpret the sequences correctly.
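As an illustration of the renaming step, the Kraken manual describes tagging a sequence ID with kraken:taxid|<taxon id> so that kraken-build can assign the sequence to a taxon. The Python sketch below rewrites a FASTA header that way. The function name tag_fasta_header, the accession, and the description are hypothetical, and whether kraken-trawl produces exactly this header layout is an assumption.

```python
def tag_fasta_header(header, taxid):
    """Return a FASTA header whose sequence ID carries a kraken:taxid tag.

    Hypothetical helper: the 'kraken:taxid|<id>' tag follows the Kraken
    manual; the rest of the layout is an assumption for illustration.
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    # Split the sequence ID (first word) from the free-text description.
    parts = header[1:].rstrip("\n").split(None, 1)
    if not parts:
        raise ValueError("FASTA header has no sequence ID")
    seq_id = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    tagged = ">%s|kraken:taxid|%d" % (seq_id, int(taxid))
    return (tagged + " " + description) if description else tagged


if __name__ == "__main__":
    # Made-up accession and description, used only to show the output shape.
    print(tag_fasta_header(">SEQ_0001.1 Example organism chromosome", 562))
```

Keeping the original sequence ID in front of the tag preserves traceability back to the downloaded record.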
kraken-trawl aims to automate, as much as possible, the creation of custom kraken
databases. The user only needs a list of species names of interest. kraken-trawl
will then download the genomes, if available, rename the sequences as required, and
place them in a folder structure appropriate for kraken-build.
The user can create a new DB from scratch, or add to an existing DB.
kraken-trawl depends on kraken. Download and installation instructions can
be found here.
The current version assumes you have aspera ascp installed and in the path.
It also assumes that you have an environment variable called ASPERA_KEY that
points to the aspera key for downloading from NCBI. For a guide on obtaining
ascp and a key from NCBI go here.
kraken-trawl also has the following Python dependencies:
python 2.7
BioPython
pandas >=0.19
click >=5
The easiest way of installing kraken-trawl is using pip:
pip install git+https://github.com/andersgs/kraken-trawl.git
Use the --user option to install locally (by default in $HOME/.local/bin):
pip install --user git+https://github.com/andersgs/kraken-trawl.git
Use --install-option to install the script in a particular location (say you have some other
folder on your PATH where you store scripts):
pip install --install-option="--install-scripts=$HOME/bin" --user git+https://github.com/andersgs/kraken-trawl.git
Once installed, type the following:
kraken-trawl --help
The current version relies on an environment variable, ASPERA_KEY, to work properly.
Either type the following before using kraken-trawl, or paste this line into your
bash_profile:
export ASPERA_KEY=/path/to/asperakey/aspera_key.openssh
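Before a long download it can help to confirm both prerequisites. The following is a minimal sketch, not part of kraken-trawl: check_aspera_setup is a hypothetical helper that looks for ascp on the PATH and checks that ASPERA_KEY points to an existing file. It uses shutil.which, so it assumes Python 3.3 or later, whereas kraken-trawl itself targets python 2.7.

```python
import os
import shutil


def check_aspera_setup(environ=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means the setup looks fine.

    Hypothetical helper; 'environ' and 'which' can be overridden for testing.
    """
    environ = os.environ if environ is None else environ
    problems = []
    # ascp must be discoverable on the PATH.
    if which("ascp") is None:
        problems.append("ascp was not found on the PATH")
    # ASPERA_KEY must be set and point to an existing file.
    key = environ.get("ASPERA_KEY")
    if not key:
        problems.append("ASPERA_KEY is not set")
    elif not os.path.isfile(key):
        problems.append("ASPERA_KEY points to a missing file: %s" % key)
    return problems


if __name__ == "__main__":
    for problem in check_aspera_setup():
        print("WARNING: " + problem)
```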
You may ask, why am I using environment variables for configuration? The reason is that this is currently considered best practice in tool development. More here: 12factor app.
Usage: kraken-trawl [OPTIONS]

Options:
  --assemb_file TEXT        assembly_summary file
  --kraken_db TEXT          name of kraken db
  -c, --create TEXT         create a new database at path <give path>
  --kraken_db_path TEXT     path to kraken db
  --taxon_list TEXT         give it a taxon list to filter the
                            assembly_summary.
  --include_human           include the human reference genome
  --filter_opt TEXT         Can be all, strict, moderate, or liberal.
  --outdir TEXT             Where to place downloaded genomes.
  --no_log                  Do NOT output a tab-delimited list of genomes added
  --do_not_inject_adapters  Do **not** add Illumina adapter and primer
                            sequences to DB
  --threads INTEGER         How many threads to use when building the database
  --kmer_len INTEGER        kmer length to use when building database
  --minz_len INTEGER        minimizer length to use when building database
  --clean                   clean db of unnecessary files after making the DB
  --help                    Show this message and exit.
- A series of folders, each containing a *.genomic.gbff.gz file. Because we are using aspera, if the folder structure is maintained, only novel genomes will be downloaded subsequently.
- An aspera_src_dest.txt file containing the SRC and DEST for each genome. This file is used for the aspera download.
- Individual genome *.fna files in the <kraken_db>/libraries/added folder.
- A log file, with tab-delimited information on each downloaded file:
  - Organism --- the organism name, including any strain names
  - Accession --- the accession id
  - RefSeq --- the RefSeq category, if any
  - Assembly Level --- the assembly level (i.e., Complete Genome, Chromosome, Scaffold, or Contig)
  - Source --- the location of the genbank file on the NCBI servers
  - Destination --- the local folder with the genbank file
- A missing_taxon.txt file --- a list of taxa for which no genomes were found
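The log file described above can be summarised with a few lines of Python. The sketch below counts downloaded genomes per assembly level using only the standard library. It assumes the six columns appear in the order listed and that the file has no header row (pass has_header=True otherwise); both are assumptions, and count_assembly_levels is a hypothetical helper, not part of kraken-trawl.

```python
import csv
from collections import Counter

# Column names as listed above; that they appear in this order is an assumption.
LOG_COLUMNS = ["Organism", "Accession", "RefSeq", "Assembly Level", "Source", "Destination"]


def count_assembly_levels(lines, has_header=False):
    """Count genomes per assembly level from an iterable of tab-delimited log lines."""
    reader = csv.DictReader(lines, fieldnames=LOG_COLUMNS, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row
    return Counter(row["Assembly Level"] for row in reader)


if __name__ == "__main__":
    # Made-up rows that only illustrate the assumed layout.
    demo = [
        "Organism A\tACC_1\tna\tComplete Genome\tsrc/a\tdest/a",
        "Organism B\tACC_2\tna\tContig\tsrc/b\tdest/b",
        "Organism C\tACC_3\tna\tComplete Genome\tsrc/c\tdest/c",
    ]
    for level, n in count_assembly_levels(demo).most_common():
        print("%s\t%d" % (level, n))
```

To run it on a real log, pass an open file handle instead of the demo list, since csv.DictReader accepts any iterable of lines.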
Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.