-
Notifications
You must be signed in to change notification settings - Fork 4
Reference package operations
TreeSAPP's reference packages are formatted as a single, compressed file known as a pickle (.pkl extension). Because they are compressed, they are not human-readable and therefore could be a bit of a black box.
With the command treesapp package
, these become not just viewable but also mutable. The three subcommands of treesapp package
are view
, edit
and rename
.
The usage of all three is very similar in that the same command-line arguments can be provided to these commands.
However, it is worth going over some key differences.
Supported versions >=0.11.0
usage: treesapp package [--overwrite] [-v] [-h] [{view,edit,rename}]
Facilitate operations on reference packages
positional arguments:
{view,edit,rename} A subcommand specifying the type of operation to
perform. `treesapp package rename` must be followed only
by 'prefix' and the new value. Example: `treesapp
package rename prefix Xyz -r path/to/Xyz_build.pkl`
Miscellaneous options:
--overwrite Overwrites previously written output files and
directories
-v, --verbose Prints a more verbose runtime log
-h, --help Show this help message and exit
treesapp package view
is used to read different attributes of a reference package.
By default these are printed to the console window via standard out, meaning these could be piped to other UNIX commands (e.g. awk and sed).
Run treesapp package view -h
to get the complete list of attributes that can be printed by view
. These are under the 'attributes' argument, as shown above.
view
is able to show multiple attributes from the same command by including different attribute names separated by spaces.
For example, to print the name, reference package code, TreeSAPP version, and description of the reference package found at Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl
$ treesapp package view prefix refpkg_code ts_version description -r Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl
prefix PuhA
refpkg_code P0001
ts_version 0.9.5
description
TreeSAPP has finished successfully.
Contents of core reference package files (e.g. the profile HMM, MSA and phylogenetic tree) can also be viewed, or redirected into new files.
treesapp package view tree -r Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl >PuhA_tree.nwk
Note: The tree's leaf labels are replaced with their respective sequence's description. The descriptions are start with the organism name then the sequence accession, separated with a '|' character. The 'msa' (multiple sequence alignment) attribute retains the internal identifiers (e.g. '1_PuhA').
treesapp package edit
is used to modify reference package attributes. If you've already used treesapp create
to build a reference package, chances are you've already used this to change the refpkg_code
attribute of your new reference package.
Unlike view
, edit
will only accept -- and therefore modify -- a single attribute. The edited reference package can be written to a new directory or overwrite the original reference package.
Warning: the reference package will be named based on the 'prefix' attribute and the argument '--output' specifies a directory path.
Therefore, writing to the same directory of the original reference package could have unintended consequences!
To write a new reference package pickle for PuhA with its 'description' attribute modified, you could use
treesapp package edit description 'Photosynthetic reaction center subunit H' -r Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl --output ~/Desktop/
Modifying the following attributes with treesapp package edit
is currently not supported: lineage_ids
, num_seqs
, pfit
, prefix
, refpkg_suffix
, sub_model
.
Editing file-based reference package attributes ('msa', 'profile', 'search_profile', 'tree' and 'model_info') is possible but not recommended. This can be accomplished by editing the attribute with the 'f__' prefix. For example, to change the search_profile attribute you could use:
treesapp package edit f__search_profile <path to new>.hmm -r <path to refpkg>/<prefix>_build..pkl
The ability to annotate multiple phenotypes (such as pathway, reaction or substrate) across the leaves in a reference package's tree is one of TreeSAPP's advantages. Unfortunately, there is no automated method for knowing these a priori so users are burdened with the task of manually curating such annotations. Reference packages can store multiple phenotypes in their 'feature_annotations' attribute, making them useful vehicles for storing and sharing annotations at many levels of sequence and taxonomic resolution.
These annotations also don't necessarily need to be based on phenotype. They could be as arbitrary as ecosystems they were isolated from, or whether they have been biochemically characterized.
There are two ways to go about this, both of which require a two-column table with the phenotype in the second column.
The first method involves compiling the phenotypes of the reference package's taxa. Each taxon is the first column in the table, and these do not need to be at the same taxonomic rank. The second column stores the phenotype property for its corresponding taxonomic group. Not all sequences need to be annotated in this phenotypes table. It is often the case that some sequences or even large groups of organisms have yet to be characterized. And that's okay. An example can be found for McrA in the RefPkgs repository, called Mcr_taxonomy-phenotype_map.tsv.
In our Mcr example, the phenotypes are the different alkane-transforming pathways (i.e. methanogenesis and methanotrophy). Here are the first few rows:
g__Methanothrix | Aceticlastic |
o__Methanomicrobiales | CO2-dependent hydrogenotrophy |
o__Methanocellales | CO2-dependent hydrogenotrophy |
Methanoflorens stordalenmirensis | CO2-dependent hydrogenotrophy |
p__Candidatus Verstraetearchaeota | CH3-dependent hydrogenotrophy |
As you can see some taxonomic labels have rank-prefixes (e.g. 'p__' to indicate the rank of Phylum), and some do not. The difference is those that do can refer to all reference leaves that belong to that lineage while those that do not (like 'Methanoflorens stordalenmirensis') can only refer to the leaves whose organism name match the taxonomic label. Additionally, to have even finer-grained control, you are able to use a sequence's accession ID instead of a taxonomic label*.
The annotations can then be associated with the references sequences belonging to each taxon by providing the table to the '--taxa_map' argument.
An example for McrA follows
treesapp package edit feature_annotations Pathway --taxa_map McrA_taxonomy-phenotype_map.tsv -r ProcessedData/TreeSAPP_output/updates/McrA_build.pkl
If your phenotype is linked taxonomically (i.e. there are not multiple different phenotypes present in the same taxonomic group), you are free to reuse this table whenever you create or update a reference package.
The second method attempts automation but there is still plenty of room for mistakes to be made. User feedback is appreciated!
Here, we will first use treesapp assign
to classify sequences that are known to have a certain phenotype.
Then we will create the table mapping query sequence placement edges to their phenotype.
treesapp assign -i <phenotype X>.fasta -o phenotype_X_assign/
Now, we can get the unique placement edges and write their phenotypes to a new TSV file
grep -v Query phenotype_X_assign/final_outputs/classifications.tsv | gawk -F"\t" '{ print $8 }' | sort -u | sed 's/$/\tPhenotype/g' >>RefPkg_phenotype_map.tsv
These steps would need to be repeated for each set of sequences that are linked to a phenotype. Once you have completed this you can finally run treesapp package edit
as before.
Users are able to edit the taxonomic lineages of their reference package after it has been created. This may be useful when beginning with NCBI-based lineages then switching to GTDB for better resolution, or demanding a genome-based taxonomy. TreeSAPP also truncates taxonomic lineages if a canonical rank is missing, such as with NCBI's lineage for Cyanobacteria. Editing the 'lineage_ids' attribute could restore a complete taxonomic lineage to these sequences. There are two options for updating a reference package's taxonomic lineages, to address small and large-scale changes.
When relatively few lineages need to be changed, treesapp package edit
can be provided a table with all sequence identifiers present in the reference package.
Begin by creating a table with the original lineages then editing this to ensure compatibility.
treesapp package view lineage_ids -r <path to refpkg dir>/<name>_build.pkl >RefPkg_lineage_ids.tsv
Once you've saved your lineage changes you can edit the reference package with RefPkg_lineage_ids.tsv. It's always a good idea to edited reference packages to a new directory in case tables contain a mistake. The following example will write the updated reference package to a directory called 'output'.
treesapp package edit lineage_ids RefPkg_lineage_ids.tsv -r <path to refpkg dir>/<name>_build.pkl -o output/
This example is meant for replacing multiple lineages with the least amount of effort. It requires a two-column tab or comma-delimited table for the old and new lineages, respectively. The table can have a header if the first character is a '#'.
The key difference in this edit
command is including '--join'.
This flag indicates a left-join be performed with the original taxonomic lineages.
By performing a left-join not all original lineages need to be in the first column of the table, leaving those omitted unchanged.
In the following example the file RefPkg_lineage_join.tsv would contain all lineages to be changed.
treesapp package edit lineage_ids RefPkg_lineage_join.tsv --join -o output/ -r <path to refpkg dir>/<name>_build.pkl
treesapp package rename
is a special reference package editing operation where multiple attributes are modified to match the new name (i.e. the 'prefix' attribute) of a reference package.
This prefix attribute is used in multiple other data collections within a reference package, such as the phylogeny's leaf node names and profile HMM name, so converting them all with treesapp package edit
could be tedious, if not problematic.
The rename
subcommand requires the 'prefix' attribute to still be specified.
Beyond this, it's nearly identical in usage to edit
.
treesapp package rename prefix Photo-R-C-S-H -r ~/Desktop/PuhA_build.pkl -o ~/Desktop/
This will create the new reference package pickle ~/Desktop/Photo-R-C-S-H_build.pkl
.
If you have other reference package operations that you'd like to see - create a new request in the GitHub Issues tracker!