Skip to content

Reference package operations

cmorganl edited this page Apr 27, 2021 · 8 revisions

Overview

TreeSAPP's reference packages are formatted as a single, compressed file known as a pickle (.pkl extension). Because they are compressed, they are not human-readable and therefore could be a bit of a black box.

With the command treesapp package, these become not just viewable but also mutable. The three subcommands of treesapp package are view, edit and rename. The usage of all three is very similar in that the same command-line arguments can be provided to these commands. However, it is worth going over some key differences.

Supported versions >=0.11.0


Table of Contents

General usage

usage: treesapp package [--overwrite] [-v] [-h] [{view,edit,rename}]

Facilitate operations on reference packages

positional arguments:
  {view,edit,rename}  A subcommand specifying the type of operation to
                      perform. `treesapp package rename` must be followed only
                      by 'prefix' and the new value. Example: `treesapp
                      package rename prefix Xyz -r path/to/Xyz_build.pkl`

Miscellaneous options:
  --overwrite         Overwrites previously written output files and
                      directories
  -v, --verbose       Prints a more verbose runtime log
  -h, --help          Show this help message and exit

View

Description

treesapp package view is used to read different attributes of a reference package. By default these are printed to the console window via standard out, meaning these could be piped to other UNIX commands (e.g. awk and sed).

Usage

Run treesapp package view -h to get the complete list of attributes that can be printed by view. These are under the 'attributes' argument, as shown above. view is able to show multiple attributes from the same command by including different attribute names separated by spaces. For example, to print the name, reference package code, TreeSAPP version, and description of the reference package found at Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl

$ treesapp package view prefix refpkg_code ts_version description -r Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl
prefix	PuhA
refpkg_code	P0001
ts_version	0.9.5
description	
TreeSAPP has finished successfully.

Contents of core reference package files (e.g. the profile HMM, MSA and phylogenetic tree) can also be viewed, or redirected into new files.

treesapp package view tree -r Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl >PuhA_tree.nwk

Note: The tree's leaf labels are replaced with their respective sequence's description. The descriptions are start with the organism name then the sequence accession, separated with a '|' character. The 'msa' (multiple sequence alignment) attribute retains the internal identifiers (e.g. '1_PuhA').

Edit

Description

treesapp package edit is used to modify reference package attributes. If you've already used treesapp create to build a reference package, chances are you've already used this to change the refpkg_code attribute of your new reference package.

Usage

Unlike view, edit will only accept -- and therefore modify -- a single attribute. The edited reference package can be written to a new directory or overwrite the original reference package. Warning: the reference package will be named based on the 'prefix' attribute and the argument '--output' specifies a directory path. Therefore, writing to the same directory of the original reference package could have unintended consequences!

To write a new reference package pickle for PuhA with its 'description' attribute modified, you could use

treesapp package edit description 'Photosynthetic reaction center subunit H' -r Photosynthesis/PuhA/seed_refpkg/final_outputs/PuhA_build.pkl --output ~/Desktop/

Modifying the following attributes with treesapp package edit is currently not supported: lineage_ids, num_seqs, pfit, prefix, refpkg_suffix, sub_model.

Files

Editing file-based reference package attributes ('msa', 'profile', 'search_profile', 'tree' and 'model_info') is possible but not recommended. This can be accomplished by editing the attribute with the 'f__' prefix. For example, to change the search_profile attribute you could use:

treesapp package edit f__search_profile <path to new>.hmm -r <path to refpkg>/<prefix>_build..pkl

Phenotype annotations

The ability to annotate multiple phenotypes (such as pathway, reaction or substrate) across the leaves in a reference package's tree is one of TreeSAPP's advantages. Unfortunately, there is no automated method for knowing these a priori so users are burdened with the task of manually curating such annotations. Reference packages can store multiple phenotypes in their 'feature_annotations' attribute, making them useful vehicles for storing and sharing annotations at many levels of sequence and taxonomic resolution.

These annotations also don't necessarily need to be based on phenotype. They could be as arbitrary as ecosystems they were isolated from, or whether they have been biochemically characterized.

There are two ways to go about this, both of which require a two-column table with the phenotype in the second column.

Taxonomy-linkage

The first method involves compiling the phenotypes of the reference package's taxa. Each taxon is the first column in the table, and these do not need to be at the same taxonomic rank. The second column stores the phenotype property for its corresponding taxonomic group. Not all sequences need to be annotated in this phenotypes table. It is often the case that some sequences or even large groups of organisms have yet to be characterized. And that's okay. An example can be found for McrA in the RefPkgs repository, called Mcr_taxonomy-phenotype_map.tsv.

In our Mcr example, the phenotypes are the different alkane-transforming pathways (i.e. methanogenesis and methanotrophy). Here are the first few rows:

g__Methanothrix Aceticlastic
o__Methanomicrobiales CO2-dependent hydrogenotrophy
o__Methanocellales CO2-dependent hydrogenotrophy
Methanoflorens stordalenmirensis CO2-dependent hydrogenotrophy
p__Candidatus Verstraetearchaeota CH3-dependent hydrogenotrophy

As you can see some taxonomic labels have rank-prefixes (e.g. 'p__' to indicate the rank of Phylum), and some do not. The difference is those that do can refer to all reference leaves that belong to that lineage while those that do not (like 'Methanoflorens stordalenmirensis') can only refer to the leaves whose organism name match the taxonomic label. Additionally, to have even finer-grained control, you are able to use a sequence's accession ID instead of a taxonomic label*.

The annotations can then be associated with the references sequences belonging to each taxon by providing the table to the '--taxa_map' argument.

An example for McrA follows

treesapp package edit feature_annotations Pathway --taxa_map McrA_taxonomy-phenotype_map.tsv -r ProcessedData/TreeSAPP_output/updates/McrA_build.pkl

If your phenotype is linked taxonomically (i.e. there are not multiple different phenotypes present in the same taxonomic group), you are free to reuse this table whenever you create or update a reference package.

Phylogeny-linkage

The second method attempts automation but there is still plenty of room for mistakes to be made. User feedback is appreciated!

Here, we will first use treesapp assign to classify sequences that are known to have a certain phenotype. Then we will create the table mapping query sequence placement edges to their phenotype.

treesapp assign -i <phenotype X>.fasta -o phenotype_X_assign/

Now, we can get the unique placement edges and write their phenotypes to a new TSV file

grep -v Query phenotype_X_assign/final_outputs/classifications.tsv | gawk -F"\t" '{ print $8 }' | sort -u | sed 's/$/\tPhenotype/g' >>RefPkg_phenotype_map.tsv

These steps would need to be repeated for each set of sequences that are linked to a phenotype. Once you have completed this you can finally run treesapp package edit as before.

Reference sequence lineages

Users are able to edit the taxonomic lineages of their reference package after it has been created. This may be useful when beginning with NCBI-based lineages then switching to GTDB for better resolution, or demanding a genome-based taxonomy. TreeSAPP also truncates taxonomic lineages if a canonical rank is missing, such as with NCBI's lineage for Cyanobacteria. Editing the 'lineage_ids' attribute could restore a complete taxonomic lineage to these sequences. There are two options for updating a reference package's taxonomic lineages, to address small and large-scale changes.

Small-scale

When relatively few lineages need to be changed, treesapp package edit can be provided a table with all sequence identifiers present in the reference package. Begin by creating a table with the original lineages then editing this to ensure compatibility.

treesapp package view lineage_ids -r <path to refpkg dir>/<name>_build.pkl >RefPkg_lineage_ids.tsv

Once you've saved your lineage changes you can edit the reference package with RefPkg_lineage_ids.tsv. It's always a good idea to edited reference packages to a new directory in case tables contain a mistake. The following example will write the updated reference package to a directory called 'output'.

treesapp package edit lineage_ids RefPkg_lineage_ids.tsv -r <path to refpkg dir>/<name>_build.pkl -o output/

Large-scale

This example is meant for replacing multiple lineages with the least amount of effort. It requires a two-column tab or comma-delimited table for the old and new lineages, respectively. The table can have a header if the first character is a '#'.

The key difference in this edit command is including '--join'. This flag indicates a left-join be performed with the original taxonomic lineages. By performing a left-join not all original lineages need to be in the first column of the table, leaving those omitted unchanged. In the following example the file RefPkg_lineage_join.tsv would contain all lineages to be changed.

treesapp package edit lineage_ids RefPkg_lineage_join.tsv --join -o output/ -r <path to refpkg dir>/<name>_build.pkl

Rename

Description

treesapp package rename is a special reference package editing operation where multiple attributes are modified to match the new name (i.e. the 'prefix' attribute) of a reference package. This prefix attribute is used in multiple other data collections within a reference package, such as the phylogeny's leaf node names and profile HMM name, so converting them all with treesapp package edit could be tedious, if not problematic.

Usage

The rename subcommand requires the 'prefix' attribute to still be specified. Beyond this, it's nearly identical in usage to edit.

treesapp package rename prefix Photo-R-C-S-H -r ~/Desktop/PuhA_build.pkl -o ~/Desktop/

This will create the new reference package pickle ~/Desktop/Photo-R-C-S-H_build.pkl.

New features

If you have other reference package operations that you'd like to see - create a new request in the GitHub Issues tracker!