Integrating MAGs and SAGs in reference packages

Introduction

Reference packages are typically built from publicly-available protein sequences, accessioned in the NCBI or available is databases such as EggNOG and PFam. Only using data from these repositories, however, would not adequately capture the known diversity of Earth's biomes. With large evolutionary distances between the query and reference sequences, the majority of taxonomic assignments would be poorly-resolved. To properly address this, sequences from organisms that have either yet to be cultured or published may need to be integrated into select reference packages. In this tutorial we will use genomes with GTDB-tk taxonomic annotations to update the reference package for the alpha subunit of methyl coenzyme-M reductase (McrA).

Beginning with downloading all required data, we will used simple shell commands to create the input required for treesapp update.

Even though we'll be working with metagenome-assembled genomes (MAGs) during this tutorial, the same workflow would work if you wanted to update a reference package using single-cell amplified genomes (SAGs).

Supported version(s) >=0.10.0.

Prerequisites

This tutorial relies on a singularity or docker container for TreeSAPP >= 0.10.0. Alternatively, you can complete the tutorial with TreeSAPP installed via conda or pip by running commands with the singularity exec treesapp.sif parts removed.

Running the tutorial will require approximately 2 Gb of disk space. We will not be running GTDB-tk as part of the tutorial

To get started, create a directory to work in and move into it.

mkdir ts_tutorial; cd ts_tutorial/

Pull the singularity file (if you haven't already) into this directory and check its version.

singularity pull --arch amd64 library://cmorganl/default/treesapp:latest
mv treesapp_latest.sif treesapp.sif
singularity exec treesapp.sif treesapp info

Next, we need to download the fasta files we'll be using to update the McrA reference package. I've selected five recent publications that describe new organisms with the potential for C1-metabolism. These studies sequenced metagenomes and binned contigs into MAGs. All studies also identified at least one novel McrA homolog in a genome, so let's hope we can reproduce this using TreeSAPP's McrA reference package.

Click on each of the organisms to initiate downloading its genome assembly from GenBank.

Organism	Environment	Study
Archaeoglobi archaeon	Juan de Fuca Ridge Subseafloor	1
Methanoperedenaceae	Bioreactor	2
Methanoperedenaceae	Bioreactor	2
Candidatus Korarchaeum cryptofilum	Geothermal spring in Yellowstone National Park	3
Candidatus Methanodesulfokores washburnensis	Geothermal spring in Yellowstone National Park	3

The FASTA headers will need to be reformatted so TreeSAPP can link the GTDB genome assignments to each query sequence. This can be accomplished with a bash one-liner:

for f in *gz; do name=$( echo $f | sed 's/_genomic.fna.gz//g' ); gunzip -c $f | sed "s/^>/>${name}_/g" >$name.fasta; done

Feel free to remove the original gzipped assemblies with rm *_genomic.fna.gz.

The original reference package that we will update is available for download here.

We'll also need the original FASTA file used for creating the seed reference package.

With ls, the contents of your directory should look like this:

GCA_003935005.1_ASM393500v1.fasta  GCA_003948265.1_ASM394826v1.fasta   GCA_012026835.1_ASM1202683v1.fasta  treesapp.sif
GCA_003947435.1_ASM394743v1.fasta  GCA_012026795.1_ASM1202679v1.fasta  refpkgs   McrA_candidate_references_2020-01-22_uclust99.faa

Classifying genome sequences

As discussed here and here, treesapp update relies on the outputs from treesapp assign to determine which sequences to update a reference package with.

Since we've renamed the sequences in each of the FASTA files, linking them to their genomes we can concatenate them for more efficient processing - running treesapp assign once, rather than for each genome.

cat GCA*fasta >archaea_mcra_queries.fasta

Now, we can proceed to run treesapp assign on these genomes

treesapp assign -n 4 --trim_align --refpkg_dir refpkgs/ -i archaea_mcra_queries.fasta -o treesapp_assign_output/

You can browse the classification table produced using less to see whether McrA sequences were found in any or all of the genomes.

less treesapp_assign_output/final_outputs/marker_contig_map.tsv

Five McrA sequences were identified but it looks like McrA was not found in the Candidatus Korarchaeum cryptofilum assembly (GCA_003948265.1_ASM394826v1). Fortunately, the authors did not suggest this organism contained McrA either. Also, we were able to find two copies of McrA in the Archaeoglobi archaeon, an observation also noted by the authors of that study.

Working with GTDB-tk output

GTDB² is an exciting project aiming to improve the taxonomy for Archaea and Bacteria through standardized genome-based comparisons. While adhering to the currently accepted taxonomic hierarchy, the developers have defined a reference tree of life upon which query genomes can be placed and assigned a putative taxonomic label according to their evolutionary distance to the most closely related references. These distance measurements are from a variety of sources, including phylogenetic placement and MinHash. Any genome can be queried by using GTDB-tk³, and this is how many people are assigning taxonomic labels to their SAGs and MAGs. Likewise, I have used GTDB-tk to assign taxonomic lineages to the MAGs we're using in this tutorial.

Unfortunately, the computational resources required to run GTDB-tk far exceed a typical laptop, so the outputs have been pre-baked. The command used was

gtdbtk classify_wf --genome_dir archaea_genomes/ --out_dir archaea_genomes/gtdbtk_output/ -x fasta --cpus 4

The file we will need to update the McrA reference package is gtdbtk.ar122.summary.tsv. We can pull out the fields we need using the awk command and reformat some of the fields with sed. Typically, there is an accompanying summary.tsv file for the bacterial genomes but since the genomes we're using to update the tree are all Archaea we only need the archaeal summary.tsv file.

awk -F"\t" '{ OFS="\t"; print $1,$2 }' gtdbtk.ar122.summary.tsv | \
sed 's/user_genome/Organism/g' | \
sed 's/classification/Lineage/g' | \
sed 's/;/; /g' >gtdb_classifications.tsv

We now have everything we need to update the original reference package.

Updating McrA

At this point, adding these new sequences is one command away.

Just like when we update the reference package with sequences that are accessioned in the NCBI, or have NCBI taxid labels, we need to use the --skip_assign flag. This lets TreeSAPP know it should look in the --seqs2lineage file for taxonomic labels of the five new candidate reference sequences.

treesapp update \
--refpkg_path refpkgs/McrA_build.pkl \
--treesapp_output treesapp_assign_output/ \
--output McrA_update/ \
--seqs2lineage gtdb_classifications.tsv \
--fastx_input McrA_candidate_references_2020-01-22_uclust99.faa \
--cluster -m prot --fast --skip_assign --delete --headless

Three out of the five candidate sequences were included in the updated McrA reference package, bringing the total number of reference sequences up from 249 to 252.

Feel free to identify which ones were included using treesapp package view.

References

Boyd, J. A., Jungbluth, S. P., Leu, A. O., Evans, P. N., Woodcroft, B. J., Chadwick, G. L., … Tyson, G. W. (2019). Divergent methyl-coenzyme M reductase genes in a deep-subseafloor Archaeoglobi. The ISME Journal, doi: https://doi.org/10.1101/390617. https://doi.org/10.1038/s41396-018-0343-2
Leu, A. O., Cai, C., McIlroy, S. J., Southam, G., Orphan, V. J., Yuan, Z., … Tyson, G. W. (2020). Anaerobic methane oxidation coupled to manganese reduction by members of the Methanoperedenaceae. ISME Journal. https://doi.org/10.1038/s41396-020-0590-x
McKay, L. J., Dlakić, M., Fields, M. W., Delmont, T. O., Eren, A. M., Jay, Z. J., … Inskeep, W. P. (2019). Co-occurring genomic capacity for anaerobic methane and dissimilatory sulfur metabolisms discovered in the Korarchaeota. Nature Microbiology, 4(April). https://doi.org/10.1038/s41564-019-0362-4
Parks, D. H., Chuvochina, M., Waite, D. W., Rinke, C., Skarshewski, A., Chaumeil, P.-A., & Hugenholtz, P. (2018). A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, 36(10), 996–1004. https://doi.org/10.1038/nbt.4229
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P., & Parks, D. H. (2019). GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, 36(November), 1–3. https://doi.org/10.1093/bioinformatics/btz848 Alves, R. J. E., Minh, B. Q., Urich, T., von Haeseler, A., & Schleper, C. (2018). Unifying the global phylogeny and environmental distribution of ammonia-oxidising archaea based on amoA genes. Nature Communications, 9(1), 1517. https://doi.org/10.1038/s41467-018-03861-1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly