Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)

biosustain/bgcflow

BGCFlow

BGCFlow is a systematic workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes) from internal & public datasets.

At present, BGCFlow is only tested and confirmed to work on Linux systems with the conda/mamba package manager.

Publication

Matin Nuhamunada, Omkar S. Mohite, Patrick V. Phaneuf, Bernhard O. Palsson, and Tilmann Weber. (2023). BGCFlow: Systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets. bioRxiv 2023.06.14.545018; doi: https://doi.org/10.1101/2023.06.14.545018

Pre-requisites

BGCFlow requires gcc and the conda/mamba package manager. See the installation instructions for details.

Please use the latest version of BGCFlow available.
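
The requirements above (Linux, gcc, and conda or mamba) can be checked before installing anything. The following is a minimal sketch and not part of BGCFlow: the function names `have` and `check_prerequisites` are hypothetical, and the script only tests that the commands are on the PATH, not that their versions are suitable.

```shell
# Report whether a command is available on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Check the prerequisites named in this README; return non-zero if any is missing.
check_prerequisites() {
    rc=0
    # BGCFlow is only tested on Linux, so warn (but do not fail) elsewhere.
    if [ "$(uname -s)" != "Linux" ]; then
        echo "warning: BGCFlow is only tested on Linux (found $(uname -s))"
    fi
    # gcc is required.
    if have gcc; then
        echo "ok: gcc"
    else
        echo "missing: gcc"
        rc=1
    fi
    # Either conda or mamba is required.
    if have conda || have mamba; then
        echo "ok: conda/mamba"
    else
        echo "missing: conda or mamba"
        rc=1
    fi
    return "$rc"
}

check_prerequisites || echo "install the missing prerequisites before continuing"
```

The check returns a non-zero status when something is missing, so it can also be used as a guard at the top of a setup script.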

Quick Start

The quickest way to use BGCFlow is through its command-line interface wrapper, bgcflow-wrapper:

  1. Create a conda environment and install the BGCFlow Python wrapper:
# create and activate a new conda environment
conda create -n bgcflow -c conda-forge python=3.11 pip openjdk -y # also install java for metabase
conda activate bgcflow

# install `BGCFlow` wrapper
pip install bgcflow_wrapper

# make sure to use bgcflow_wrapper version >= 0.2.7
bgcflow --version
  2. Additional pre-requisites: with the environment activated, set up the following configuration:
  • Disable strict conda channel priority (the command below sets channel_priority to disabled):
conda config --set channel_priority disabled
conda config --describe channel_priority
  3. Deploy and run BGCFlow. The commands below clone into a directory named bgcflow; change that name as needed:
# Deploy and run BGCFlow
bgcflow clone bgcflow # clone `BGCFlow` into a directory named bgcflow
cd bgcflow # move to bgcflow directory
bgcflow init # initiate `BGCFlow` config and examples from template
bgcflow run -n # do a dry run, remove the flag "-n" to run the example dataset
  4. Build and serve the interactive report once bgcflow run has finished. The report is served at http://localhost:8001/. A demo is available as the example report.

# build a report
bgcflow build report

# show available projects
bgcflow serve

# serve interactive report
bgcflow serve --project Lactobacillus_delbrueckii
  • For detailed usage and configuration, see the wiki.

  • Read more about bgcflow_wrapper for a detailed overview of the command line interface.
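
The version requirement from step 1 and the channel priority from step 2 can be verified with a small script. This is a minimal sketch and not part of BGCFlow: the helper names `version_ge` and `parse_channel_priority` are hypothetical, it assumes GNU `sort -V` is available, and it assumes that `bgcflow --version` prints the version number as the last field of its first output line.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Uses `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract the value from a line such as "channel_priority: disabled",
# the format printed by `conda config --show channel_priority`.
parse_channel_priority() {
    sed -n 's/^channel_priority:[[:space:]]*//p'
}

required="0.2.7"
if command -v bgcflow >/dev/null 2>&1; then
    # Assumption: the version is the last field on the first output line.
    installed="$(bgcflow --version 2>/dev/null | awk 'NR==1 {print $NF}')"
    if version_ge "$installed" "$required"; then
        echo "ok: bgcflow_wrapper $installed"
    else
        echo "upgrade needed: found '$installed', need >= $required"
    fi
else
    echo "bgcflow not found; activate the bgcflow conda environment first"
fi

if command -v conda >/dev/null 2>&1; then
    priority="$(conda config --show channel_priority 2>/dev/null | parse_channel_priority)"
    echo "conda channel_priority: ${priority:-unknown}"
fi
```

If the upgrade message appears, running `pip install --upgrade bgcflow_wrapper` inside the activated environment updates the wrapper.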

Workflow overview

The main Snakefile workflow comprises pipelines for data selection, functional annotation, phylogenetic analysis, genome mining, and comparative genomics of prokaryotic datasets.

Available pipelines in the main Snakefile can be checked using the following command:

bgcflow pipelines

List of Available Pipelines

The table below lists the pipeline keywords that you can run using the main Snakefile of BGCFlow.

| # | Keyword | Description | Links |
|---|---------|-------------|-------|
| 0 | eggnog | Annotate samples with eggNOG database (http://eggnog5.embl.de) | eggnog-mapper |
| 1 | mash | Calculate distance estimation for all samples using MinHash. | Mash |
| 2 | fastani | Do pairwise Average Nucleotide Identity (ANI) calculation across all samples. | FastANI |
| 3 | automlst-wrapper | Simplified Tree building using autoMLST | automlst-simplified-wrapper |
| 4 | roary | Build pangenome using Roary. | Roary |
| 5 | eggnog-roary | Annotate Roary output using eggNOG mapper | eggnog-mapper |
| 6 | seqfu | Calculate sequence statistics using SeqFu. | seqfu2 |
| 7 | bigslice | Cluster BGCs using BiG-SLiCE (https://github.com/medema-group/bigslice) | bigslice |
| 8 | query-bigslice | Map BGCs to BiG-FAM database (https://bigfam.bioinformatics.nl/) | bigfam.bioinformatics.nl |
| 9 | checkm | Assess genome quality with CheckM. | CheckM |
| 10 | gtdbtk | Taxonomic placement with GTDB-Tk | GTDBTk |
| 11 | prokka-gbk | Copy annotated genbank results. | prokka |
| 12 | antismash | Summarizes antiSMASH result. | antismash |
| 13 | arts | Run Antibiotic Resistant Target Seeker (ARTS) on samples. | arts |
| 14 | deeptfactor | Use deep learning to find Transcription Factors. | deeptfactor |
| 15 | deeptfactor-roary | Use DeepTFactor on Roary outputs. | Roary |
| 16 | cblaster-genome | Build diamond database of genomes for cblaster search. | cblaster |
| 17 | cblaster-bgc | Build diamond database of BGCs for cblaster search. | cblaster |
| 18 | bigscape | Cluster BGCs using BiG-SCAPE | BiG-SCAPE |
| 19 | gecco | GEne Cluster prediction with COnditional random fields. | GECCO |

Development & Funding

The development of BGCFlow began in the Natural Products Genome Mining research group at the Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark (DTU Biosustain). BGCFlow development has been made possible through the generous support of the following funding organizations:

  • Novo Nordisk Foundation: BGCFlow development was supported by grants from the Novo Nordisk Foundation, specifically [NNF20CC0035580] and [NNF16OC0021746]. Matin Nuhamunada received support from the NNF Copenhagen Bioscience PhD Program, grant [NNF20SA0035588].

  • Danish National Research Foundation: Additional funding was provided by the Danish National Research Foundation for the Center for Microbial Secondary Metabolites (CeMiSt), under the grant [DNRF137].

References

  • Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.
  • Mash Screen: high-throughput sequence containment estimation for genome discovery. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. Genome Biol. 2019 Nov 5;20(1):232. doi: 10.1186/s13059-019-1841-x.
  • Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9, 5114 (2018). https://doi.org/10.1038/s41467-018-07641-9
  • Mohammad Alanjary, Katharina Steinke, Nadine Ziemert, AutoMLST: an automated web server for generating multi-locus species trees highlighting natural product potential, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W276–W282
  • Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, 'Roary: Rapid large-scale prokaryote pan genome analysis', Bioinformatics, 2015;31(22):3691-3693 doi:10.1093/bioinformatics/btv421
  • eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Carlos P. Cantalapiedra, Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021. Molecular Biology and Evolution, msab293
  • eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085
  • Telatin, A., Birolo, G., & Fariselli, P. SeqFu [Computer software]. GITHUB: https://github.com/telatin/seqfu2
  • Satria A Kautsar, Kai Blin, Simon Shaw, Tilmann Weber, Marnix H Medema, BiG-FAM: the biosynthetic gene cluster families database, Nucleic Acids Research, gkaa812, https://doi.org/10.1093/nar/gkaa812
  • Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154.
  • Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25: 1043-1055.
  • Chaumeil PA, et al. 2019. GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, btz848.
  • Parks DH, et al. 2020. A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.
  • Parks DH, et al. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, http://dx.doi.org/10.1038/nbt.4229.
  • Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063
  • antiSMASH 6.0: improving cluster detection and comparison capabilities. Kai Blin, Simon Shaw, Alexander M Kloosterman, Zach Charlop-Powers, Gilles P van Wezel, Marnix H Medema, & Tilmann Weber. Nucleic Acids Research (2021) doi: 10.1093/nar/gkab335.
  • Mungan,M.D., Alanjary,M., Blin,K., Weber,T., Medema,M.H. and Ziemert,N. (2020) ARTS 2.0: feature updates and expansion of the Antibiotic Resistant Target Seeker for comparative genome mining. Nucleic Acids Res.,10.1093/nar/gkaa374
  • Alanjary,M., Kronmiller,B., Adamek,M., Blin,K., Weber,T., Huson,D., Philmus,B. and Ziemert,N. (2017) The Antibiotic Resistant Target Seeker (ARTS), an exploration engine for antibiotic cluster prioritization and novel drug target discovery. Nucleic Acids Res.,10.1093/nar/gkx360
  • Kim G.B., Gao Y., Palsson B.O., Lee S.Y. 2020. DeepTFactor: A deep learning-based tool for the prediction of transcription factors. PNAS. doi: 10.1073/pnas.2021171118
  • Gilchrist, C., Booth, T. J., van Wersch, B., van Grieken, L., Medema, M. H., & Chooi, Y. (2021). cblaster: a remote search tool for rapid identification and visualisation of homologous gene clusters (Version 1.3.9) [Computer software]. https://doi.org/10.1101/2020.11.08.370601
  • Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
  • Navarro-Muñoz, J.C., Selem-Mojica, N., Mullowney, M.W. et al. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol 16, 60–68 (2020)
  • Kai Blin, Simon Shaw, Hannah E Augustijn, Zachary L Reitz, Friederike Biermann, Mohammad Alanjary, Artem Fetter, Barbara R Terlouw, William W Metcalf, Eric J N Helfrich, Gilles P van Wezel, Marnix H Medema, Tilmann Weber, antiSMASH 7.0: new and improved predictions for detection, regulation, chemical structures and visualisation, Nucleic Acids Research, Volume 51, Issue W1, 5 July 2023, Pages W46–W50, https://doi.org/10.1093/nar/gkad344
  • Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller. 2021. Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509
