Download and pre-process genomes for the Scrambling in the Tree of Life project

Genome download and pre-processing pipeline

Introduction

oist/LuscombeU_stlpreprocess is a bioinformatics pipeline to …

  1. Extract chromosomal scaffolds from the assembly file (discard unplaced, alternate, organelle and plasmid sequences, etc.).
  2. Unmask the genome (to be re-masked later by another local pipeline).
  3. Extract complete organelle genomes from the assembly file (they might be useful later as an internal control).
  4. Summarise the occurrence of the first two letters of the accession numbers, to ease future changes of the grepping pattern for whole-chromosome scaffolds.
  5. Record the names of the contigs, for instance to check if sex chromosomes are missing from the assembly.
  6. Re-compress the assemblies with bgzip, for future uses such as CRAM compression.
  7. Show assembly statistics such as GC content and contig length in the MultiQC report, computed with assembly-scan (https://github.com/rpetit3/assembly-scan).
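
Step 4 can be approximated by hand with standard tools. The sketch below is not the pipeline's own code: it assumes a gzip- or bgzip-compressed FASTA file whose header lines start with the accession number, and the function name summarise_prefixes is hypothetical.

```shell
# Count how often each two-letter accession prefix occurs in a FASTA file.
# Assumption: header lines look like ">CM012345.1 some description".
summarise_prefixes() {
    gzip -dc "$1" |        # decompress to stdout (bgzip output is gzip-compatible)
        grep '^>' |        # keep header lines only
        cut -c2-3 |        # characters 2-3 are the first two letters of the accession
        sort | uniq -c |   # count each prefix
        sort -k1,1nr       # most frequent prefix first
}
```

Called as `summarise_prefixes genome.fa.gz`, it prints one line per prefix with its count, which gives a quick view of which prefixes a new grepping pattern would need to cover.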

After running this pipeline, you can follow with repeat masking using https://github.com/oist/LuscombeU_stlrepeatmask.

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.tsv:

id	file
genome1	/path/to/genome/file.fastq.gz
genome2	https://url.example.com/to/genome/file.fastq.gz
…
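
Before launching a run, it can help to check the samplesheet layout. The following is a minimal sketch, assuming the tab-separated two-column format with the id and file header shown above; the function name check_samplesheet is hypothetical and the check is not part of the pipeline.

```shell
# Check that a samplesheet has the "id<TAB>file" header and exactly two
# non-empty, tab-separated columns on every following line.
check_samplesheet() {
    awk -F '\t' '
        NR == 1 {
            if (NF != 2 || $1 != "id" || $2 != "file") { print "bad header"; bad = 1 }
            next
        }
        NF != 2 || $1 == "" || $2 == "" {
            print "line " NR ": expected 2 non-empty columns"; bad = 1
        }
        END { exit bad + 0 }   # non-zero exit status if any problem was found
    ' "$1"
}
```

For example, `check_samplesheet samplesheet.tsv && echo OK` prints OK only when every line passes; a common failure is a line separated by spaces instead of a tab.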

Now, you can run the pipeline using:

nextflow run oist/LuscombeU_stlpreprocess -r master \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.tsv \
   --outdir <OUTDIR>

The -r master option selects the branch or version of the pipeline. Alternatives are -r dev for the latest development version, or a version number such as -r 3.0.0.

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Resource usage

  • On annelids, assembly-scan used at most 2 GB of memory. Filtering is now very lean, using less than 300 MB. All tasks completed in less than 40 min.

Use the --assemblyscan_memory parameter to give more memory to assembly-scan; the default is 6.GB. If only some of the genomes are large, first let the pipeline process the small ones with the default value, then run it again with -resume and a larger --assemblyscan_memory.
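
The two-pass approach can be sketched as follows. The helper only prints the two commands so that they can be reviewed before being run; the singularity profile, the results directory and the 12.GB value are placeholders chosen for illustration, and the function name two_pass_commands is hypothetical.

```shell
# Print the commands for a "default memory first, more memory second" run.
# Assumptions: profile, output directory and the 12.GB value are placeholders.
two_pass_commands() {
    base="nextflow run oist/LuscombeU_stlpreprocess -r master -profile singularity --input samplesheet.tsv --outdir results"
    echo "$base"                                       # pass 1: default 6.GB
    echo "$base -resume --assemblyscan_memory 12.GB"   # pass 2: resume with more memory
}
two_pass_commands
```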

Pattern and exceptions

The current pattern, CM|CP|FR|L[R-T]|NC|NZ|O[U-Z], matches complete chromosome scaffolds, plasmids and organelles almost exclusively. However, there are exceptions:

  • Drosophila melanogaster's GCA_000001215 uses AE for chromosome scaffolds and CP for chrY and unplaced scaffolds.
  • Brassica rapa's GCA_900412535.3 uses LS for chromosomes and OV for shotgun scaffolds.
  • Brassica oleracea's GCA_900416815 uses LS for chromosomes and OW for shotgun scaffolds.
  • Strongyloides ratti's GCA_001040885 has only LN for both chromosome and unplaced scaffold sequences.
  • Caenorhabditis inopinata GCA_003052745.1: AP.
  • Caenorhabditis elegans GCA_000002985.3: BX.
  • AE is rare and appears to be found only in chromosome sequences of old assemblies such as GCA_000001215.4 (D. melanogaster), GCA_000008565.1 (Deinococcus radiodurans), or GCA_000008125.1 (T. thermophilus). However, it also occurs in unplaced sequences of GCA_000309985.3 (Brassica rapa). Altogether, it is better not to allow it.
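
To see how the pattern behaves on the prefixes discussed above, it can be tried on a few accession numbers. This is a sketch only: it assumes that the pattern is applied to the start of each accession, which may differ from what the pipeline does internally, and the accession numbers and the function name matches_pattern are made up for illustration.

```shell
# Classify accession numbers with the current pattern.
# Assumption: the pattern is anchored at the start of the accession.
pattern='^(CM|CP|FR|L[R-T]|NC|NZ|O[U-Z])'

matches_pattern() {
    # LC_ALL=C restricts the [R-T] and [U-Z] ranges to plain ASCII letters.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq "$pattern"
}

# Made-up accessions, one per prefix of interest.
for acc in CM000001.1 LS000001.1 OV000001.1 AE000001.1 LN000001.1 AP000001.1; do
    if matches_pattern "$acc"; then echo "$acc kept"; else echo "$acc skipped"; fi
done
```

With this anchoring, LS and OV accessions are both kept, which is how shotgun scaffolds of the Brassica assemblies slip through, while AE, LN and AP accessions are skipped.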

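To find the genomes for which nothing was extracted, try the following command. It builds the expected output file name for each genome and lets ls report the missing ones as errors: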

basename -s .patterns.txt *.patterns.txt | sed 's/$/.chromosomes_unmasked.fa.gz/' | xargs ls > /dev/null
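
The same check can be written as a loop that prints only the identifiers of the affected genomes on standard output. This is a sketch, assuming the file naming used in the command above and that it is run in the directory holding these files; the function name list_missing is hypothetical.

```shell
# Print the identifier of every genome that has a .patterns.txt file
# but no matching .chromosomes_unmasked.fa.gz file.
list_missing() {
    for p in *.patterns.txt; do
        [ -e "$p" ] || continue                 # no match: the glob stayed literal
        id=${p%.patterns.txt}                   # strip the suffix to get the identifier
        [ -e "$id.chromosomes_unmasked.fa.gz" ] || echo "$id"
    done
}
```

Unlike the one-liner, the output of `list_missing` can be piped directly into other commands, for example to build a samplesheet for a re-run.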

To check if a new pattern (AP in this example) would be suitable, try:

find . -name '*patterns.txt' | xargs grep -l AP | xargs head

Credits

oist/LuscombeU_stlpreprocess was originally written by @charles-plessy.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

Pipeline tools

  • Samtools

    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PMID: 19505943; PMCID: PMC2723002.

  • MultiQC

    Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

Software packaging/containerisation tools

  • Anaconda

    Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

  • Bioconda

    Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

  • BioContainers

    da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

  • Docker

    Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

  • Singularity

    Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.