sanger-tol · muffato · Apr 17, 2024 · Feb 9, 2024 · Feb 9, 2024 · Feb 26, 2024
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,35 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [[0.4.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.4.0)] – Buneary – [2024-03-28]
+
+The pipeline has now been validated on dozens of genomes, up to 11 Gbp.
+
+### Enhancements & fixes
+
+- Upgraded the version of `blobtools`, which enables better reporting of
+  wrong accession numbers and better handling of oddities in input files.
+- Files in the output blobdir are now compressed.
+- All modules handling blobdirs can now be cached.
+- Large genomes supported, up to at least 11 Gbp.
+- Allow all variations of FASTA and FASTQ extensions for input.
+- More fields included in the trace files.
+- All nf-core modules updated.
+
+### Software dependencies
+
+Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported; `conda` is not supported.
+
+| Dependency  | Old version   | New version   |
+| ----------- | ------------- | ------------- |
+| blobtoolkit | 4.3.3         | 4.3.9         |
+| blast       | 2.14.0        | 2.15.0        |
+| multiqc     | 1.17 and 1.18 | 1.20 and 1.21 |
+| samtools    | 1.18          | 1.19.2        |
+| seqtk       | 1.3           | 1.4           |
+
+> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if new version information isn't present.
+
 ## [[0.3.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.3.0)] – Poliwag – [2024-02-09]
 
 The pipeline has now been validated on five genomes, all under 100 Mbp: a
@@ -33,6 +62,16 @@ sponge, a platyhelminth, and three fungi.
 
 > **NB:** Parameter has been **updated** if both old and new parameter information is present. </br> **NB:** Parameter has been **added** if just the new parameter information is present. </br> **NB:** Parameter has been **removed** if new parameter information isn't present.
 
+### Software dependencies
+
+Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported; `conda` is not supported.
+
+| Dependency  | Old version | New version |
+| ----------- | ----------- | ----------- |
+| blobtoolkit | 4.3.2       | 4.3.3       |
+
+> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if new version information isn't present.
+
 ## [[0.2.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.2.0)] – Pikachu – [2023-12-22]
 
 ### Enhancements & fixes

diff --git a/README.md b/README.md
@@ -11,7 +11,8 @@
 
 ## Introduction
 
-**sanger-tol/blobtoolkit** is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes. It takes a samplesheet and aligned CRAM files as input, calculates genome statistics, coverage and completeness information, combines them in a TSV file by window size to create a BlobDir dataset and static plots.
+**sanger-tol/blobtoolkit** is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes.
+It takes a samplesheet of BAM/CRAM/FASTQ/FASTA files as input, calculates genome statistics, coverage and completeness information, and combines them in a TSV file by window size to create a BlobDir dataset and static plots.
 
 1. Calculate genome statistics in windows ([`fastawindows`](https://github.com/tolkit/fasta_windows))
 2. Calculate Coverage ([`blobtk/depth`](https://github.com/blobtoolkit/blobtk))

diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -21,8 +21,8 @@
             },
             "datafile": {
                 "type": "string",
-                "pattern": "^\\S+\\.cram$",
-                "errorMessage": "Data file for reads cannot contain spaces and must have extension 'cram'"
+                "pattern": "^\\S+\\.(bam|cram|fa|fa\\.gz|fasta|fasta\\.gz|fq|fq\\.gz|fastq|fastq\\.gz)$",
+                "errorMessage": "Data file for reads cannot contain spaces and must be BAM/CRAM/FASTQ/FASTA"
             }
         },
         "required": ["datafile", "datatype", "sample"]

diff --git a/bin/check_samplesheet.py b/bin/check_samplesheet.py
@@ -27,8 +27,14 @@ class RowChecker:
     VALID_FORMATS = (
         ".cram",
         ".bam",
+        ".fq",
+        ".fq.gz",
         ".fastq",
         ".fastq.gz",
+        ".fa",
+        ".fa.gz",
+        ".fasta",
+        ".fasta.gz",
     )
 
     VALID_DATATYPES = (
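
The accepted extensions are now declared in two places: as a regular expression in `assets/schema_input.json` and as the `VALID_FORMATS` tuple here. The Python sketch below checks the two against each other. It assumes the schema pattern is written with every literal dot escaped and that the Python side is a plain suffix match; the helper names and file names are hypothetical and not taken from `RowChecker`.

```python
import re

# Extensions as listed in VALID_FORMATS after this change.
VALID_FORMATS = (
    ".cram", ".bam",
    ".fq", ".fq.gz", ".fastq", ".fastq.gz",
    ".fa", ".fa.gz", ".fasta", ".fasta.gz",
)

# Same alternatives as the schema's "datafile" pattern, with literal dots escaped.
DATAFILE_PATTERN = re.compile(
    r"^\S+\.(bam|cram|fa|fa\.gz|fasta|fasta\.gz|fq|fq\.gz|fastq|fastq\.gz)$"
)


def matches_schema(path: str) -> bool:
    """Hypothetical helper: regex check in the style of the JSON schema."""
    return DATAFILE_PATTERN.match(path) is not None


def matches_tuple(path: str) -> bool:
    """Hypothetical helper: suffix check; str.endswith accepts a tuple."""
    return path.endswith(VALID_FORMATS)


# Made-up file names, for illustration only.
for name in ("reads.cram", "reads.fq.gz", "assembly.fasta", "reads.sam", "reads.faXgz"):
    print(f"{name}: schema={matches_schema(name)} tuple={matches_tuple(name)}")
```

The two checks are not fully equivalent: the schema pattern also rejects whitespace in the path, which a suffix match alone does not.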

diff --git a/bin/update_versions.py b/bin/update_versions.py
@@ -12,9 +12,10 @@ def parse_args(args=None):
     Description = "Combine BED files to create window stats input file."
 
     parser = argparse.ArgumentParser(description=Description)
-    parser.add_argument("--meta", help="Input JSON file.", required=True)
+    parser.add_argument("--meta_in", help="Input JSON file.", required=True)
+    parser.add_argument("--meta_out", help="Output JSON file.", required=True)
     parser.add_argument("--software", help="Input YAML file.", required=True)
-    parser.add_argument("--version", action="version", version="%(prog)s 1.0.0")
+    parser.add_argument("--version", action="version", version="%(prog)s 1.1.0")
     return parser.parse_args(args)
 
 
@@ -41,8 +42,8 @@ def update_meta(meta, software):
 def main(args=None):
     args = parse_args(args)
 
-    data = update_meta(args.meta, args.software)
-    with open(args.meta, "w") as fh:
+    data = update_meta(args.meta_in, args.software)
+    with open(args.meta_out, "w") as fh:
         json.dump(data, fh)
 
 

diff --git a/conf/base.config b/conf/base.config
@@ -52,6 +52,58 @@ process {
     withLabel:process_high_memory {
         memory = { check_max( 200.GB * task.attempt, 'memory' ) }
     }
+
+    withName: '.*:MINIMAP2_ALIGNMENT:MINIMAP2_CCS' {
+        cpus   = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
+        memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 14.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
+        time   = { check_max(        4.h  * Math.ceil( meta.read_count   / 1000000   ) * task.attempt, 'time'   ) }
+    }
+
+    // Extrapolated from the HIFI settings on the basis of 1 ONT alignment. CLR assumed to behave the same way as ONT
+    withName: '.*:MINIMAP2_ALIGNMENT:MINIMAP2_(CLR|ONT)' {
+        cpus   = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
+        memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 30.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
+        time   = { check_max(        1.h  * Math.ceil( meta.read_count   / 1000000   ) * task.attempt, 'time'   ) }
+    }
+
+    // Temporarily the same CPU and memory settings as CCS (shorter time limit)
+    withName: '.*:MINIMAP2_ALIGNMENT:MINIMAP2_(HIC|ILMN)' {
+        cpus   = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
+        memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 14.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
+        time   = { check_max(        3.h  * Math.ceil( meta.read_count   / 1000000   ) * task.attempt, 'time'   ) }
+    }
+
+    withName: 'WINDOWSTATS_INPUT' {
+        cpus   = { check_max( 1                  , 'cpus'    ) }
+        // 2 GB per 1 Gbp
+        memory = { check_max( 2.GB * task.attempt * Math.ceil(meta.genome_size / 1000000000), 'memory' ) }
+        time   = { check_max( 4.h  * task.attempt, 'time'    ) }
+    }
+
+    withName: 'BLOBTOOLKIT_WINDOWSTATS' {
+        cpus   = { check_max( 1                  , 'cpus'    ) }
+        // 3 GB per 1 Gbp
+        memory = { check_max( 3.GB * task.attempt * Math.ceil(meta.genome_size / 1000000000), 'memory' ) }
+        time   = { check_max( 4.h  * task.attempt, 'time'    ) }
+    }
+
+    withName: 'FASTAWINDOWS' {
+        // 1 CPU per 1 Gbp
+        cpus   = { check_max( Math.ceil(meta.genome_size / 1000000000), 'cpus' ) }
+        // 100 MB per 45 Mbp
+        memory = { check_max( 100.MB * task.attempt * Math.ceil(meta.genome_size / 45000000), 'memory' ) }
+    }
+
+    withName: BUSCO {
+        // The formulas below are equivalent to these ranges (on the first attempt):
+        // Gbp:    [ 1,  2,  4,   8,  16]
+        // CPUs:   [ 8, 12, 16,  20,  24]
+        // GB RAM: [16, 32, 64, 128, 256]
+        memory = { check_max( 1.GB * Math.pow(2, 3 + task.attempt + Math.ceil(positive_log(meta.genome_size/1000000000, 2))) , 'memory' ) }
+        cpus   = { log_increase_cpus(4, 4*task.attempt, Math.ceil(meta.genome_size/1000000000), 2) }
+        time   = { check_max( 3.h * Math.ceil(meta.genome_size/1000000000) * task.attempt, 'time') }
+    }
+
     withName:CUSTOM_DUMPSOFTWAREVERSIONS {
         cache = false
     }
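
The BUSCO ranges quoted in the comment can be checked numerically. The sketch below re-implements the two formulas in Python for illustration only. `positive_log` and `log_increase_cpus` are defined outside this diff, so the definitions used here are assumptions, chosen because they reproduce the quoted ranges on the first attempt. The `check_max` capping is left out.

```python
import math


def positive_log(value: float, base: float) -> float:
    # Assumed behaviour: a logarithm clamped to 0 for values <= 1.
    if value <= 1:
        return 0.0
    return math.log2(value) / math.log2(base)


def log_increase_cpus(start: int, step: int, value: float, base: float) -> int:
    # Assumed behaviour: one extra `step` of CPUs per `base`-fold increase in `value`.
    return start + step * (1 + math.ceil(positive_log(value, base)))


def busco_memory_gb(genome_size: int, attempt: int = 1) -> int:
    # Mirrors: 1.GB * 2^(3 + attempt + ceil(positive_log(genome_size / 1e9, 2)))
    gbp = genome_size / 1_000_000_000
    return 2 ** (3 + attempt + math.ceil(positive_log(gbp, 2)))


def busco_cpus(genome_size: int, attempt: int = 1) -> int:
    # Mirrors: log_increase_cpus(4, 4 * attempt, ceil(genome_size / 1e9), 2)
    gbp = math.ceil(genome_size / 1_000_000_000)
    return log_increase_cpus(4, 4 * attempt, gbp, 2)


sizes = [g * 1_000_000_000 for g in (1, 2, 4, 8, 16)]
print("CPUs:  ", [busco_cpus(s) for s in sizes])
print("GB RAM:", [busco_memory_gb(s) for s in sizes])
```

Under these assumptions each retry doubles the memory request and raises the CPU step, so the ranges in the comment describe the first attempt only.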

diff --git a/conf/modules.config b/conf/modules.config
@@ -29,23 +29,23 @@ process {
     }
 
     withName: "MINIMAP2_HIC" {
-        ext.args = "-ax sr"
+        ext.args = { "-ax sr -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
     }
 
     withName: "MINIMAP2_ILMN" {
-        ext.args = "-ax sr"
+        ext.args = { "-ax sr -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
     }
 
     withName: "MINIMAP2_CCS" {
-        ext.args = "-ax map-hifi --cs=short"
+        ext.args = { "-ax map-hifi --cs=short -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
     }
 
     withName: "MINIMAP2_CLR" {
-        ext.args = "-ax map-pb"
+        ext.args = { "-ax map-pb -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
     }
 
     withName: "MINIMAP2_ONT" {
-        ext.args = "-ax map-ont"
+        ext.args = { "-ax map-ont -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
     }
 
     withName: "SAMTOOLS_VIEW" {
@@ -67,6 +67,9 @@ process {
                         // Note: BUSCO *must* see the double-quotes around the parameters
                         '--force --metaeuk_parameters \'"-s=2"\' --metaeuk_rerun_parameters \'"-s=2"\''
                     : '--force' }
+    }
+
+    withName: "RESTRUCTUREBUSCODIR" {
         publishDir = [
             path: { "${params.outdir}/busco" },
             mode: params.publish_dir_mode,
@@ -98,22 +101,6 @@ process {
         ext.args = "--evalue 1.0e-25 --hit-count 10"
     }
 
-    withName: "BLOBTOOLKIT_SUMMARY" {
-        publishDir = [
-            path: { "${params.outdir}/blobtoolkit/${blobdir.name}" },
-            mode: params.publish_dir_mode,
-            saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
-        ]
-    }
-
-    withName: "BLOBTK_IMAGES" {
-        publishDir = [
-            path: { "${params.outdir}/blobtoolkit/plots" },
-            mode: params.publish_dir_mode,
-            saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
-        ]
-    }
-
     withName: "BLOBTOOLKIT_CHUNK" {
         ext.args = "--chunk 100000 --overlap 0 --max-chunks 10 --min-length 1000"
     }
@@ -138,14 +125,22 @@ process {
         ]
     }
 
-    withName: "BLOBTOOLKIT_UPDATEMETA" {
+    withName: "COMPRESSBLOBDIR" {
         publishDir = [
             path: { "${params.outdir}/blobtoolkit" },
             mode: params.publish_dir_mode,
             saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
         ]
     }
 
+    withName: "BLOBTK_IMAGES" {
+        publishDir = [
+            path: { "${params.outdir}/blobtoolkit/plots" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
+        ]
+    }
+
     withName: 'MULTIQC' {
         ext.args   = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' }
         publishDir = [
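
The `-I` value added to each MINIMAP2 process in this file is the genome size rounded up to whole gigabases. The Python sketch below shows that string construction; `minimap2_args` is a hypothetical helper and the genome sizes are invented. Groovy's `Math.ceil` returns a double, so the concatenation in the config would typically render the number as `3.0` where this sketch prints `3`.

```python
import math


def minimap2_args(preset: str, genome_size: int, extra: str = "") -> str:
    """Hypothetical helper: build minimap2 arguments with a genome-sized -I value."""
    gigabases = math.ceil(genome_size / 1e9)  # round up to whole Gbp
    parts = ["-ax", preset]
    if extra:
        parts.append(extra)
    parts.append(f"-I{gigabases}G")
    return " ".join(parts)


# Invented genome sizes, for illustration only.
print(minimap2_args("sr", 2_500_000_000))
print(minimap2_args("map-hifi", 11_000_000_000, extra="--cs=short"))
```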

diff --git a/docs/output.md b/docs/output.md
@@ -8,13 +8,13 @@ The directories listed below will be created in the results directory after the
 
 The directories comply with Tree of Life's canonical directory structure.
 
-<!-- Write this documentation describing your workflow's output -->
-
 ## Pipeline overview
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [BlobDir](#blobdir) - Output files from `blobtools` and `view` subworkflow
+- [BlobDir](#blobdir) - Output files viewable on a [BlobToolKit viewer](https://github.com/blobtoolkit/blobtoolkit)
+- [Static plots](#static-plots) - Static versions of the BlobToolKit plots
+- [BUSCO](#busco) - BUSCO results
 - [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
@@ -25,14 +25,43 @@ The files in the BlobDir dataset which is used to create the online interactive
 <details markdown="1">
 <summary>Output files</summary>
 
-- `<accession>/`
-  - `*.json`: files generated from genome and alignment coverage statistics
-  - `*.png`: static plot images
+- `blobtoolkit/`
+  - `<accession>/`
+    - `*.json.gz`: files generated from genome and alignment coverage statistics
 
 More information about visualising the data in the [BlobToolKit repository](https://github.com/blobtoolkit/blobtoolkit/tree/main/src/viewer)
 
 </details>
 
+### Static plots
+
+Images generated from the above blobdir using the [blobtk](https://github.com/blobtoolkit/blobtk) tool.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `blobtoolkit/`
+  - `plots/`
+    - `*.png` or `*.svg`, depending on the selected output format: static versions of the BlobToolKit plots.
+
+</details>
+
+### BUSCO
+
+BUSCO results generated by the pipeline (all BUSCO lineages that match the classification of the species).
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `blobtoolkit/`
+  - `busco/`
+    - `*.batch_summary.txt`: BUSCO scores as tab-separated files (1 file per lineage).
+    - `*.fasta.txt`: BUSCO scores as formatted text (1 file per lineage).
+    - `*.json`: BUSCO scores as JSON (1 file per lineage).
+    - `*/`: all output BUSCO files, including the coordinate and sequence files of the annotated genes.
+
+</details>
+
 ### MultiQC
 
 <details markdown="1">

diff --git a/docs/usage.md b/docs/usage.md
@@ -229,8 +229,8 @@ List of tools for any given dataset can be fetched from the API, for example htt
 
 | Dependency        | Snakemake | Nextflow |
 | ----------------- | --------- | -------- |
-| blobtoolkit       | 4.3.2     | 4.3.2    |
-| blast             | 2.12.0    | 2.14.1   |
+| blobtoolkit       | 4.3.2     | 4.3.9    |
+| blast             | 2.12.0    | 2.15.0   |
 | blobtk            | 0.5.0     | 0.5.1    |
 | busco             | 5.3.2     | 5.5.0    |
 | diamond           | 2.0.15    | 2.1.8    |
@@ -240,8 +240,8 @@ List of tools for any given dataset can be fetched from the API, for example htt
 | ncbi-datasets-cli | 14.1.0    |          |
 | nextflow          |           | 23.10.0  |
 | python            | 3.9.13    | 3.12.0   |
-| samtools          | 1.15.1    | 1.18     |
-| seqtk             | 1.3       |          |
+| samtools          | 1.15.1    | 1.19.2   |
+| seqtk             | 1.3       | 1.4      |
 | snakemake         | 7.19.1    |          |
 | windowmasker      | 2.12.0    | 2.14.0   |