Merge pull request #95 from sanger-tol/prod_fix

Fixes for production, March 2024
muffato authored Apr 17, 2024
2 parents 2a8f760 + 8af7fa8 commit f519c4e
Showing 89 changed files with 1,959 additions and 474 deletions.
39 changes: 39 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,35 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[0.4.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.4.0)] – Buneary – [2024-03-28]

The pipeline has now been validated on dozens of genomes, up to 11 Gbp.

### Enhancements & fixes

- Upgraded the version of `blobtools`, which enables better reporting of
  wrong accession numbers and better handling of oddities in input files.
- Files in the output blobdir are now compressed.
- All modules handling blobdirs can now be cached.
- Large genomes supported, up to at least 11 Gbp.
- Allow all variations of FASTA and FASTQ extensions for input.
- More fields included in the trace files.
- All nf-core modules updated.

### Software dependencies

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.

| Dependency | Old version | New version |
| ----------- | ------------- | ------------- |
| blobtoolkit | 4.3.3 | 4.3.9 |
| blast | 2.14.0 | 2.15.0 |
| multiqc | 1.17 and 1.18 | 1.20 and 1.21 |
| samtools | 1.18 | 1.19.2 |
| seqtk | 1.3 | 1.4 |

> **NB:** Dependency has been **updated** if both old and new version information is present. <br> **NB:** Dependency has been **added** if just the new version information is present. <br> **NB:** Dependency has been **removed** if version information isn't present.
## [[0.3.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.3.0)] – Poliwag – [2024-02-09]

The pipeline has now been validated on five genomes, all under 100 Mbp: a
@@ -33,6 +62,16 @@ sponge, a platyhelminth, and three fungi.

> **NB:** Parameter has been **updated** if both old and new parameter information is present. <br> **NB:** Parameter has been **added** if just the new parameter information is present. <br> **NB:** Parameter has been **removed** if new parameter information isn't present.
### Software dependencies

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.

| Dependency | Old version | New version |
| ----------- | ----------- | ----------- |
| blobtoolkit | 4.3.2 | 4.3.3 |

> **NB:** Dependency has been **updated** if both old and new version information is present. <br> **NB:** Dependency has been **added** if just the new version information is present. <br> **NB:** Dependency has been **removed** if version information isn't present.
## [[0.2.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.2.0)] – Pikachu – [2023-12-22]

### Enhancements & fixes
3 changes: 2 additions & 1 deletion README.md
@@ -11,7 +11,8 @@

## Introduction

**sanger-tol/blobtoolkit** is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes. It takes a samplesheet and aligned CRAM files as input, calculates genome statistics, coverage and completeness information, combines them in a TSV file by window size to create a BlobDir dataset and static plots.
**sanger-tol/blobtoolkit** is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes.
It takes a samplesheet of BAM/CRAM/FASTQ/FASTA files as input, calculates genome statistics, coverage and completeness information, combines them in a TSV file by window size to create a BlobDir dataset and static plots.

1. Calculate genome statistics in windows ([`fastawindows`](https://github.com/tolkit/fasta_windows))
2. Calculate Coverage ([`blobtk/depth`](https://github.com/blobtoolkit/blobtk))
4 changes: 2 additions & 2 deletions assets/schema_input.json
@@ -21,8 +21,8 @@
},
"datafile": {
"type": "string",
"pattern": "^\\S+\\.cram$",
"errorMessage": "Data file for reads cannot contain spaces and must have extension 'cram'"
"pattern": "^\\S+\\.(bam|cram|fa|fa.gz|fasta|fasta.gz|fq|fq.gz|fastq|fastq.gz)$",
"errorMessage": "Data file for reads cannot contain spaces and must be BAM/CRAM/FASTQ/FASTA"
}
},
"required": ["datafile", "datatype", "sample"]
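For reference (not part of the commit), a quick check of the relaxed `datafile` pattern against typical file names — the new alternation accepts BAM, CRAM and the common FASTA/FASTQ extensions, while the `\S+` prefix still rejects names containing spaces:

```python
import re

# Pattern copied from assets/schema_input.json
pattern = re.compile(r"^\S+\.(bam|cram|fa|fa.gz|fasta|fasta.gz|fq|fq.gz|fastq|fastq.gz)$")

for name in ("reads.cram", "reads.bam", "sample_1.fastq.gz", "assembly.fa.gz", "my reads.cram"):
    print(name, bool(pattern.match(name)))
# -> True, True, True, True, False (the space breaks the \S+ requirement)
```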
6 changes: 6 additions & 0 deletions bin/check_samplesheet.py
@@ -27,8 +27,14 @@ class RowChecker:
VALID_FORMATS = (
".cram",
".bam",
".fq",
".fq.gz",
".fastq",
".fastq.gz",
".fa",
".fa.gz",
".fasta",
".fasta.gz",
)

VALID_DATATYPES = (
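The extended `VALID_FORMATS` tuple mirrors the same set of extensions on the script side. A minimal sketch (not the script's actual logic) of how a tuple like this is typically applied:

```python
VALID_FORMATS = (".cram", ".bam", ".fq", ".fq.gz", ".fastq", ".fastq.gz",
                 ".fa", ".fa.gz", ".fasta", ".fasta.gz")

def has_valid_extension(filename: str) -> bool:
    # str.endswith accepts a tuple, so one call covers every allowed suffix
    return filename.lower().endswith(VALID_FORMATS)

print(has_valid_extension("sample_1.fastq.gz"))  # True
print(has_valid_extension("sample_1.sam"))       # False
```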
9 changes: 5 additions & 4 deletions bin/update_versions.py
@@ -12,9 +12,10 @@ def parse_args(args=None):
Description = "Combine BED files to create window stats input file."

parser = argparse.ArgumentParser(description=Description)
parser.add_argument("--meta", help="Input JSON file.", required=True)
parser.add_argument("--meta_in", help="Input JSON file.", required=True)
parser.add_argument("--meta_out", help="Output JSON file.", required=True)
parser.add_argument("--software", help="Input YAML file.", required=True)
parser.add_argument("--version", action="version", version="%(prog)s 1.0.0")
parser.add_argument("--version", action="version", version="%(prog)s 1.1.0")
return parser.parse_args(args)


@@ -41,8 +42,8 @@ def update_meta(meta, software):
def main(args=None):
args = parse_args(args)

data = update_meta(args.meta, args.software)
with open(args.meta, "w") as fh:
data = update_meta(args.meta_in, args.software)
with open(args.meta_out, "w") as fh:
json.dump(data, fh)


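With the argument split, the updated metadata is written to a separate file instead of overwriting the input JSON in place. A small reconstruction of the parser (for illustration only; the paths are hypothetical) showing the new interface:

```python
import argparse

# Re-creating the parser above to show the --meta_in / --meta_out split
parser = argparse.ArgumentParser(description="Combine BED files to create window stats input file.")
parser.add_argument("--meta_in", help="Input JSON file.", required=True)
parser.add_argument("--meta_out", help="Output JSON file.", required=True)
parser.add_argument("--software", help="Input YAML file.", required=True)

args = parser.parse_args([
    "--meta_in", "blobdir/meta.json",           # hypothetical input path
    "--meta_out", "blobdir/meta.updated.json",  # hypothetical output path
    "--software", "software_versions.yml",      # hypothetical versions file
])
print(args.meta_in, "->", args.meta_out)
```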
52 changes: 52 additions & 0 deletions conf/base.config
@@ -52,6 +52,58 @@ process {
withLabel:process_high_memory {
memory = { check_max( 200.GB * task.attempt, 'memory' ) }
}

withName: '.*:MINIMAP2_ALIGNMENT:MINIMAP2_CCS' {
cpus = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 14.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
time = { check_max( 4.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}

// Extrapolated from the HIFI settings on the basis of 1 ONT alignment. CLR assumed to behave the same way as ONT
withName: '.*:MINIMAP2_ALIGNMENT:MINIMAP2_(CLR|ONT)' {
cpus = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 30.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
time = { check_max( 1.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}

// Temporarily the same settings as CCS
withName: '.*:MINIMAP2_ALIGNMENT:MINIMAP2_(HIC|ILMN)' {
cpus = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 14.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
time = { check_max( 3.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}

withName: 'WINDOWSTATS_INPUT' {
cpus = { check_max( 1 , 'cpus' ) }
// 2 GB per 1 Gbp
memory = { check_max( 2.GB * task.attempt * Math.ceil(meta.genome_size / 1000000000), 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
}

withName: 'BLOBTOOLKIT_WINDOWSTATS' {
cpus = { check_max( 1 , 'cpus' ) }
// 3 GB per 1 Gbp
memory = { check_max( 3.GB * task.attempt * Math.ceil(meta.genome_size / 1000000000), 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
}

withName: 'FASTAWINDOWS' {
// 1 CPU per 1 Gbp
cpus = { check_max( Math.ceil(meta.genome_size / 1000000000), 'cpus' ) }
// 100 MB per 45 Mbp
memory = { check_max( 100.MB * task.attempt * Math.ceil(meta.genome_size / 45000000), 'memory' ) }
}

withName: BUSCO {
// The formulas below are equivalent to these ranges:
// Gbp: [ 1, 2, 4, 8, 16]
// CPUs: [ 8, 12, 16, 20, 24]
// GB RAM: [16, 32, 64, 128, 256]
memory = { check_max( 1.GB * Math.pow(2, 3 + task.attempt + Math.ceil(positive_log(meta.genome_size/1000000000, 2))) , 'memory' ) }
cpus = { log_increase_cpus(4, 4*task.attempt, Math.ceil(meta.genome_size/1000000000), 2) }
time = { check_max( 3.h * Math.ceil(meta.genome_size/1000000000) * task.attempt, 'time') }
}

withName:CUSTOM_DUMPSOFTWAREVERSIONS {
cache = false
}
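The BUSCO resource rule above encodes the ranges listed in its comment. A quick check (assuming `positive_log` behaves like a base-2 log clamped at zero) that the memory formula reproduces the stated 16–256 GB progression on the first attempt:

```python
import math

def busco_memory_gb(genome_size_bp: int, attempt: int = 1) -> int:
    """Sketch of the BUSCO memory rule: 1 GB * 2^(3 + attempt + ceil(log2(Gbp)))."""
    gbp = genome_size_bp / 1_000_000_000
    # positive_log is assumed here to be a log that never drops below zero
    positive_log = max(math.log2(gbp), 0)
    return int(2 ** (3 + attempt + math.ceil(positive_log)))

# Genomes of 1, 2, 4, 8 and 16 Gbp -> 16, 32, 64, 128, 256 GB
print([busco_memory_gb(g * 1_000_000_000) for g in (1, 2, 4, 8, 16)])
```

The other new rules follow the same pattern of scaling with `meta.genome_size` (e.g. 2 GB and 3 GB of memory per Gbp for the window-stats steps).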
39 changes: 17 additions & 22 deletions conf/modules.config
@@ -29,23 +29,23 @@ process {
}

withName: "MINIMAP2_HIC" {
ext.args = "-ax sr"
ext.args = { "-ax sr -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "MINIMAP2_ILMN" {
ext.args = "-ax sr"
ext.args = { "-ax sr -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "MINIMAP2_CCS" {
ext.args = "-ax map-hifi --cs=short"
ext.args = { "-ax map-hifi --cs=short -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "MINIMAP2_CLR" {
ext.args = "-ax map-pb"
ext.args = { "-ax map-pb -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "MINIMAP2_ONT" {
ext.args = "-ax map-ont"
ext.args = { "-ax map-ont -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "SAMTOOLS_VIEW" {
@@ -67,6 +67,9 @@ process {
// Note: BUSCO *must* see the double-quotes around the parameters
'--force --metaeuk_parameters \'"-s=2"\' --metaeuk_rerun_parameters \'"-s=2"\''
: '--force' }
}

withName: "RESTRUCTUREBUSCODIR" {
publishDir = [
path: { "${params.outdir}/busco" },
mode: params.publish_dir_mode,
@@ -98,22 +101,6 @@ process {
ext.args = "--evalue 1.0e-25 --hit-count 10"
}

withName: "BLOBTOOLKIT_SUMMARY" {
publishDir = [
path: { "${params.outdir}/blobtoolkit/${blobdir.name}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
]
}

withName: "BLOBTK_IMAGES" {
publishDir = [
path: { "${params.outdir}/blobtoolkit/plots" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
]
}

withName: "BLOBTOOLKIT_CHUNK" {
ext.args = "--chunk 100000 --overlap 0 --max-chunks 10 --min-length 1000"
}
@@ -138,14 +125,22 @@ process {
]
}

withName: "BLOBTOOLKIT_UPDATEMETA" {
withName: "COMPRESSBLOBDIR" {
publishDir = [
path: { "${params.outdir}/blobtoolkit" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
]
}

withName: "BLOBTK_IMAGES" {
publishDir = [
path: { "${params.outdir}/blobtoolkit/plots" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
]
}

withName: 'MULTIQC' {
ext.args = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' }
publishDir = [
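The minimap2 `ext.args` in this file now derive the `-I` index batch size from the genome size, so the whole assembly can be indexed in a single batch. A rough Python sketch of that computation (the exact number formatting of the Groovy expression may differ):

```python
import math

def minimap2_args(preset_args: str, genome_size_bp: int) -> str:
    # Round the genome size up to whole Gbp and pass it as the -I batch size,
    # so minimap2 loads the assembly as one index batch instead of splitting it.
    gbp = math.ceil(genome_size_bp / 1e9)
    return f"{preset_args} -I{gbp}G"

# A hypothetical 3.1 Gbp assembly -> "-ax map-hifi --cs=short -I4G"
print(minimap2_args("-ax map-hifi --cs=short", 3_100_000_000))
```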
41 changes: 35 additions & 6 deletions docs/output.md
@@ -8,13 +8,13 @@ The directories listed below will be created in the results directory after the

The directories comply with Tree of Life's canonical directory structure.

<!-- Write this documentation describing your workflow's output -->

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [BlobDir](#blobdir) - Output files from `blobtools` and `view` subworkflow
- [BlobDir](#blobdir) - Output files viewable on a [BlobToolKit viewer](https://github.com/blobtoolkit/blobtoolkit)
- [Static plots](#static-plots) - Static versions of the BlobToolKit plots
- [BUSCO](#busco) - BUSCO results
- [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

@@ -25,14 +25,43 @@ The files in the BlobDir dataset which is used to create the online interactive
<details markdown="1">
<summary>Output files</summary>

- `<accession>/`
- `*.json`: files generated from genome and alignment coverage statistics
- `*.png`: static plot images
- `blobtoolkit/`
- `<accession>/`
- `*.json.gz`: files generated from genome and alignment coverage statistics

More information about visualising the data is available in the [BlobToolKit repository](https://github.com/blobtoolkit/blobtoolkit/tree/main/src/viewer).

</details>

### Static plots

Images generated from the above blobdir using the [blobtk](https://github.com/blobtoolkit/blobtk) tool.

<details markdown="1">
<summary>Output files</summary>

- `blobtoolkit/`
- `plots/`
- `*.png` or `*.svg`, depending on the selected output format: static versions of the BlobToolKit plots.

</details>

### BUSCO

BUSCO results generated by the pipeline (all BUSCO lineages that match the classification of the species).

<details markdown="1">
<summary>Output files</summary>

- `blobtoolkit/`
- `busco/`
- `*.batch_summary.txt`: BUSCO scores as tab-separated files (1 file per lineage).
- `*.fasta.txt`: BUSCO scores as formatted text (1 file per lineage).
- `*.json`: BUSCO scores as JSON (1 file per lineage).
- `*/`: all output BUSCO files, including the coordinate and sequence files of the annotated genes.

</details>

### MultiQC

<details markdown="1">
8 changes: 4 additions & 4 deletions docs/usage.md
@@ -229,8 +229,8 @@ List of tools for any given dataset can be fetched from the API, for example htt

| Dependency | Snakemake | Nextflow |
| ----------------- | --------- | -------- |
| blobtoolkit | 4.3.2 | 4.3.2 |
| blast | 2.12.0 | 2.14.1 |
| blobtoolkit | 4.3.2 | 4.3.9 |
| blast | 2.12.0 | 2.15.0 |
| blobtk | 0.5.0 | 0.5.1 |
| busco | 5.3.2 | 5.5.0 |
| diamond | 2.0.15 | 2.1.8 |
@@ -240,8 +240,8 @@ List of tools for any given dataset can be fetched from the API, for example htt
| ncbi-datasets-cli | 14.1.0 | |
| nextflow | | 23.10.0 |
| python | 3.9.13 | 3.12.0 |
| samtools | 1.15.1 | 1.18 |
| seqtk | 1.3 | |
| samtools | 1.15.1 | 1.19.2 |
| seqtk | 1.3 | 1.4 |
| snakemake | 7.19.1 | |
| windowmasker | 2.12.0 | 2.14.0 |
