Fixes for production, March 2024 #95

Merged on Apr 17, 2024 (36 commits)
196427a
Upgraded the BTK container version to enable better reporting of a wr…
muffato Feb 9, 2024
b65e399
Updated the minimap2/align module
muffato Feb 9, 2024
0f509b4
Same format for the trace files as the readmapping and genomenote pip…
muffato Feb 26, 2024
0e3a4a0
bugfix: put the summary.json in the blobdir so that it is published too
muffato Mar 15, 2024
fc7c576
Upgraded the BTK container version to handle None in CSV files
muffato Mar 22, 2024
f04b749
Make minimap2 support large genomes
muffato Mar 22, 2024
40fbde5
Pull the genome size as a meta field to allow usage in every process
muffato Mar 22, 2024
46d231d
Don't modify the blobdir in place
muffato Mar 22, 2024
bb428b2
Compress the blobdir
muffato Mar 22, 2024
9ee35c0
Updated the changelog
muffato Mar 22, 2024
2c1bd1f
Updated the documentation
muffato Mar 22, 2024
d7fa228
Version bump
muffato Mar 22, 2024
b53b497
Updated the resource requirements of fasta_windows
muffato Mar 22, 2024
f99b542
Updated the resource requirements of BUSCO
muffato Mar 22, 2024
72a639b
Updated the resource requirements of WINDOWSTATS_INPUT
muffato Mar 22, 2024
54862de
Updated the resource requirements of BLOBTOOLKIT_WINDOWSTATS
muffato Mar 22, 2024
424d7e0
Added missing FastQ extensions
muffato Mar 22, 2024
f8c3cd0
Also support FastA files
muffato Mar 22, 2024
8f7e029
Updated the input JSON
muffato Mar 22, 2024
75bbc21
Updated the CHANGELOG
muffato Mar 22, 2024
7dd24b3
Only BAM and CRAM are accepted for aligned reads
muffato Mar 22, 2024
a71b759
There is no patch for cat/cat
muffato Mar 22, 2024
857d094
Updated all nf-core modules
muffato Mar 22, 2024
ff43207
No need to compress the file if we need to decompress it right after
muffato Mar 22, 2024
b967b51
Count the number of reads in each input file
muffato Mar 22, 2024
bf82511
Optimised settings for minimap2, taken from the read-mapping pipeline
muffato Mar 22, 2024
fd86c4e
Updated these versions too
muffato Mar 26, 2024
c8253f5
Slightly increased the runtime
muffato Mar 26, 2024
eecf569
Missing space
muffato Mar 28, 2024
cf1dc98
bugfix: meta needs to have the lineage information so that the join c…
muffato Apr 2, 2024
328e905
bugfix: these files can be missing if no gene is found
muffato Apr 2, 2024
4c723c7
Tidy up the busco directory before publication
muffato Apr 2, 2024
e273433
Actually need more resources for BUSCO
muffato Apr 2, 2024
f4e236f
Better explanation
muffato Apr 8, 2024
7417c5b
Updated the BUSCO resources
muffato Apr 9, 2024
8af7fa8
Updated the README
muffato Apr 17, 2024
39 changes: 39 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,35 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[0.4.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.4.0)] – Buneary – [2024-03-28]

The pipeline has now been validated on dozens of genomes, up to 11 Gbp.

### Enhancements & fixes

- Upgraded the version of `blobtools`, which enables better reporting of
wrong accession numbers and better handling of oddities in input files.
- Files in the output blobdir are now compressed.
- All modules handling blobdirs can now be cached.
- Large genomes supported, up to at least 11 Gbp.
- Allow all variations of FASTA and FASTQ extensions for input.
- More fields included in the trace files.
- All nf-core modules updated.

### Software dependencies

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.

| Dependency | Old version | New version |
| ----------- | ------------- | ------------- |
| blobtoolkit | 4.3.3 | 4.3.9 |
| blast | 2.14.0 | 2.15.0 |
| multiqc | 1.17 and 1.18 | 1.20 and 1.21 |
| samtools | 1.18 | 1.19.2 |
| seqtk | 1.3 | 1.4 |

> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if version information isn't present.

## [[0.3.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.3.0)] – Poliwag – [2024-02-09]

The pipeline has now been validated on five genomes, all under 100 Mbp: a
@@ -33,6 +62,16 @@ sponge, a platyhelminth, and three fungi.

> **NB:** Parameter has been **updated** if both old and new parameter information is present. </br> **NB:** Parameter has been **added** if just the new parameter information is present. </br> **NB:** Parameter has been **removed** if new parameter information isn't present.

### Software dependencies

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.

| Dependency | Old version | New version |
| ----------- | ----------- | ----------- |
| blobtoolkit | 4.3.2 | 4.3.3 |

> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if version information isn't present.

## [[0.2.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.2.0)] – Pikachu – [2023-12-22]

### Enhancements & fixes
Expand Down
3 changes: 2 additions & 1 deletion README.md
@@ -11,7 +11,8 @@

## Introduction

-**sanger-tol/blobtoolkit** is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes. It takes a samplesheet and aligned CRAM files as input, calculates genome statistics, coverage and completeness information, combines them in a TSV file by window size to create a BlobDir dataset and static plots.
+**sanger-tol/blobtoolkit** is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes.
+It takes a samplesheet of BAM/CRAM/FASTQ/FASTA files as input, calculates genome statistics, coverage and completeness information, combines them in a TSV file by window size to create a BlobDir dataset and static plots.

1. Calculate genome statistics in windows ([`fastawindows`](https://github.com/tolkit/fasta_windows))
2. Calculate Coverage ([`blobtk/depth`](https://github.com/blobtoolkit/blobtk))
Expand Down
4 changes: 2 additions & 2 deletions assets/schema_input.json
@@ -21,8 +21,8 @@
},
"datafile": {
"type": "string",
-"pattern": "^\\S+\\.cram$",
-"errorMessage": "Data file for reads cannot contain spaces and must have extension 'cram'"
+"pattern": "^\\S+\\.(bam|cram|fa|fa.gz|fasta|fasta.gz|fq|fq.gz|fastq|fastq.gz)$",
+"errorMessage": "Data file for reads cannot contain spaces and must be BAM/CRAM/FASTQ/FASTA"
}
},
"required": ["datafile", "datatype", "sample"]
Expand Down
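The new `datafile` pattern in `assets/schema_input.json` can be tried out with a short Python sketch. `SCHEMA_PATTERN` is the pattern from the diff above with the JSON escaping removed; `STRICT_PATTERN` and `is_valid_datafile` are hypothetical names introduced here for illustration and are not part of the PR. The sketch assumes Python's `re` treats these constructs the same way as the schema validator does.

```python
import re

# Pattern from the updated "datafile" entry (JSON string escapes removed).
# The dots before "gz" are unescaped, so they match any single character.
SCHEMA_PATTERN = re.compile(
    r"^\S+\.(bam|cram|fa|fa.gz|fasta|fasta.gz|fq|fq.gz|fastq|fastq.gz)$"
)

# Hypothetical stricter variant (NOT in the PR): the dot before "gz" is literal.
STRICT_PATTERN = re.compile(r"^\S+\.(bam|cram|(fa|fasta|fq|fastq)(\.gz)?)$")


def is_valid_datafile(path, pattern=SCHEMA_PATTERN):
    """Return True if the path matches the given data-file pattern."""
    return pattern.match(path) is not None


if __name__ == "__main__":
    for name in ["reads.cram", "reads.fastq.gz", "my reads.cram", "reads.sam", "reads.faXgz"]:
        print(name, is_valid_datafile(name), is_valid_datafile(name, STRICT_PATTERN))
```

Both patterns accept the ten listed extensions and reject paths containing spaces. They differ only on odd names such as `reads.faXgz`, which the schema pattern accepts because of the unescaped dot.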
6 changes: 6 additions & 0 deletions bin/check_samplesheet.py
@@ -27,8 +27,14 @@ class RowChecker:
VALID_FORMATS = (
".cram",
".bam",
".fq",
".fq.gz",
".fastq",
".fastq.gz",
".fa",
".fa.gz",
".fasta",
".fasta.gz",
)

VALID_DATATYPES = (
Expand Down
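The diff does not show how `RowChecker` consumes `VALID_FORMATS`. One plausible use, sketched below as an assumption, is a suffix test: `str.endswith` accepts a tuple, so the extended tuple can be passed directly. The function `has_valid_format` is a hypothetical helper, not code from `bin/check_samplesheet.py`.

```python
# Extensions listed in the updated VALID_FORMATS tuple.
VALID_FORMATS = (
    ".cram",
    ".bam",
    ".fq",
    ".fq.gz",
    ".fastq",
    ".fastq.gz",
    ".fa",
    ".fa.gz",
    ".fasta",
    ".fasta.gz",
)


def has_valid_format(filename):
    """Hypothetical check: True if the file name ends with an accepted extension."""
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return filename.endswith(VALID_FORMATS)


if __name__ == "__main__":
    for name in ["sample.fq.gz", "sample.fasta", "sample.fastq.bz2"]:
        print(name, has_valid_format(name))
```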
9 changes: 5 additions & 4 deletions bin/update_versions.py
@@ -12,9 +12,10 @@ def parse_args(args=None):
Description = "Combine BED files to create window stats input file."

parser = argparse.ArgumentParser(description=Description)
-parser.add_argument("--meta", help="Input JSON file.", required=True)
+parser.add_argument("--meta_in", help="Input JSON file.", required=True)
+parser.add_argument("--meta_out", help="Output JSON file.", required=True)
parser.add_argument("--software", help="Input YAML file.", required=True)
-parser.add_argument("--version", action="version", version="%(prog)s 1.0.0")
+parser.add_argument("--version", action="version", version="%(prog)s 1.1.0")
return parser.parse_args(args)


@@ -41,8 +42,8 @@ def update_meta(meta, software):
def main(args=None):
args = parse_args(args)

-data = update_meta(args.meta, args.software)
-with open(args.meta, "w") as fh:
+data = update_meta(args.meta_in, args.software)
+with open(args.meta_out, "w") as fh:
json.dump(data, fh)


Expand Down
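The change to `bin/update_versions.py` splits the single `--meta` path into `--meta_in` and `--meta_out`, so the input JSON is no longer overwritten (in line with the "Don't modify the blobdir in place" commit). A minimal, self-contained sketch of that read-then-write-elsewhere pattern follows. The body of `update_meta` is not shown in the diff, so `add_note` is a hypothetical stand-in that just adds one key, and the `--software` option is left out.

```python
import argparse
import json


def parse_args(args=None):
    parser = argparse.ArgumentParser(description="Update a JSON file without overwriting it.")
    parser.add_argument("--meta_in", help="Input JSON file.", required=True)
    parser.add_argument("--meta_out", help="Output JSON file.", required=True)
    return parser.parse_args(args)


def add_note(meta_path, note):
    """Hypothetical stand-in for update_meta(): load the JSON and add one key."""
    with open(meta_path) as fh:
        data = json.load(fh)
    data["note"] = note
    return data


def main(args=None):
    args = parse_args(args)
    data = add_note(args.meta_in, "updated")
    # Write to a separate path so the input file is left untouched.
    with open(args.meta_out, "w") as fh:
        json.dump(data, fh)


if __name__ == "__main__":
    main()
```

Keeping input and output separate means a task that is re-run always sees the same input, which is presumably what allows the blobdir-handling modules to be cached, as the changelog notes.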
52 changes: 52 additions & 0 deletions conf/base.config
@@ -52,6 +52,58 @@ process {
withLabel:process_high_memory {
memory = { check_max( 200.GB * task.attempt, 'memory' ) }
}

withName: '.*:MINIMAP2_ALIGNMENT:MINIMAP2_CCS' {
cpus = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 14.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
time = { check_max( 4.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}

// Extrapolated from the HIFI settings on the basis of 1 ONT alignment. CLR assumed to behave the same way as ONT
withName: '.*:MINIMAP2_ALIGNMENT:MINIMAP2_(CLR|ONT)' {
cpus = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 30.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
time = { check_max( 1.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}

// Temporarily the same settings as CCS
withName: '.*:MINIMAP2_ALIGNMENT:MINIMAP2_(HIC|ILMN)' {
cpus = { log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) }
memory = { check_max( 800.MB * log_increase_cpus(4, 2*task.attempt, meta.read_count/1000000, 2) + 14.GB * Math.ceil( Math.pow(meta2.genome_size / 1000000000, 0.6)) * task.attempt, 'memory' ) }
time = { check_max( 3.h * Math.ceil( meta.read_count / 1000000 ) * task.attempt, 'time' ) }
}

withName: 'WINDOWSTATS_INPUT' {
cpus = { check_max( 1 , 'cpus' ) }
// 2 GB per 1 Gbp
memory = { check_max( 2.GB * task.attempt * Math.ceil(meta.genome_size / 1000000000), 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
}

withName: 'BLOBTOOLKIT_WINDOWSTATS' {
cpus = { check_max( 1 , 'cpus' ) }
// 3 GB per 1 Gbp
memory = { check_max( 3.GB * task.attempt * Math.ceil(meta.genome_size / 1000000000), 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
}

withName: 'FASTAWINDOWS' {
// 1 CPU per 1 Gbp
cpus = { check_max( Math.ceil(meta.genome_size / 1000000000), 'cpus' ) }
// 100 MB per 45 Mbp
memory = { check_max( 100.MB * task.attempt * Math.ceil(meta.genome_size / 45000000), 'memory' ) }
}

withName: BUSCO {
// The formulas below are equivalent to these ranges:
// Gbp: [ 1, 2, 4, 8, 16]
// CPUs: [ 8, 12, 16, 20, 24]
// GB RAM: [16, 32, 64, 128, 256]
memory = { check_max( 1.GB * Math.pow(2, 3 + task.attempt + Math.ceil(positive_log(meta.genome_size/1000000000, 2))) , 'memory' ) }
cpus = { log_increase_cpus(4, 4*task.attempt, Math.ceil(meta.genome_size/1000000000), 2) }
time = { check_max( 3.h * Math.ceil(meta.genome_size/1000000000) * task.attempt, 'time') }
}

withName:CUSTOM_DUMPSOFTWAREVERSIONS {
cache = false
}
Expand Down
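The BUSCO resource formulas in `conf/base.config` can be checked against the ranges given in their comment with a small Python sketch. The helpers `positive_log` and `log_increase_cpus` are called in the config but not defined in this diff; the bodies below are assumptions inferred from the comment table (a logarithm clamped at zero, and `start + step * (1 + ceil(log))`). The `check_max` cap is omitted, and the logarithm is computed via `math.log2` to keep the powers of two exact.

```python
import math

GBP = 1_000_000_000


def positive_log(value, base=2):
    """Assumed behaviour: logarithm of value in the given base, never below zero."""
    if value <= 1:
        return 0.0
    return math.log2(value) / math.log2(base)


def log_increase_cpus(start, step, value, base=2):
    """Assumed behaviour: start CPUs, plus one step per logarithmic level of value."""
    return start + step * (1 + math.ceil(positive_log(value, base)))


def busco_memory_gb(genome_size, attempt=1):
    """Memory in GB: 2 ** (3 + attempt + ceil(log2(genome size in Gbp)))."""
    return 2 ** (3 + attempt + math.ceil(positive_log(genome_size / GBP, 2)))


def busco_cpus(genome_size, attempt=1):
    """CPU count following the formula in the BUSCO selector (cap not applied)."""
    return log_increase_cpus(4, 4 * attempt, math.ceil(genome_size / GBP), 2)


if __name__ == "__main__":
    for gbp in (1, 2, 4, 8, 16):
        size = gbp * GBP
        print(gbp, "Gbp:", busco_cpus(size), "CPUs,", busco_memory_gb(size), "GB")
```

Under these assumptions the first attempt reproduces the comment's table (8 to 24 CPUs, 16 to 256 GB), and an 11 Gbp genome falls into the 16 Gbp column. Each retry doubles the memory request, because `task.attempt` sits in the exponent.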
39 changes: 17 additions & 22 deletions conf/modules.config
@@ -29,23 +29,23 @@ process {
}

withName: "MINIMAP2_HIC" {
-ext.args = "-ax sr"
+ext.args = { "-ax sr -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "MINIMAP2_ILMN" {
-ext.args = "-ax sr"
+ext.args = { "-ax sr -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "MINIMAP2_CCS" {
-ext.args = "-ax map-hifi --cs=short"
+ext.args = { "-ax map-hifi --cs=short -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "MINIMAP2_CLR" {
-ext.args = "-ax map-pb"
+ext.args = { "-ax map-pb -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "MINIMAP2_ONT" {
-ext.args = "-ax map-ont"
+ext.args = { "-ax map-ont -I" + Math.ceil(meta2.genome_size/1e9) + 'G' }
}

withName: "SAMTOOLS_VIEW" {
@@ -67,6 +67,9 @@ process {
// Note: BUSCO *must* see the double-quotes around the parameters
'--force --metaeuk_parameters \'"-s=2"\' --metaeuk_rerun_parameters \'"-s=2"\''
: '--force' }
+}
+
+withName: "RESTRUCTUREBUSCODIR" {
publishDir = [
path: { "${params.outdir}/busco" },
mode: params.publish_dir_mode,
@@ -98,22 +101,6 @@ process {
ext.args = "--evalue 1.0e-25 --hit-count 10"
}

-withName: "BLOBTOOLKIT_SUMMARY" {
-publishDir = [
-path: { "${params.outdir}/blobtoolkit/${blobdir.name}" },
-mode: params.publish_dir_mode,
-saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
-]
-}
-
-withName: "BLOBTK_IMAGES" {
-publishDir = [
-path: { "${params.outdir}/blobtoolkit/plots" },
-mode: params.publish_dir_mode,
-saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
-]
-}
-
withName: "BLOBTOOLKIT_CHUNK" {
ext.args = "--chunk 100000 --overlap 0 --max-chunks 10 --min-length 1000"
}
@@ -138,14 +125,22 @@ process {
]
}

-withName: "BLOBTOOLKIT_UPDATEMETA" {
+withName: "COMPRESSBLOBDIR" {
publishDir = [
path: { "${params.outdir}/blobtoolkit" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
]
}

+withName: "BLOBTK_IMAGES" {
+publishDir = [
+path: { "${params.outdir}/blobtoolkit/plots" },
+mode: params.publish_dir_mode,
+saveAs: { filename -> filename.equals("versions.yml") ? null : filename }
+]
+}
+
withName: 'MULTIQC' {
ext.args = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' }
publishDir = [
Expand Down
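The minimap2 selectors in `conf/modules.config` now append `-I<n>G`, with `n` being the genome size rounded up to whole Gbp. The apparent intent is that the index batch size is never smaller than the reference. The Python sketch below mirrors that arithmetic; `minimap2_args` is a hypothetical helper. It formats `n` as an integer, whereas Groovy's `Math.ceil` returns a floating-point value, so the string the pipeline builds may be formatted differently (for example with a trailing `.0`).

```python
import math


def minimap2_args(preset_args, genome_size):
    """Hypothetical helper: append an -I value sized to the genome, in whole Gbp."""
    index_gbp = math.ceil(genome_size / 1e9)
    return f"{preset_args} -I{index_gbp}G"


if __name__ == "__main__":
    print(minimap2_args("-ax sr", 2_500_000_000))
    print(minimap2_args("-ax map-hifi --cs=short", 11_000_000_000))
```

Because the value is rounded up, genomes under 1 Gbp still get `-I1G` rather than `-I0G`.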
41 changes: 35 additions & 6 deletions docs/output.md
@@ -8,13 +8,13 @@ The directories listed below will be created in the results directory after the

The directories comply with Tree of Life's canonical directory structure.

-<!-- Write this documentation describing your workflow's output -->
-
## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

-- [BlobDir](#blobdir) - Output files from `blobtools` and `view` subworkflow
+- [BlobDir](#blobdir) - Output files viewable on a [BlobToolKit viewer](https://github.com/blobtoolkit/blobtoolkit)
+- [Static plots](#static-plots) - Static versions of the BlobToolKit plots
+- [BUSCO](#busco) - BUSCO results
- [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

@@ -25,14 +25,43 @@ The files in the BlobDir dataset which is used to create the online interactive
<details markdown="1">
<summary>Output files</summary>

-- `<accession>/`
-- `*.json`: files generated from genome and alignment coverage statistics
-- `*.png`: static plot images
+- `blobtoolkit/`
+- `<accession>/`
+- `*.json.gz`: files generated from genome and alignment coverage statistics

More information about visualising the data in the [BlobToolKit repository](https://github.com/blobtoolkit/blobtoolkit/tree/main/src/viewer)

</details>

+### Static plots
+
+Images generated from the above blobdir using the [blobtk](https://github.com/blobtoolkit/blobtk) tool.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `blobtoolkit/`
+- `plots/`
+- `*.png` or `*.svg`, depending on the selected output format: static versions of the BlobToolKit plots.
+
+</details>
+
+### BUSCO
+
+BUSCO results generated by the pipeline (all BUSCO lineages that match the classification of the species).
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `blobtoolkit/`
+- `busco/`
+- `*.batch_summary.txt`: BUSCO scores as tab-separated files (1 file per lineage).
+- `*.fasta.txt`: BUSCO scores as formatted text (1 file per lineage).
+- `*.json`: BUSCO scores as JSON (1 file per lineage).
+- `*/`: all output BUSCO files, including the coordinate and sequence files of the annotated genes.
+
+</details>
+
### MultiQC

<details markdown="1">
Expand Down
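According to the updated `docs/output.md`, the BlobDir files are now published as `*.json.gz`, so anything that reads them has to decompress first. The sketch below uses only the Python standard library. The function names are hypothetical, and it assumes each file is a gzip-compressed JSON document, as the extension suggests.

```python
import gzip
import json
from pathlib import Path

SUFFIX = ".json.gz"


def read_blobdir_json(path):
    """Load one gzip-compressed JSON file from a BlobDir."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def list_blobdir_fields(blobdir):
    """Return the sorted base names of the compressed JSON files in a BlobDir."""
    return sorted(p.name[: -len(SUFFIX)] for p in Path(blobdir).glob("*" + SUFFIX))


if __name__ == "__main__":
    import sys

    # Usage: python read_blobdir.py <outdir>/blobtoolkit/<accession>
    for field in list_blobdir_fields(sys.argv[1]):
        print(field)
```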
8 changes: 4 additions & 4 deletions docs/usage.md
@@ -229,8 +229,8 @@ List of tools for any given dataset can be fetched from the API, for example htt

| Dependency | Snakemake | Nextflow |
| ----------------- | --------- | -------- |
-| blobtoolkit | 4.3.2 | 4.3.2 |
-| blast | 2.12.0 | 2.14.1 |
+| blobtoolkit | 4.3.2 | 4.3.9 |
+| blast | 2.12.0 | 2.15.0 |
| blobtk | 0.5.0 | 0.5.1 |
| busco | 5.3.2 | 5.5.0 |
| diamond | 2.0.15 | 2.1.8 |
@@ -240,8 +240,8 @@ List of tools for any given dataset can be fetched from the API, for example htt
| ncbi-datasets-cli | 14.1.0 | |
| nextflow | | 23.10.0 |
| python | 3.9.13 | 3.12.0 |
-| samtools | 1.15.1 | 1.18 |
-| seqtk | 1.3 | |
+| samtools | 1.15.1 | 1.19.2 |
+| seqtk | 1.3 | 1.4 |
| snakemake | 7.19.1 | |
| windowmasker | 2.12.0 | 2.14.0 |

Expand Down