Fixes for production, March 2024 #95

Merged: 36 commits, Apr 17, 2024

muffato (Member) commented on Mar 26, 2024

TOLIT-2021

This pull-request includes all the changes I'm planning for a v0.4 release. It is mostly about:

  • fixing errors faced in production
  • handling large genomes
  • tidying up the generation of the output directories

Summary of the changes:

  • Upgraded all nf-core modules
  • Use the newer genomehubs/blobtoolkit Docker image
  • To address #90 ("Updates in place and broken Nextflow job cache"), I have updated the modules that contribute to the blobdir so that they no longer update their input blobdir in place. However, complete cache reusability is not achievable because the blobdir needs the list of software in a YAML file, which comes from a Nextflow collectFile call that is not reusable (each call creates a new temporary file)
  • Files in the blobdir are now compressed (181 MB → 225 KB on the cricket blobdir)
  • Only the relevant output Busco files are published, and the sequences are tar.gz-ed
  • The busco_diamond_blastp.nf subworkflow is completely restructured to allow the above
  • To handle large genomes, I brought in some configuration bits from the read-mapping and genome-note pipelines. To make things unambiguous, I have removed the process_* label from all modules that have their own withName entry. That's what most of the .diff files are for. After complete optimisation (TOLIT-1931), since there won't be any withLabel process_* in conf/base.config, we'll be able to undo most of those .diff files
  • This required bringing in a few steps to get the genome size and the read counts at the start of the pipeline
  • The Busco settings are slightly different from the genome-note pipeline, as we're finding that interrupted runs (MEMLIMIT/RUNLIMIT) may leave a lot of temporary files which, on the Ubuntu 18.04 farms, take up disk space and RAM and prevent other jobs from running on the machines. Once the new settings are confirmed to work well, they should be backported to the genome-note pipeline.
  • Quick optimisation: I've patched the seqtk/subseq module to not compress the Fasta file, since GUNZIP was being run right after to uncompress it
  • More trace fields by default (the same as in the other pipelines)
  • More complete and accurate list of recognised file extensions for the reads
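
The caching limitation mentioned in the summary above (a collected file being written to a new temporary file on every run) can be illustrated with a small Python sketch. This is not code from the pull request. It assumes that the job cache keys input files on their path, size and modification time, which is how Nextflow's default cache mode is generally described. The helper names and the placeholder YAML content are hypothetical.

```python
import hashlib
import os
import tempfile


def metadata_key(path):
    """Cache key built from path, size and modification time (hypothetical helper)."""
    stat = os.stat(path)
    raw = f"{os.path.abspath(path)}|{stat.st_size}|{stat.st_mtime_ns}"
    return hashlib.sha256(raw.encode()).hexdigest()


def content_key(path):
    """Cache key built from the file contents only (hypothetical helper)."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def write_versions_file(text):
    """Write text to a fresh temporary file, standing in for a collected file."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".yml", delete=False)
    with handle:
        handle.write(text)
    return handle.name


# Placeholder content: the tool name and version are made up for the example.
versions = "TOOL:\n  tool: 0.0.0\n"

# Two "runs" produce identical content, but at two different temporary paths.
first = write_versions_file(versions)
second = write_versions_file(versions)

same_metadata = metadata_key(first) == metadata_key(second)
same_content = content_key(first) == content_key(second)
print("metadata keys match:", same_metadata)
print("content keys match:", same_content)

# Clean up the temporary files created above.
os.remove(first)
os.remove(second)
```

Under these assumptions the two files have identical content but different metadata keys, so any step that consumes the collected file would be treated as having a new input and would be re-run.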

Test runs

Initial failures:

/lustre/scratch123/tol/share/weskit/data/prod/5267/5267b490-b605-4afc-a27d-8d0faf02a932
/lustre/scratch123/tol/share/weskit/data/prod/e3bb/e3bb1a93-c08f-4c25-b8b0-fdd8a945a6f8
/lustre/scratch123/tol/share/weskit/data/prod/5504/55048677-c46b-456a-a624-3c6e17331617
/lustre/scratch123/tol/share/weskit/data/prod/5363/5363f999-4a99-4301-b53b-a196cc6b72e2
/lustre/scratch123/tol/share/weskit/data/prod/dda7/dda74568-7e9a-482c-bd44-b220c885051e

And my runs on the same input data:

/lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/btk/TOLIT-2008/5267
/lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/btk/TOLIT-2008/dda7
/lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/btk/TOLIT-2008/5504
/lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/btk/TOLIT-2008/5363
/lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/btk/TOLIT-2008/dda7

I've also tried three assemblies that had failed previously and left some files in /tmp.

/lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/btk/GCA_946902985.2  # still running
/lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/btk/GCA_963580285.1
/lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/btk/GCA_963921795.1

And the full test:

/lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/btk/results_full

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

muffato self-assigned this on Mar 26, 2024

github-actions bot commented Mar 26, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 8af7fa8

  • ✅ 134 tests passed
  • ❔ 23 tests were ignored
  • ❗ 1 tests had warnings

❗ Test warnings:

❔ Tests ignored:

  • files_exist - File is ignored: CODE_OF_CONDUCT.md
  • files_exist - File is ignored: assets/nf-core-blobtoolkit_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-blobtoolkit_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-blobtoolkit_logo_dark.png
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/config.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: conf/igenomes.config
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • files_unchanged - File ignored due to lint config: CODE_OF_CONDUCT.md
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/config.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/nf-core-blobtoolkit_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-blobtoolkit_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-blobtoolkit_logo_dark.png
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/blobtoolkit/blobtoolkit/.github/workflows/awstest.yml
  • template_strings - template_strings
  • merge_markers - merge_markers

✅ Tests passed:

Run details

  • nf-core/tools version 2.11
  • Run at 2024-04-17 09:22:50

muffato requested a review from BethYates on April 10, 2024 12:54
muffato marked this pull request as ready for review on April 10, 2024 12:54
muffato merged commit f519c4e into main on Apr 17, 2024 (6 checks passed)
muffato deleted the prod_fix branch on April 17, 2024 11:15
Successfully merging this pull request may close these issues.

  • summary.json missing from the output blobdir
  • Updates in place and broken Nextflow job cache