Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pacbio align #63

Merged
merged 33 commits into from
Dec 8, 2023
Merged

Pacbio align #63

merged 33 commits into from
Dec 8, 2023

Conversation

gq1
Copy link
Member

@gq1 gq1 commented Nov 30, 2023

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link

github-actions bot commented Nov 30, 2023

nf-core lint overall result: Passed ✅

Posted for pipeline commit 45a7ba4

+| ✅ 126 tests passed       |+
#| ❔  26 tests were ignored |#

❔ Tests ignored:

  • files_exist - File is ignored: CODE_OF_CONDUCT.md
  • files_exist - File is ignored: assets/nf-core-variantcalling_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-variantcalling_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-variantcalling_logo_dark.png
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/config.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: assets/multiqc_config.yml
  • files_exist - File is ignored: conf/igenomes.config
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/config.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File does not exist: assets/nf-core-variantcalling_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-variantcalling_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-variantcalling_logo_dark.png
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/variantcalling/variantcalling/.github/workflows/awstest.yml
  • multiqc_config - 'assets/multiqc_config.yml' not found

✅ Tests passed:

Run details

  • nf-core/tools version 2.8
  • Run at 2023-12-07 16:24:48

@gq1
Copy link
Member Author

gq1 commented Dec 1, 2023

The profile test_full_aln finished on the farm:

Completed at: 30-Nov-2023 21:33:10
Duration    : 9h 4m 3s
CPU hours   : 242.2
Succeeded   : 76

@gq1
Copy link
Member Author

gq1 commented Dec 4, 2023

#48

@muffato muffato linked an issue Dec 7, 2023 that may be closed by this pull request
Copy link
Member

@muffato muffato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Some alignment files have .fasta in the name, like GCA_947369205.1_OX376310.1_CANBKR010000003.1.fasta.pacbio.icCanRufa1_combined.cram. I guess it's because .baseName only removes the .gz extension. Can you do something like fai.baseName - '.fasta' (as in subworkflows/local/input_filter_split.nf) ?
  • Not from this PR, but since we're talking about the naming convention, I would prefer not adding _combined
  • Bump maxRetries to 5 in conf/base.config, until the resource requirements are backported from the readmapping pipeline

Thinking about the pipeline doing either the alignment sub-workflow or "samtools merge". There is actually a "samtools merge" in the alignment sub-workflow that does pretty much the same thing: align multiple aligned files from the same sample. Could there be just one "samtools merge" that works in both cases ? Also, in the alignment sub-workflow, "samtools merge" is followed by a bunch of samtools commands to extract some statistics and convert to CRAM. Here are my thoughts:

  1. For resource optimisation, we'll very likely have to run "samtools flagstat" on every input file to tune CPU/memory according to the read count
  2. It's maybe not a bad thing to extract all those stats from aligned files given as inputs. It may help with debugging and making sure the files are proper
  3. Is CRAM conversion happening twice ? There seems to be one in convert_stats.nf and one in input_filter_split.nf. Maybe we can remove the first one ?

In summary, and I'm happy to discuss that over a call if that's easier, here is what I would propose:

  1. Stop the alignment sub-workflow after minimap.
  2. Run samtools merge at this stage in both alignment and pre-aligned modes.
  3. The "convert stats" sub-workflow could be just "stats" and run samtools stats/flagstat/idxstat on all aligned files before input_filter_split.

nextflow_schema.json Outdated Show resolved Hide resolved
@gq1
Copy link
Member Author

gq1 commented Dec 7, 2023

  • Some alignment files have .fasta in the name, like GCA_947369205.1_OX376310.1_CANBKR010000003.1.fasta.pacbio.icCanRufa1_combined.cram. I guess it's because .baseName only removes the .gz extension. Can you do something like fai.baseName - '.fasta' (as in subworkflows/local/input_filter_split.nf) ?
  • Not from this PR, but since we're talking about the naming convention, I would prefer not adding _combined
  • Bump maxRetries to 5 in conf/base.config, until the resource requirements are backported from the readmapping pipeline

All being updated now.

@gq1
Copy link
Member Author

gq1 commented Dec 7, 2023

Currently we tried to import 3 subworkflows from readmapping pipeline and didn't change much.
In the future if we can push these subworkflows to nf-core, we can directly import them.
That is why we don't need to change much for now, keep these subworkflows same way as in other pipelines (3 now?), maybe easy to maintain for now?

Thinking about the pipeline doing either the alignment sub-workflow or "samtools merge". There is actually a "samtools merge" in the alignment sub-workflow that does pretty much the same thing: align multiple aligned files from the same sample. Could there be just one "samtools merge" that works in both cases ? Also, in the alignment sub-workflow, "samtools merge" is followed by a bunch of samtools commands to extract some statistics and convert to CRAM. Here are my thoughts:

  1. For resource optimisation, we'll very likely have to run "samtools flagstat" on every input file to tune CPU/memory according to the read count
  2. It's maybe not a bad thing to extract all those stats from aligned files given as inputs. It may help with debugging and making sure the files are proper

I suppose the aligned input bam/cram file will come with the stats file normally.

  1. Is CRAM conversion happening twice ? There seems to be one in convert_stats.nf and one in input_filter_split.nf. Maybe we can remove the first one ?

The first process only convert the aligned reads to CRAM and the second one will filter the reads as well.
We can skip the first one if we don't want to keep the original aligned reads.

@gq1
Copy link
Member Author

gq1 commented Dec 8, 2023

The latest results here from nextflow run variantcalling -profile test_full_align,sanger,singularity --align

/nfs/users/nfs_g/gq2/lustre123_gq2/git/results

@muffato
Copy link
Member

muffato commented Dec 8, 2023

We discussed this this morning.

  • Contrary to the blobtoolkit pipeline, the alignments done here are the best we can. So they're worth being exposed to the results directory in CRAM. Let's keep that SAMTOOLS_VIEW
  • Sub-workflows will be reviewed and deposited in nf-core next year. In the meantime, let's keep the ones you copied from readmapping

@gq1 gq1 merged commit 5f528f7 into dev Dec 8, 2023
6 checks passed
@muffato muffato deleted the pacbio_align branch December 8, 2023 22:38
@muffato
Copy link
Member

muffato commented Dec 8, 2023

3. The "convert stats" sub-workflow could be just "stats" and run samtools stats/flagstat/idxstat on all aligned files before input_filter_split.

FYI. Priyanka had the same idea for the read-mapping pipeline: sanger-tol/readmapping#63

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add optional read mapping subworkflow
2 participants