Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Formatting custom SAM header #21

Merged
merged 17 commits into from
Jul 5, 2024
Merged

Formatting custom SAM header #21

merged 17 commits into from
Jul 5, 2024

Conversation

tkchafin
Copy link
Contributor

@tkchafin tkchafin commented Jun 26, 2024

Closes #9

Added new sub workflow which formats a new output file *.header.sam containing the metadata items requested in #9.

The new file contains the ftp source URL (UR:), species name (SP:), GCA accession (AS:), and alternate name (AN:) taken from the Assigned-Molecule column in the assembly/{genome}.assembly_report.txt files.

Other changes:

  • Changes to files written/emitted by download_ sub workflow
  • small typos in USAGE.md

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link

This PR is against the main branch ❌

  • Do not close this PR
  • Click Edit and change the base to dev
  • This CI test will remain failed until you push a new commit

Hi @tkchafin,

It looks like this pull-request is has been made against the sanger-tol/insdcdownload main branch.
The main branch on nf-core repositories should always contain code from the latest release.
Because of this, PRs to main are only allowed if they come from the sanger-tol/insdcdownload dev branch.

You do not need to close this PR, you can change the target branch to dev by clicking the "Edit" button at the top of this page.
Note that even after this, the test will continue to show as failing until you push a new commit.

Thanks again for your contribution!

@tkchafin tkchafin requested a review from muffato June 26, 2024 07:56
Copy link

github-actions bot commented Jun 26, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 5258712

+| ✅ 129 tests passed       |+
#| ❔  23 tests were ignored |#
!| ❗   1 tests had warnings |!

❗ Test warnings:

❔ Tests ignored:

  • files_exist - File is ignored: assets/multiqc_config.yml
  • files_exist - File is ignored: assets/nf-core-insdcdownload_logo_light.png
  • files_exist - File is ignored: conf/igenomes.config
  • files_exist - File is ignored: docs/images/nf-core-insdcdownload_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-insdcdownload_logo_dark.png
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/config.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/config.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File does not exist: assets/nf-core-insdcdownload_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-insdcdownload_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-insdcdownload_logo_dark.png
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/insdcdownload/insdcdownload/.github/workflows/awstest.yml
  • multiqc_config - 'assets/multiqc_config.yml' not found

✅ Tests passed:

Run details

  • nf-core/tools version 2.8
  • Run at 2024-07-03 10:22:47

@tkchafin tkchafin changed the base branch from main to dev June 26, 2024 08:21
@tkchafin
Copy link
Contributor Author

fixed base from main to dev

@muffato
Copy link
Member

muffato commented Jun 26, 2024

(tip: when you create a PR, you can add some text like "Closes #9" or select the issue in the "Development" menu on the right-hand side. That'll update the board automatically and move the issue to "In progress" and "Done")

@tkchafin
Copy link
Contributor Author

tkchafin commented Jun 26, 2024 via email

Copy link
Member

@muffato muffato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks good and clean. I've got these comments:

  1. You're calling PREPARE_HEADER twice, once on each .dict file. But the .dict files only differ because their UR field point to separate Fasta files, and after you've edited UR as per the "SOURCE", they become identical. We can probably just compute the header once ?
  2. The "SOURCE" that you write down in the UR field is the path of the directory on the NCBI FTP. I think it should rather be a path to the file.
  3. Finally, I think we should un-publish the samtools dict file because the new header is a richer version of it

modules/local/build_sam_header.nf Outdated Show resolved Hide resolved
@tkchafin tkchafin linked an issue Jun 27, 2024 that may be closed by this pull request
@tkchafin
Copy link
Contributor Author

tkchafin commented Jun 27, 2024

That looks good and clean. I've got these comments:

  1. You're calling PREPARE_HEADER twice, once on each .dict file. But the .dict files only differ because their UR field point to separate Fasta files, and after you've edited UR as per the "SOURCE", they become identical. We can probably just compute the header once ?

The only reason for this was that I was worried that in the case of any hard-masking the checksum values written to the dict files could differ? If we know that only soft-masking will ever be run than I don't think that should be possible though

@tkchafin
Copy link
Contributor Author

Ok all issues dealt with -- removed the 2nd reheader call as I realised the checksums will always be identical between the two. Ready for review!

Copy link
Member

@muffato muffato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on AN vs AM ! I didn't spot that.

My comments mostly all about continuing the simplification of non generating headers for the masked file.

subworkflows/local/prepare_header.nf Outdated Show resolved Hide resolved
subworkflows/local/prepare_header.nf Outdated Show resolved Hide resolved
conf/modules.config Outdated Show resolved Hide resolved
modules/local/build_sam_header.nf Outdated Show resolved Hide resolved
@tkchafin tkchafin self-assigned this Jul 3, 2024
@tkchafin tkchafin merged commit 96e820a into dev Jul 5, 2024
6 checks passed
@tkchafin tkchafin deleted the samtools_reheader branch July 5, 2024 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pull additional metadata as header file
2 participants