Formatting custom SAM header #21
This PR is against the
Fixed base from main to dev.
(Tip: when you create a PR, you can add some text like "Closes #9" or select the issue in the "Development" menu on the right-hand side. That'll update the board automatically and move the issue to "In progress" and "Done".)
Good tip! I was wondering why it didn't update the board...
That looks good and clean. I've got these comments:
- You're calling `PREPARE_HEADER` twice, once on each `.dict` file. But the `.dict` files only differ because their `UR` fields point to separate Fasta files, and after you've edited `UR` as per the "SOURCE", they become identical. We can probably just compute the header once?
- The "SOURCE" that you write down in the `UR` field is the path of the directory on the NCBI FTP. I think it should rather be a path to the file.
- Finally, I think we should un-publish the samtools dict file because the new header is a richer version of it.
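To illustrate the first comment, here is a minimal Python sketch, assuming two `@SQ` lines that differ only in `UR`. The helper name `set_ur`, the sequence values, and the URL are all hypothetical placeholders, not taken from the pipeline.

```python
def set_ur(sq_line: str, source: str) -> str:
    """Replace the UR field of a tab-separated @SQ line with `source`."""
    fields = sq_line.rstrip("\n").split("\t")
    rewritten = [f"UR:{source}" if f.startswith("UR:") else f for f in fields]
    return "\t".join(rewritten)


# Hypothetical @SQ lines for the unmasked and masked Fasta files:
# same SN, LN and M5, different UR.
unmasked = "@SQ\tSN:chr1\tLN:1000\tM5:0123456789abcdef0123456789abcdef\tUR:file:///work/genome.fa"
masked = "@SQ\tSN:chr1\tLN:1000\tM5:0123456789abcdef0123456789abcdef\tUR:file:///work/genome.masked.fa"

source = "https://example.org/genome.fa.gz"  # placeholder, not a real NCBI path

# Once UR is overwritten with the same source, the two lines are identical.
assert set_ur(unmasked, source) == set_ur(masked, source)
```

If the two inputs really do share every other field, running the header step once is enough.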
The only reason for this was that I was worried that, in the case of any hard-masking, the checksum values written to the dict files could differ. If we know that only soft-masking will ever be run, then I don't think that should be possible though.
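The concern above can be checked with a small Python sketch. It assumes the `M5` value follows the SAM specification's definition (characters outside ASCII 33-126 removed, sequence uppercased, then MD5); the function name `sam_m5` and the toy sequences are hypothetical.

```python
import hashlib


def sam_m5(sequence: str) -> str:
    """MD5 of a sequence, normalised as described for the SAM M5 tag:
    drop characters outside ASCII 33-126, uppercase, then hash."""
    cleaned = "".join(c for c in sequence if 33 <= ord(c) <= 126).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


original = "ACGTACGTTTGACA"
soft_masked = "ACGTacgtttGACA"  # repeats in lowercase
hard_masked = "ACGTNNNNNNGACA"  # repeats replaced by N

# Soft-masking only changes case, so the checksum is unchanged.
assert sam_m5(soft_masked) == sam_m5(original)
# Hard-masking changes the bases themselves, so the checksum differs.
assert sam_m5(hard_masked) != sam_m5(original)
```

Under that assumption, the two `.dict` files can only disagree on `M5` if hard-masking is ever introduced.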
OK, all issues dealt with: removed the 2nd reheader call as I realised the checksums will always be identical between the two. Ready for review!
Good catch on `AN` vs `AM`! I didn't spot that.
My comments are mostly about continuing the simplification of not generating headers for the masked file.
Co-authored-by: Matthieu Muffato <[email protected]>
Closes #9
Added a new subworkflow which formats a new output file, `*.header.sam`, containing the metadata items requested in #9.
The new file contains the FTP source URL (`UR:`), species name (`SP:`), GCA accession (`AS:`), and alternate name (`AN:`), the last taken from the Assigned-Molecule column in the `assembly/{genome}.assembly_report.txt` files.
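As a rough illustration of how such a header line could be assembled, here is a Python sketch. It is not the subworkflow's actual code. It assumes that the assembly report is tab-separated with its column names on the last `#` line, that `SN` matches the `GenBank-Accn` column, and that `na` marks sequences without an assigned molecule. The function names and all sample values are hypothetical.

```python
def read_assigned_molecules(report_text: str) -> dict:
    """Map GenBank-Accn -> Assigned-Molecule from an assembly report."""
    columns = None
    mapping = {}
    for line in report_text.splitlines():
        if line.startswith("#"):
            # Assumption: the last comment line holds the column names.
            columns = line.lstrip("# ").split("\t")
            continue
        if not line.strip() or columns is None:
            continue
        row = dict(zip(columns, line.split("\t")))
        mapping[row["GenBank-Accn"]] = row["Assigned-Molecule"]
    return mapping


def annotate_sq(sq_line, molecules, source, species, accession):
    """Rewrite one @SQ line with UR, AS, SP and (when known) AN fields."""
    managed = ("UR:", "AS:", "SP:", "AN:")
    fields = [f for f in sq_line.split("\t") if not f.startswith(managed)]
    name = next(f[3:] for f in fields if f.startswith("SN:"))
    fields += [f"UR:{source}", f"AS:{accession}", f"SP:{species}"]
    molecule = molecules.get(name)
    if molecule and molecule != "na":  # assumption: "na" means unassigned
        fields.append(f"AN:{molecule}")
    return "\t".join(fields)


# Hypothetical report excerpt and @SQ line.
report = (
    "# Assembly name: example\n"
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\tGenBank-Accn\n"
    "chr_1\tassembled-molecule\t1\tXX000001.1\n"
    "scaf_2\tunplaced-scaffold\tna\tXX000002.1\n"
)
molecules = read_assigned_molecules(report)
line = annotate_sq(
    "@SQ\tSN:XX000001.1\tLN:1000\tM5:abc\tUR:file:///tmp/x.fa",
    molecules,
    "https://example.org/genome.fa.gz",  # placeholder source URL
    "Example species",
    "GCA_000000000.1",
)
print(line)
```

Looking columns up by name rather than by position keeps the sketch independent of the exact column order in the report.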
Other changes:
PR checklist
- `nf-core lint`
- `nextflow run . -profile test,docker --outdir <OUTDIR>`
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).