completed permanentFail #168

Open
CeciliaDeng opened this issue Oct 29, 2024 · 11 comments
Labels
bug Something isn't working

@CeciliaDeng
Collaborator

Description of the bug

Hi @GallVp ,

The assemblyQC pipeline failed on a set of transcriptome assemblies. The .nextflow.log ended with DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye, while the slurm log showed:

[job seqtransform_step] completed permanentFail
[step seqtransform_step] completed permanentFail
[workflow GenerateCleanedFasta] completed permanentFail
[step GenerateCleanedFasta] completed permanentFail
[workflow ] completed permanentFail

This is potentially related to a SeqID issue in NCBI FCS: header lines of the form '>SeqID with_notes' are common in FASTA files.

Thank you.

Command used and terminal output

cd $path/to/assemblyqc
sbatch pfr_assemblyqc_Resume

Relevant files

No response

System information

No response

@CeciliaDeng CeciliaDeng added the bug Something isn't working label Oct 29, 2024
@CeciliaDeng
Collaborator Author

Setting "ncbi_fcs_adaptor_skip" and "ncbi_fcs_gx_skip" to true, the pipeline worked okay.

@GallVp
Member

GallVp commented Oct 30, 2024

> Setting "ncbi_fcs_adaptor_skip" and "ncbi_fcs_gx_skip" to true, the pipeline worked okay.

Yes, because you skipped the tool which was failing?

I'll try to reproduce this issue and possibly fix it by sanitising the fasta header.
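
For illustration only, here is a minimal sketch of what sanitising a header could look like, assuming the problem is the free text that follows the sequence ID (this has not been confirmed in the thread). `sanitise_fasta_headers` is a hypothetical helper, not a function from assemblyQC or NCBI FCS, and the sequence line in the example is made up.

```python
def sanitise_fasta_headers(lines):
    """Yield FASTA lines with each header reduced to its first whitespace-delimited token."""
    for line in lines:
        if line.startswith(">"):
            # Keep only the sequence ID; drop any free-text description after it.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            yield ">" + seq_id + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line


# Hypothetical two-line record; the sequence is invented for the example.
example = [
    ">g43.t1 type=CDS; aalen=194,100%,complete\n",
    "ATGGCC\n",
]
cleaned = list(sanitise_fasta_headers(example))
print("".join(cleaned), end="")
```

Truncating at the first whitespace keeps IDs stable for downstream tools, but it would need a uniqueness check if two records only differ in their descriptions.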

@GallVp GallVp added this to the 2.2.0 milestone Oct 30, 2024
@SarahBailey1998

Hi @CeciliaDeng @GallVp

I got the same error and when I checked the .command.log in the working directory I saw:

>h1tg000112l_1
        WARNING: Too many Ns in sequence: 17557 out of 17557 = 100.0%

When I then checked that sequence in the genome assembly, it was indeed 100% Ns:

>h1tg000112l_1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN.....

I used hifiasm and purge_dups to produce this genome assembly, so I am guessing one of those tools isn't working properly.

@GallVp
Member

GallVp commented Nov 3, 2024

Hi @SarahBailey1998

Thank you for the post. That's very useful.

I'll soon start work on this issue and will include a test case to check if I have fixed it.

For sequences that are 100% N, a simple solution is for the pipeline to remove them before passing the FASTA to the FCS adaptor check, because these sequences clearly don't contain any contamination.

@SarahBailey1998

Thanks @GallVp

I actually found this error useful because I wasn't aware that those contigs were just Ns. Maybe, if the pipeline excludes them from the adaptor check, it could add a note that they were present?
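
A rough sketch of that idea, assuming records are held in memory as (header, sequence) pairs: all-N contigs are set aside and reported rather than silently dropped. The names `read_fasta` and `split_all_n` and the `contig_ok` record are hypothetical and only for illustration; this is not how assemblyQC handles it.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_all_n(records):
    """Separate records whose sequence consists only of N/n from the rest."""
    kept, all_n = [], []
    for header, seq in records:
        if seq and set(seq.upper()) == {"N"}:
            all_n.append((header, seq))
        else:
            kept.append((header, seq))
    return kept, all_n


# Made-up input: one all-N contig (12 bp over two lines) and one ordinary contig.
example = [">h1tg000112l_1", "NNNNNNNN", "NNNN", ">contig_ok", "ACGTNNAC"]
kept, all_n = split_all_n(read_fasta(example))
for header, seq in all_n:
    # Report the excluded contigs so the user still learns they exist.
    print(f"NOTE: {header} is 100% N ({len(seq)} bp) and was excluded")
```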

@GallVp
Member

GallVp commented Nov 3, 2024

@SarahBailey1998
Do you think it should be a validation failure then? Because it does not make sense to have 100% NNNs in a sequence.

@SarahBailey1998

Yeah I think so, it definitely makes no sense

@GallVp
Member

GallVp commented Nov 3, 2024

> Yeah I think so, it definitely makes no sense

Thanks. I will track this objective under #173

@rosscrowhurst
Collaborator

rosscrowhurst commented Nov 3, 2024

Contigs composed entirely of Ns should not be created by the assembler in the first place; why it did that should be investigated. Introducing a check of N count vs contig length is easy to do, but removing contigs with greater than x% Ns or greater than x% unpolished bases (likely for contigs that receive some form of polishing during assembly) would also remove them.
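
As a sketch of the N-count check only (the unpolished-base case is not covered here), the following assumes records are (header, sequence) pairs and that the threshold x is given as a fraction between 0 and 1. `n_fraction`, `filter_by_n_fraction` and the example records are hypothetical.

```python
def n_fraction(seq):
    """Return the fraction of bases in seq that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def filter_by_n_fraction(records, max_n_fraction):
    """Keep (header, sequence) records whose N fraction does not exceed the threshold."""
    return [(h, s) for h, s in records if n_fraction(s) <= max_n_fraction]


# Made-up records with 100%, 50% and 0% Ns.
records = [
    ("all_n", "NNNNNNNNNN"),
    ("half_n", "ACGTANNNNN"),
    ("clean", "ACGTACGTAC"),
]
passing = filter_by_n_fraction(records, max_n_fraction=0.5)
print([header for header, _ in passing])
```

With a threshold of 0.5 the all-N record is dropped while the other two pass; a threshold just below 1.0 would target only contigs that are entirely N.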

@SarahBailey1998

Thanks @rosscrowhurst

I was surprised to discover these problem contigs and am investigating the cause

@GallVp GallVp modified the milestones: 2.2.0, 2.3.0 Nov 4, 2024
@CeciliaDeng
Collaborator Author

Hi @SarahBailey1998, in my case the inputs were transcript sequences, and they contained few Ns. All the SeqID lines carry additional information (e.g. ">g43.t1 type=CDS; aalen=194,100%,complete"). I suspect that is what caused the NCBI FCS tools to fail.

@GallVp GallVp removed their assignment Nov 21, 2024

4 participants