Non-unique of `sample` values across several patients causes name clashes #1451

hsk6328 · 2024-03-29T16:53:19Z

Description of the bug

Hello, I am very grateful for your development of the Sarek pipeline. This pipeline has been very helpful to me in handling WGS analysis. However, I encountered an error when testing the pipeline with the test dataset. I would like to ask what might have caused this error.

When I provide a pair of normal and tumor data, an error occurs when calling BAM_VARIANT_CALLING_SOMATIC_ALL in the variant_calling step. The error message is as follows:

ERROR ~ Error executing process > 'NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:MPILEUP_NORMAL:CAT_MPILEUP (1)'

Caused by:
  Process `NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:MPILEUP_NORMAL:CAT_MPILEUP` input file name collision -- There are multiple input files for each of the following file names: HCC1395T_vs_HCC1395N.mpileup.gz


Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

And this is the sample.stomatic.csv:

patient,sex,status,sample,lane,fastq_1,fastq_2
HCC1395,XX,0,HCC1395N,1,./SRR7890919.10M_1.fastq.gz,./SRR7890919.10M_2.fastq.gz
HCC1395,XX,1,HCC1395T,1,./SRR7890918.10M_1.fastq.gz,./SRR7890918.10M_2.fastq.gz

This is the configuration file that I set up, with other parameters kept at default values:

params {
    config_profile_name        = 'WES Demo'
    max_cpus   = 8

    input = '/mnt/disk0/01.nf-core-pipelines/demo/sarek_3.4.0/wes.demo/sample.stomatic.csv'

    // Other params
    tools       = 'controlfreec,vep'
    split_fastq = 20000000
    intervals   = '/mnt/disk0/01.nf-core-pipelines/demo/sarek_3.4.0/wes.demo/S07604624_Padded_Agilent_SureSelectXT_allexons_V6_UTR.bed'
    wes         = true
}

Could you please provide valuable suggestions for this runtime error? Thank you very much!

Command used and terminal output

nextflow run ${nfcorePath}/nf-core-sarek_3.4.0/3_4_0 -profile singularity -c wes.conf --outdir ./outdir --genome GATK.GRCh38

Relevant files

nextflow.log

System information

Nextflow version: 23.10.1 build 5891
System: Linux 3.10.0-1160.108.1.el7.x86_64
Runtime: Groovy 3.0.19 on OpenJDK 64-Bit Server VM 11.0.22+7-LTS
Encoding: UTF-8 (ANSI_X3.4-1968)
Version of nf-core/sarek (3.4.0)


brandon-hastings · 2024-08-08T21:26:04Z

I am also getting a similar input file name collision error when NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES is called:

Aug-08 16:16:14.925 [Actor Thread 456] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for task: name=NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES (2); work-dir=null error [nextflow.exception.ProcessUnrecoverableException]: Process NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES input file name collision -- There are multiple input files for each of the following file names: Tumor1-L3.0010.bam, Tumor1-L3.0001.bam, Tumor1-L3.0007.bam, Tumor1-L3.0008.bam, Tumor1-L3.0004.bam, Tumor1-L3.0012.bam, Tumor1-L3.0003.bam, Tumor1-L3.0009.bam, Tumor1-L3.0005.bam, Tumor1-L3.0011.bam, Tumor1-L3.0002.bam, Tumor1-L3.0006.bam

nextflow.log

hsk6328 · 2024-08-08T21:26:39Z

Thank you. I have received your email and will reply to you shortly.

asp8200 · 2024-08-09T06:10:41Z

@brandon-hastings : Which version of Sarek are you using? Can you reproduce the error with the latest v3.4.3?

brandon-hastings · 2024-08-31T08:19:51Z

I was using v3.4.2, but I ran the pipeline again with v3.4.3 and received the same error.

hsk6328 · 2024-08-31T08:20:25Z

Thank you. I have received your email and will reply to you shortly.

brandon-hastings · 2024-09-02T13:31:51Z

I was able to look into this more for my specific case. The offending files have the same name within different folders in the work directory. An example of the file structure I am seeing would be:

|-work
   |--3a
      |--82845aa9f7c8a2e485c011c0bb4a5d
         |--bwa
         |--0004.Tumor1-L3_1.fastp.fastq.gz
         |--0004.Tumor1-L3_2.fastp.fastq.gz
         |--Tumor1-L3.0004.bam
   |--5e
      |--f98724d17e629dbde43af318021266
         |--bwa
         |--0004.Tumor1-L3_1.fastp.fastq.gz
         |--0004.Tumor1-L3_2.fastp.fastq.gz
         |--Tumor1-L3.0004.bam

Comparing these bam files via cmp or md5 reveals that they are different files, even after converting them to sam format first.
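The check described above can be sketched in Python. This is only an illustration, not part of Sarek or Nextflow: the helper names `md5sum` and `same_name_different_content` are hypothetical, and the demo uses a throwaway temporary directory in place of a real `work` directory.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_name_different_content(work_dir, pattern="*.bam"):
    """Map each file name found more than once with differing content to its paths."""
    by_name = defaultdict(list)
    for path in Path(work_dir).rglob(pattern):
        if path.is_file():
            by_name[path.name].append(path)
    report = {}
    for name, paths in by_name.items():
        digests = {md5sum(p) for p in paths}
        if len(paths) > 1 and len(digests) > 1:
            report[name] = sorted(str(p) for p in paths)
    return report


# Demo on a temporary directory that mimics two task folders (hypothetical names).
with tempfile.TemporaryDirectory() as work:
    for task_dir, content in (("3a/task_one", b"first"), ("5e/task_two", b"second")):
        target = Path(work, task_dir)
        target.mkdir(parents=True)
        (target / "Tumor1-L3.0004.bam").write_bytes(content)
    clashes = same_name_different_content(work)

print(sorted(clashes))
```

Files that share a name and also share a digest are not reported, so symlinked copies of the same BAM should not show up as clashes.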

It might be worth noting that the pipeline failed multiple times due to errors pulling singularity images or errors with processes exceeding running time limits and was run again using the -resume flag. This cycle continued about 10 times, so I’m not sure if the duplicate folders are due to an error in caching the pipeline progress and resuming or if this occurred during the original fastq splitting process in FASTP.

FriederikeHanssen · 2024-09-02T14:48:53Z

Thanks for investigating, @brandon-hastings. Would you be able to try to reproduce this with a completely clean work directory, so we can see whether or not the work directory structure comes from cached steps?

brandon-hastings · 2024-09-03T10:03:22Z

Yes I am running the workflow again now from the beginning with a clean working directory and I can check for the duplicate directory structure if it crashes.

I have previously replicated the behavior where multiple crashes and resumes resulted in the work directory structure I presented above, one in version 3.4.2 and the other in version 3.4.3, both of which were started from the beginning of the Sarek workflow with a clean work directory.

brandon-hastings · 2024-09-19T14:41:42Z

I found that the error was caused by the naming in my sample sheet; I have included a minimal example as a txt file. I had unique patient IDs but was reusing sample IDs across patients, which I believe led to the file naming error I saw during the FASTP split, because the BAM file is named using only the sample ID and lane.

samplesheet:
samplesheet_example.txt

I managed to solve it by manually adding the patient name to the beginning of each sample ID and restarting the pipeline.
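For anyone wanting to script that workaround instead of editing by hand, here is a minimal sketch. It assumes the column names from the samplesheet at the top of this issue (`patient`, `sample`, and so on); the function name `prefix_samples_with_patient`, the underscore separator, and the example rows are hypothetical.

```python
import csv
import io


def prefix_samples_with_patient(csv_text):
    """Return samplesheet text in which every sample ID starts with its patient ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        prefix = row["patient"] + "_"
        # Skip rows that already carry the prefix, so the function can be re-run safely.
        if not row["sample"].startswith(prefix):
            row["sample"] = prefix + row["sample"]
        writer.writerow(row)
    return out.getvalue()


# Made-up rows: "Tumor1" is reused for two different patients.
example = """patient,sex,status,sample,lane,fastq_1,fastq_2
P1,XX,1,Tumor1,L1,p1_1.fastq.gz,p1_2.fastq.gz
P2,XY,1,Tumor1,L1,p2_1.fastq.gz,p2_2.fastq.gz
P2,XY,0,P2_Normal,L1,p2n_1.fastq.gz,p2n_2.fastq.gz
"""

print(prefix_samples_with_patient(example))
```

After the rewrite, the two `Tumor1` rows become `P1_Tumor1` and `P2_Tumor1`, while `P2_Normal` is left alone.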

hsk6328 · 2024-09-19T14:42:19Z

Thank you. I have received your email and will reply to you shortly.

FriederikeHanssen · 2024-09-19T16:45:14Z

Ah, interesting. Thanks for investigating, @brandon-hastings. I will mark this issue with the label input validation. We should add a validation step that makes sure sample IDs are unique across patients. They still need to be the same within a single patient to account for multiple lanes.
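Until such a check exists in the pipeline, the rule can be tested on a samplesheet beforehand. The sketch below is not Sarek's validation code; it only illustrates the rule stated above, with a hypothetical function name and made-up rows, and it assumes the `patient` and `sample` column names from the samplesheet in this issue.

```python
import csv
import io
from collections import defaultdict


def samples_shared_across_patients(csv_text):
    """Return {sample ID: [patients]} for sample IDs used by more than one patient."""
    patients_by_sample = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        patients_by_sample[row["sample"]].add(row["patient"])
    return {
        sample: sorted(patients)
        for sample, patients in patients_by_sample.items()
        if len(patients) > 1
    }


# Made-up rows: P1 has Tumor1 on two lanes (allowed); P2 reuses Tumor1 (clash).
example = """patient,sex,status,sample,lane,fastq_1,fastq_2
P1,XX,1,Tumor1,L1,a_1.fastq.gz,a_2.fastq.gz
P1,XX,1,Tumor1,L2,b_1.fastq.gz,b_2.fastq.gz
P2,XY,1,Tumor1,L1,c_1.fastq.gz,c_2.fastq.gz
P2,XY,0,Normal2,L1,d_1.fastq.gz,d_2.fastq.gz
"""

print(samples_shared_across_patients(example))
```

Repeating a sample ID within one patient is not reported, since that is how multiple lanes are expressed.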

brandon-hastings · 2024-09-20T14:41:39Z

I just forked it to take a look for myself and I should be able to submit a pull request to include input validation regarding this issue with an updated subworkflow test sometime over the next few days.

Dorothynyamai · 2024-11-22T09:44:03Z

Hi @brandon-hastings and @FriederikeHanssen, thank you for your insights. I am getting the same error even though I use the unique tumor IDs as my patient IDs. The difference in my case is that I have several lanes for the normal samples, which may be the issue. Do you have any suggestions on how I can change my sample sheet to solve this error? Thank you so much.

FriederikeHanssen · 2024-11-22T09:49:46Z

Could you share your samplesheet please?

Dorothynyamai · 2024-11-22T10:16:42Z

@FriederikeHanssen Sure, here is a snippet of my samplesheet: testsample_sheet.csv
Thank you

FriederikeHanssen · 2024-11-22T10:38:56Z

The values in the sample column need to be unique to each patient. So you might do something like this: instead of "normal", use "SD1590_normal". I agree that we should try handling this internally in the future.

Dorothynyamai · 2024-11-22T10:42:43Z

Hi @FriederikeHanssen, should I change this only for the normal samples, or should I also do it for the tumor samples?
Thank you so much for the quick response.

Dorothynyamai · 2024-11-22T11:20:32Z

@FriederikeHanssen Thank you so much for the recommendation to revise my samplesheet. I have submitted the job and hopefully it will run fine. I have another sample which has several files with the same lane name for one tumor; however, the flowcell ID is different. Kindly suggest how I can modify the samplesheet to avoid file name collisions. Here is a snippet of the sample sheet.
Thank you in advance.
SLX-14388_sampleshet_uniqueids.txt

FriederikeHanssen · 2024-11-22T11:28:00Z

The lane value for the same patient-sample combination must be unique. For example, you have the combination SD0329,1,SD0329_Tumor,lane_5, several times. If those rows all belong to the same sample, you need to make the lane column unique, for example:

SD0329,1,SD0329_Tumor,lane_5_1,
SD0329,1,SD0329_Tumor,lane_5_2,

and so on
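The renaming shown above could be automated roughly as follows. This is a hedged sketch: `make_lanes_unique` is a hypothetical helper, the rows are plain dictionaries with only the three relevant keys, and the `_1`, `_2` suffix scheme simply mirrors the example in this comment.

```python
from collections import Counter


def make_lanes_unique(rows):
    """Suffix repeated lane values (_1, _2, ...) within each patient-sample pair."""
    totals = Counter((r["patient"], r["sample"], r["lane"]) for r in rows)
    seen = Counter()
    fixed = []
    for row in rows:
        key = (row["patient"], row["sample"], row["lane"])
        new_row = dict(row)
        # Only rename lanes that actually occur more than once for this pair.
        # Note: this does not guard against a suffixed name that already exists.
        if totals[key] > 1:
            seen[key] += 1
            new_row["lane"] = f"{row['lane']}_{seen[key]}"
        fixed.append(new_row)
    return fixed


rows = [
    {"patient": "SD0329", "sample": "SD0329_Tumor", "lane": "lane_5"},
    {"patient": "SD0329", "sample": "SD0329_Tumor", "lane": "lane_5"},
    {"patient": "SD0329", "sample": "SD0329_Tumor", "lane": "lane_6"},
]

for row in make_lanes_unique(rows):
    print(row["lane"])
```

Lanes that are already unique, such as `lane_6` here, are left unchanged.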

Dorothynyamai · 2024-11-22T11:30:22Z

@FriederikeHanssen Thank you so much for the response. I will make these changes.
Thanks

hsk6328 added the bug label Mar 29, 2024

FriederikeHanssen added the input validation label Sep 19, 2024

FriederikeHanssen changed the title ~~Some issues caused an error in the variant_calling step: There are multiple input files for each of the following file names~~ Non-unique of sample values across several patients causes name clashes Sep 19, 2024
