`--USE_END_IN_UNPAIRED_READS` and `--USE_UNPAIRED_CLIPPED_END` options for `picard MarkDuplicates` have no effect #1959

ChrisHIV · 2024-04-16T22:52:27Z

Bug Report

Affected tool(s)

picard MarkDuplicates with the --USE_END_IN_UNPAIRED_READS and --USE_UNPAIRED_CLIPPED_END options

Affected version(s)

Latest public release version [3.1.1]

Description

The --USE_END_IN_UNPAIRED_READS and --USE_UNPAIRED_CLIPPED_END options have no effect in minimal test data. From my understanding of the help messages for these options (reproduced at the bottom of this message), the former should toggle whether or not we consider unpaired reads to be duplicates if they have the same start position but different end positions, and the latter should toggle whether clipped ends of unpaired reads are included or excluded when determining duplicates (I do not understand whether inclusion/exclusion corresponds to true/false for this bool or vice versa).

Steps to reproduce

The attached reads_sam.txt has 8 reads mapped to an 8-bp reference genome, attached as
reference_fasta.txt. (Both files have had their extensions changed to .txt to allow attachment.) The attached image shows the reads for convenience.

These reads all have the central 6bp mapped, but they vary in whether there is an additional base at one end or the other and whether that base is mapped or clipped. After renaming reads_sam.txt to reads_sam.sam to clarify the format, run e.g.

input=reads_sam.sam
java -jar picard.jar MarkDuplicates -I $input --REMOVE_DUPLICATES -M dedup_stats.txt -O test_dedup_NoFlags.sam
java -jar picard.jar MarkDuplicates -I $input --REMOVE_DUPLICATES -M dedup_stats.txt -O test_dedup_USE_END_IN_UNPAIRED_READS.sam --USE_END_IN_UNPAIRED_READS
java -jar picard.jar MarkDuplicates -I $input --REMOVE_DUPLICATES -M dedup_stats.txt -O test_dedup_USE_UNPAIRED_CLIPPED_END.sam --USE_UNPAIRED_CLIPPED_END
java -jar picard.jar MarkDuplicates -I $input --REMOVE_DUPLICATES -M dedup_stats.txt -O test_dedup_USE_END_IN_UNPAIRED_READS__USE_UNPAIRED_CLIPPED_END.sam --USE_END_IN_UNPAIRED_READS --USE_UNPAIRED_CLIPPED_END

Expected behavior

The four output sam files corresponding to the four combinations of these two binary flags should vary in which subset of reads are included after removing duplicates, because the reads vary in their potential to be considered duplicates based on the description of the flags.

Actual behavior

The four output sam files are identical in their read content (containing reads 1 and 6).

Additional comments

I tried to follow the recommendation to first post on the forum, but clicking on the 'Sign in' tab takes me to this page where I cannot see any option to sign in or create a new profile.
And here is the help for those two options, for convenience:

--USE_END_IN_UNPAIRED_READS <Boolean>
                              Make the end location of single end read be significant when considering duplicates, in
                              addition to the start location, which is always significant (i.e. require single-ended
                              reads to start andend on the same position to be considered duplicate) (for this argument,
                              "read end" means 3' end).  Default value: false. Possible values: {true, false} 
--USE_UNPAIRED_CLIPPED_END <Boolean>
                              Use position of the clipping as the end position, when considering duplicates (or use the
                              unclipped end position) (for this argument, "read end" means 3' end).  Default value:
                              false. Possible values: {true, false}

The text was updated successfully, but these errors were encountered:

kockan · 2024-04-17T21:13:42Z

Looks like those options are specific to flow-based reads, therefore they will not be used at all unless you also set --FLOW_MODE true in your command(s).

I've tried this with the example alignments you've provided and while it did run successfully, I'd recommend extra caution if you ever decide to use these options with non-flow-based single-end reads, since I'd assume they were not intended for such use.

ChrisHIV · 2024-04-17T21:53:25Z

Thanks. Indeed using --FLOW_MODE on that test data, --USE_END_IN_UNPAIRED_READS changes which reads are duplicates. However, --USE_UNPAIRED_CLIPPED_END still has no effect and it's unclear to me to why: the description suggests that it toggles whether clipped ends are included or excluded, which I think should affect where reads are deemed to start and end, and thus affect which are considered duplicates.

Maybe helpful to add to the two help messages that --FLOW_MODE must be used? Also good to make the help clearer which of true/false corresponds to inclusion/exclusion of clipped ends for --USE_UNPAIRED_CLIPPED_END.

kockan · 2024-04-22T15:52:45Z

I agree with adding clarification(s) to the documentation. I'll try getting more details on whether these options are safe to be used with non-flow single-end reads before closing the issue.

kachulis · 2024-04-22T15:58:30Z

perhaps there should be some error if those options are used without --FLOW_MODE turned on?

kockan · 2024-04-22T16:01:55Z

Absolutely, if these are truly only meant to be used with flow-based reads.

kockan · 2024-04-22T17:10:01Z

@meganshand @ilyasoifer tagging for opinions on this (and potentially similar issues with other options in this/other tools)

ilyasoifer · 2024-04-23T06:19:03Z

@kockan - thanks! We will discuss and propose how to best deal with this.

ilyasoifer · 2024-05-03T14:29:53Z

@meganshand and @kockan, @ChrisHIV - unless you see a clear use case for single end reads (that are not of a constant length in bases) we will update the help string as suggested by the issue to indicate that it should only be used for flow reads

kockan · 2024-05-03T17:22:31Z

@ilyasoifer Sounds reasonable to me. One small additional request: if these options are set in the command-line arguments without --FLOW_MODE, we should throw an exception (edit: decided an exception would be better than a warning after some initial thought)

droazen · 2024-09-10T17:23:19Z

#1976 was merged, which adds a FLOW_ prefix to these options, but to close out this ticket we also need a check that flow mode is active when these options are specified.

ilyasoifer · 2024-10-04T12:22:39Z

@dror27 - can you please address the last comment so we can close this?

dror27 · 2024-10-13T08:27:36Z

@dror27 - can you please address the last comment so we can close this?

Pull request created with requested change:
#1980

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`--USE_END_IN_UNPAIRED_READS` and `--USE_UNPAIRED_CLIPPED_END` options for `picard MarkDuplicates` have no effect #1959

`--USE_END_IN_UNPAIRED_READS` and `--USE_UNPAIRED_CLIPPED_END` options for `picard MarkDuplicates` have no effect #1959

ChrisHIV commented Apr 16, 2024 •

edited

Loading

kockan commented Apr 17, 2024

ChrisHIV commented Apr 17, 2024

kockan commented Apr 22, 2024

kachulis commented Apr 22, 2024

kockan commented Apr 22, 2024

kockan commented Apr 22, 2024

ilyasoifer commented Apr 23, 2024

ilyasoifer commented May 3, 2024 •

edited

Loading

kockan commented May 3, 2024 •

edited

Loading

droazen commented Sep 10, 2024

ilyasoifer commented Oct 4, 2024

dror27 commented Oct 13, 2024

--USE_END_IN_UNPAIRED_READS and --USE_UNPAIRED_CLIPPED_END options for picard MarkDuplicates have no effect #1959

--USE_END_IN_UNPAIRED_READS and --USE_UNPAIRED_CLIPPED_END options for picard MarkDuplicates have no effect #1959

Comments

ChrisHIV commented Apr 16, 2024 • edited Loading

Bug Report

Affected tool(s)

Affected version(s)

Description

Steps to reproduce

Expected behavior

Actual behavior

Additional comments

kockan commented Apr 17, 2024

ChrisHIV commented Apr 17, 2024

kockan commented Apr 22, 2024

kachulis commented Apr 22, 2024

kockan commented Apr 22, 2024

kockan commented Apr 22, 2024

ilyasoifer commented Apr 23, 2024

ilyasoifer commented May 3, 2024 • edited Loading

kockan commented May 3, 2024 • edited Loading

droazen commented Sep 10, 2024

ilyasoifer commented Oct 4, 2024

dror27 commented Oct 13, 2024

`--USE_END_IN_UNPAIRED_READS` and `--USE_UNPAIRED_CLIPPED_END` options for `picard MarkDuplicates` have no effect #1959

`--USE_END_IN_UNPAIRED_READS` and `--USE_UNPAIRED_CLIPPED_END` options for `picard MarkDuplicates` have no effect #1959

ChrisHIV commented Apr 16, 2024 •

edited

Loading

ilyasoifer commented May 3, 2024 •

edited

Loading

kockan commented May 3, 2024 •

edited

Loading