Can UMICollapse be used for single-UMI and duplex-UMI data? #10

worker000000 · 2021-06-15T16:02:14Z

Dear professor,
Thanks a lot for making such a great tool. Can it be used for single-UMI and duplex-UMI data?

worker000000 · 2021-06-15T16:23:56Z

We usually have FASTQ data, and I saw that this tool supports both FASTQ and BAM modes.

--merge: method for identifying which UMI to keep out of every two UMIs. Either any, avgqual, or mapqual. Default: mapqual for SAM/BAM mode, avgqual for FASTQ mode.

Can you share a full command that uses this flag?

--paired: use paired-end mode, which deduplicates pairs of reads from a SAM/BAM file. The template length of each read pair, along with the alignment coordinate and UMI of the forwards read, are used to deduplicate read pairs. This is very memory intensive, and the input SAM/BAM files should be sorted. Default: false (single-end).

How should I prepare this BAM? Should I first extract the UMIs from the FASTQ, then run bwa mem and sort? Or just run bwa mem and sort?

Daniel-Liu-c0deb0t · 2021-06-15T23:07:52Z

FASTQ and BAM data are processed differently.

FASTQ data is deduplicated based on the entire read. This mode does not support paired-end reads. This mode is used to deduplicate data without having to align to a reference. Aligning first is time-consuming, but it may give better results.

Here is an example of the merge flag:

./umicollapse fastq -i input.fastq -o output.fastq --merge avgqual

However, this is redundant because by default avgqual is used for FASTQ mode. This means that the read with the highest average quality score is the only one that is output when collapsing a group of multiple reads.
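
As a rough illustration of what avgqual means, the sketch below picks, from a group of reads that were collapsed together, the one with the highest mean Phred score. This is not UMICollapse's actual code; the function names and the Phred+33 encoding are assumptions made for the example.

```python
def average_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def pick_representative(reads):
    """Return the (name, seq, qual) read with the highest average quality."""
    return max(reads, key=lambda read: average_quality(read[2]))


# Two reads that would be collapsed into one group (hypothetical data).
group = [
    ("read_a", "ACGT", "IIII"),  # Phred 40 at every base
    ("read_b", "ACGT", "####"),  # Phred 2 at every base
]
best = pick_representative(group)
```

Here `best` is the `read_a` record, since its average quality is higher; the other read in the group would not be written to the output.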

For BAM mode, two major steps have to be done before running UMICollapse:

1. Extract the UMIs from the reads in FASTQ format and add them to the read headers.
2. Align the reads to get a BAM file.

UMICollapse does not do these two steps. You should follow the instructions here: https://umi-tools.readthedocs.io/en/latest/QUICK_START.html
The only difference is using UMICollapse instead of UMI-tools. UMICollapse should be much faster than UMI-tools, but it should produce very similar results.

Let me know if you have any other questions. Also I'm not a professor.

worker000000 · 2021-06-16T01:40:17Z

Thanks a lot for your quick and helpful reply. I have some other questions.
1. Can UMICollapse be used for single-UMI and duplex-UMI data?
2. How does the autodetect mode work, and is it accurate enough?
3. My design is a duplex UMI (both reads carry a UMI). The first base of the UMI has poor sequencing quality, so we ignore it. The next three bases are the UMI itself, and the base after that is a T (for T/A ligation). Can I use autodetect mode?
4. UMI-tools uses bowtie. Is that better than bowtie2 or bwa?

Daniel-Liu-c0deb0t · 2021-06-16T04:36:24Z

For the first 3 questions:
UMICollapse can handle single UMIs. In paired-end mode, it will ignore the UMI of the second read.

How the UMI is preprocessed is not handled by UMICollapse. You will have to extract the UMIs from the reads, remove the first base, and put this cleaned up UMI in the read header before alignment. The UMI in the header is what is used by UMICollapse. The only thing that is autodetected is the length of this UMI in the header. UMI-tools provides a way to extract UMIs, ignore bases, and put them in the header, based on a certain pattern.
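
A minimal sketch of that preprocessing step for a single read, assuming the layout described in the question (one low-quality base, three UMI bases, one constant T) and a `_` separator in the header. The function name and its parameters are hypothetical; UMI-tools or any other extractor can do the same job.

```python
def move_umi_to_header(header, seq, qual, skip=1, umi_len=3, trailing=1, sep="_"):
    """Cut the UMI prefix off a read and append the UMI to the read ID.

    skip:     leading bases to discard (the low-quality first base)
    umi_len:  number of informative UMI bases
    trailing: constant bases after the UMI (the T used for T/A ligation)
    """
    prefix_len = skip + umi_len + trailing
    if len(seq) < prefix_len:
        raise ValueError("read is shorter than the UMI prefix")
    umi = seq[skip:skip + umi_len]
    # Keep any FASTQ comment (text after the first space) after the new ID.
    read_id, _, comment = header.partition(" ")
    new_header = f"{read_id}{sep}{umi}" + (f" {comment}" if comment else "")
    return new_header, seq[prefix_len:], qual[prefix_len:]


# Hypothetical read: N (ignored), ACG (UMI), T (ligation base), then the insert.
new_header, new_seq, new_qual = move_umi_to_header(
    "@r1 1:N:0:GCAG", "NACGTGGGCCC", "#IIIIIIIIII"
)
```

The UMI ends up at the end of the read ID, and the sequence and quality strings are trimmed by the same amount so they stay the same length.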

For the fourth question, newer tools are probably better, but I'm not sure.

worker000000 · 2021-06-16T13:52:32Z

Thanks a lot for your kind and fast reply.
1. You said that in paired-end mode it will ignore the UMI of the second read. Will that affect the accuracy of the data, for example by producing false-positive variants that appear on only one strand? Why not use both UMIs? Is there an underlying reason?

2. My UMI is a 5-base UMI at the 5' end of both read 1 and read 2. The first base of the UMI is always low quality, so it needs to be removed, and the last base of the UMI is a constant base (for T/A ligation).

I tried to use umi_tools extract like this:
umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz
but the headers of the mate reads in read 1 and read 2 look like this:
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT
which is not what I expected. In umi_tools, C = cell barcode, N = UMI, P = plate, X = read sequence. Is there an error in my command?

Daniel-Liu-c0deb0t · 2021-06-16T21:21:44Z

It seems that 2 is answered by the excellent maintainers behind UMI-tools.

For 1, UMICollapse was originally created as a proof of concept for better algorithms for deduplicating many, many UMIs. This meant that not all features were implemented, only the most important ones with single-end sequences. Later, more features were added in response to user requests, but it is still not as feature-complete as UMI-tools, which has existed for a long time. I would recommend UMICollapse only for cases where you encounter issues with other tools on massive datasets. I agree with the UMI-tools maintainers that with only 6bp UMIs, there wouldn't be a lot of UMIs to deduplicate.
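
To put a rough number on the 6bp point, the short calculation below counts how many distinct UMI sequences are possible. It assumes every position can be any of the four bases, which is an idealization; the variable names are only for illustration.

```python
BASES = 4  # A, C, G, T

# Three informative bases on one read, six when both reads are combined.
umis_per_read = BASES ** 3
umis_combined = BASES ** 6

print(f"3bp UMI: {umis_per_read} possible sequences")
print(f"6bp UMI: {umis_combined} possible sequences")
```

A space of a few thousand sequences is small compared with the datasets UMICollapse was designed for.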

For your case, if you really wanted to use UMICollapse, there is a workaround where you extract the UMIs from both reads and place them in the header of the first read, then deduplicate using paired end mode.

worker000000 · 2021-06-19T12:33:35Z

Thanks a lot. So I need to remove the UMI from read 2, is that right?

Daniel-Liu-c0deb0t · 2021-06-19T21:00:27Z

Ideally, you would remove the UMI from read2 and concatenate it to the UMI of read1 (to form a 6bp UMI) and place this UMI in the read1 FASTQ headers.

worker000000 · 2021-06-20T06:48:07Z

Is there any tool that can remove this efficiently? Thanks a lot.

Daniel-Liu-c0deb0t · 2021-06-21T02:01:26Z

Perhaps you can do it with UMI-tools? They have a way of extracting UMIs from read1 and read2 and putting them in the respective headers.

If you want to concatenate the UMIs and put them in the read1 header, then you may have to write a simple script to do it. I don't think UMI-tools can handle that.
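
Such a script could look roughly like the sketch below, which works on one read pair at a time. It assumes the 5-base layout from this thread (one ignored base, three UMI bases, one constant T) on both reads, and all function names are hypothetical. As a design choice, the combined UMI is appended to both headers so that the mate names stay identical, which paired-end aligners commonly expect; only the read1 copy matters for deduplication.

```python
def split_umi(seq, qual, skip=1, umi_len=3, trailing=1):
    """Return (umi, trimmed_seq, trimmed_qual) for one read."""
    prefix_len = skip + umi_len + trailing
    if len(seq) < prefix_len:
        raise ValueError("read is shorter than the UMI prefix")
    return seq[skip:skip + umi_len], seq[prefix_len:], qual[prefix_len:]


def tag_header(header, umi, sep="_"):
    """Append the UMI to the read ID, keeping any FASTQ comment."""
    read_id, _, comment = header.partition(" ")
    return f"{read_id}{sep}{umi}" + (f" {comment}" if comment else "")


def combine_pair_umis(read1, read2, sep="_"):
    """Move the UMIs of both mates into one concatenated UMI in the headers."""
    (h1, s1, q1), (h2, s2, q2) = read1, read2
    umi1, s1, q1 = split_umi(s1, q1)
    umi2, s2, q2 = split_umi(s2, q2)
    umi = umi1 + umi2  # 3bp + 3bp = 6bp
    return (tag_header(h1, umi, sep), s1, q1), (tag_header(h2, umi, sep), s2, q2)


# Hypothetical pair: each read starts with N + 3bp UMI + T.
r1 = ("@pair1 1:N:0", "NACGTGGGG", "#IIIIIIII")
r2 = ("@pair1 2:N:0", "NTTATCCCC", "#IIIIIIII")
out1, out2 = combine_pair_umis(r1, r2)
```

Reading and writing the gzipped FASTQ files is left out; the records can come from any FASTQ parser as long as both files are read in the same order.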

worker000000 · 2021-06-21T03:12:15Z

Thanks a lot. Do you mean doing it as follows?

umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I 28_1.fq.gz -S R1_TMP_umitools.fq.gz --read2-in=28_2.fq.gz --read2-out=R2_TMP_umitools.fq.gz

Delete the cell barcode and UMI from fq2:

zless R2_TMP_umitools.fq.gz | sed -r 's#(@[^_]+)_[^ ]+( 2:N:0)#\1\2#' | pigz - > n2.fq.gz

Run bwa and samtools, then:

./umicollapse bam -i paired_example.bam -o dedup_paired_example.bam --umi-sep _ --paired --two-pass

worker000000 · 2021-06-21T03:13:51Z

Could you have a look at my issue CGATOxford/UMI-tools#477? Thanks a lot.

Daniel-Liu-c0deb0t · 2021-06-21T03:35:45Z

Are you removing the UMIs in the FASTQ headers for read2? You do not need to do that. You only need to extract the UMI from the read sequences so it does not interfere with alignment (this was what I meant by "remove" in my previous comments; the UMIs need to be removed from the sequences, but not the headers). UMICollapse simply ignores the UMIs that are in the headers of the read2 FASTQ files, so there is no need to remove them. Sorry, I was not very clear on this.

I hate to say this but I can't write your pipeline for you. I can only provide help related to this tool, so if you have more general concerns I suggest asking on biostars or something.

worker000000 · 2021-06-21T14:28:48Z

Thanks a lot. So you mean I should run it in paired-end mode.
When only the UMI of read 1 is used and many erroneous reads are removed from read 1, how are the mate reads in read 2 treated? I am curious about this.

Daniel-Liu-c0deb0t · 2021-06-21T22:40:51Z

When you pass in the --paired flag, any read1 that is removed will cause its corresponding read2 to be removed too. (Same behavior as UMI-tools)
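
The idea can be sketched as follows. This is a simplified illustration, not the tool's implementation: it groups pairs by reference, alignment coordinate, template length, and the read1 UMI, matches UMIs exactly, and keeps the pair with the highest mapping quality. The dictionary keys and function name are hypothetical.

```python
from collections import defaultdict


def dedup_pairs(pairs):
    """Keep one pair per (ref, pos, tlen, umi) group, preferring higher MAPQ."""
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair["ref"], pair["pos"], pair["tlen"], pair["umi"])
        groups[key].append(pair)
    return [max(group, key=lambda p: p["mapq"]) for group in groups.values()]


# Hypothetical read pairs, described by the read1 alignment and UMI.
pairs = [
    {"name": "p1", "ref": "chr1", "pos": 100, "tlen": 250, "umi": "ACGTTA", "mapq": 30},
    {"name": "p2", "ref": "chr1", "pos": 100, "tlen": 250, "umi": "ACGTTA", "mapq": 60},
    {"name": "p3", "ref": "chr1", "pos": 100, "tlen": 300, "umi": "ACGTTA", "mapq": 20},
]
kept_names = {pair["name"] for pair in dedup_pairs(pairs)}

# Both mates share a read name, so dropping a pair drops read1 and read2 together.
records = [(name, mate) for name in ("p1", "p2", "p3") for mate in ("R1", "R2")]
surviving = [record for record in records if record[0] in kept_names]
```

In this example `p1` and `p2` fall into the same group, so both mates of `p1` are dropped, while `p3` survives because its template length differs.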

worker000000 · 2021-06-22T00:55:19Z

Thanks a lot for your helpful answer.

TomSmithCGAT mentioned this issue Jun 16, 2021

umi in 5 end of both R1 and R2 CGATOxford/UMI-tools#477

Closed
