umi in 5 end of both R1 and R2 #477

worker000000 · 2021-06-16T13:55:54Z

Dear professor.
Thanks a lot for your powerful tool
1 In paired-end mode, it will ignore the UMI of the second read. so will it affect the accuracy of data, such as false positive
variants just in one strand, why not use both umi, is there any inner reason,

<>
<>
2 my umi is 5 base umi, it is in the 5 end of reads1 and reads2, the first base of umi is always low quality, so it needs to be removed, the last base of umi is a constant base(which is for T/A ligation)

I tried to use umi_tools extract like this
umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz
but the header for the mate read in reads1 and reads is like such
<>
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT
<>
which is not what I expected, in umi_tools, where C = cell barcode, N = umi, P = plate, X=read sequence, is there any error of my command
<>
<>

<>
<>
3 I want to know if umi_tools is suitable for umi in both reads, because tools like gencore , UMICollapse can not handle umi in both reads, which may arise false postive variant in one strand?

TomSmithCGAT · 2021-06-16T14:17:11Z

Hello.

In future, please be clear when you are cross posting. Especially if you dive straight mid way into another issue. @IanSudbery - This is a cross-post from: Daniel-Liu-c0deb0t/UMICollapse#10

As per @Daniel-Liu-c0deb0t's replies, no professors here either...

I think the confusion above stems from the format of the header post extraction. For paired end reads, the barcodes (Cell or UMI) are concatenated. So in the above example, the cell is TT (read1) + CT (read2) and the UMI is ACC (read1) + TAA (read2).

However, I'm not sure that's what you want.You say:

my umi is 5 base umi, it is in the 5 end of reads1 and reads2, the first base of umi is always low quality, so it needs to be removed, the last base of umi is a constant base(which is for T/A ligation)

It sounds like you don't want to treat the first and last bases in the pattern as 'cell barcodes', as you have done, but rather bases to be skipped. In which case, your patterns should be XNNNX. See here for more details on barcode patterns.

IanSudbery · 2021-06-16T14:17:19Z

UMI-tools should work perfectly fine with your read design, and your command is correct as far as I'm aware. The only thing I would probably do is discard your first and last bases rather than make them the CB. So, both barcodes would be "XNNNX".

What would you expect from your command if it was doing what you thought it would?

TomSmithCGAT · 2021-06-16T14:19:25Z

+1 to Ian for succinctness. -1 for tardiness 😉

worker000000 · 2021-06-16T14:45:01Z

@IanSudbery @TomSmithCGAT
Thanks a lot, I am sorry for post in 2 repo.
my data is for tumor ngs dna sequence, not single cell
after using tools like fastp, I should have remove adapter in my reads, so next step is to extract umi, I want to exttract the first 5 base(or to to remove from reads first), then the inner three base is treated as umi to be added into the header, this is what I want.
<>

<>
origin reads1
@A00582:632:H7F23DSX2:3:1101:4399:1251 1:N:0:GCAGCTGT+GCTCTAGT
TACCTCGCCATCCTGTCCAATGAGCATGGCTCCTACAGGTACACGGAGTTCCTGACGGGCCTGGGCCGGCTCATCGAGCTGAAGGACTGCCAGCCGGACAAGGTGTACCTGGGAGGCCTGGACGTGTGTGGTGAGGACGGCCAGTTCACC

origin reads2
@A00582:632:H7F23DSX2:3:1101:4399:1251 2:N:0:GCAGCTGT+GCTCTAGT
CTAATGGGTAGGCGCCAGGCCGTACCTTGCATGATGTCATCGTGCCAGCAGTAGGTGAACTGGCCGTCCTCACCACACACGTCCAGGCCTCCCAGGTACACCTTGTCCGGCTGGCAGTCCTTCAGCTCGATGAGCCGGCCCAGGCCCGTC

I tried to use umi_tools extract like this
umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz
but the header for the mate read in reads1 and reads is like such
<>
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT

worker000000 · 2021-06-16T14:45:46Z

I want to know if umi_tools is suitable for umi in both reads, because tools like gencore , UMICollapse can not handle umi in both reads, which may arise false postive variant in one strand?
Thanks a lot

IanSudbery · 2021-06-16T14:53:53Z

Yes UMI-tools can handle UMIs in both reads.

I'm sorry, I still not quite sure what the problem is:

The command you have run HAS taken the 3 base UMI from each read and added it to the read header.

worker000000 · 2021-06-16T15:01:03Z

Thanks a lot
because X is sequence, whci maybe T(template) like command ExtractUmisFromBam, so I am a little confused abot C and X
what is TTCT, I used to treat TTCT as umi from reads1, and base after seperator _ as umi from reads2,
when umi_tools do umi consensus, does it just use base after seperator _, does the bases before seperator _ has any useage?

IanSudbery · 2021-06-16T15:12:16Z

Bases identified by "C" in the barcode are treated as Cell Barcode. Bases identified as N are treated as UMI, bases identified as X are treated as neither and are left on the read. If a cell barcode is not specified the header will be:

barcode = XNNNX
@header
XNNNXATGGATAGAGATACAG

becomes

@header_NNN
XXATGGATAGAGATACAG

With Cs in the pattern, the following happens:

barcode = CNNNC
@header
CNNNCATGGATAGAGATACAG

becomes

@header_CC_NNN
ATGGATAGAGATACAG

I'm only showing one read, but the same logic applies to pairs.

UMI-tools only uses the last part of the header (NNN) in the UMI-deduplication process. The CC will be ignored in deduplication unless UMI-tools is run with the --per-cell flag.

worker000000 · 2021-06-16T15:20:05Z

Thanks a lot, so both CNNNC or XNNNX works the same for me, am I right?

https://github.com/CGATOxford/UMI-tools/blob/master/doc/QUICK_START.md

1 I have another important question, since error corection is an important part of umi, so function group in umi_tools is a seperated part for this? or it is wrapped in function dedup?

or should do as following
extrract , aligin, sort, group, dedup?

2 in the link, you used bowtie, is it a better choice than bowtie2 or bwa for dna sequence?

IanSudbery · 2021-06-16T15:22:42Z

If you run dedup, you don't need to run group - dedup does everything group does, and then selects one UMI from each UMI group.
The aligner makes no difference to UMI-tools, use whichever works best for your application - bowtie2 and bwa should both work fine.

worker000000 · 2021-06-16T15:26:10Z

Thanks a lot
1 so command group is just for the stastic of read groups and read identifiers?
2 I am here still want to confrim with you, both CNNNC or XNNNX works the same for me, am I right?

IanSudbery · 2021-06-16T15:36:11Z

People use the group command when they want to do other things with the reads downstream than just deduplicate them. For example, perhaps they want to calculate consensus sequences of the reads, or look at allelic ratios. The dedup command just outputs one read per UMI group, the group command outputs all reads in the orignal BAM file, with corrected UMI sequences, so that the user can do whatever they want.
XNNNX will return the XX to the ends of your reads. If you use that you will either have to configure your aligner to allow that, or trim them using a trimmer.

worker000000 · 2021-06-16T16:22:17Z

Thanks a lot,
If I do group, and sent the output bam to dedup, will this get unexpected result?

IanSudbery · 2021-06-16T16:24:19Z

If you send the output of group to dedup, you will get exactly the same result as if you sent the input directly to dedup without using group.

…

On Wed, 16 Jun 2021 at 17:22, worker000000 ***@***.***> wrote: Thanks a lot, If I do group, and sent the output bam to dedup, will this get unexpected result? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#477 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABJELDSS4QPGOVWHT2I67UTTTDFUTANCNFSM46ZQGF4A> .

worker000000 · 2021-06-16T16:34:03Z

thanks a lot.
have you ever compare single umi and pair end umi result, becasue tools like gencore or umicollapse does not hanle pair-end data, but it is fast enough, umi-tools can do pair-end sequence, but not fast as them, people sometimes come with trade-offs, hard to choose

IanSudbery · 2021-06-16T16:39:00Z

We've never done any rigorous benchmarking, but it stands to reason that longer UMI sequences = more accuracy. I'm not entirely sure what the reasoning is for one 3bp UMI on each read being different from a 6nt UMI on just one of them though. BTW, I wouldn't have thought you'll have much trouble, run-time wise with a 6nt UMI - I suspect that even if all UMIs are used, UMI-tools still shouldn't struggle because there are only 4096 possible UMIs.

…

worker000000 · 2021-06-17T02:03:14Z

Thanks a lot for your kind and fast reply.
I am here have another question about the arfument。 because --read2-in just described as Filename for read pairs, but not saying specific for reads2.

there are 6 arguments
--bc-pattern, --bc-pattern2， --read2-in， --read2-out, -I, -S

does it has a One to one match?
--bc-pattern, -I, -S for reads1
--bc-pattern2, --read2-in， --read2-out for reads2

2 if there is only single umi in reads1 or reads2, does it mean I just need to clarify one of the --bc-pattern,
for example,
if umi just in reads1,

umi_tools extract -I $fq1 --bc-pattern=CNNNC --read2-in=$fq2 --stdout=${sampleID}.R1.umi.fq.gz --read2-out=${sampleID}.R2.umi.fq.gz

if umi just in reads2,

umi_tools extract -I $fq1 --bc-pattern2=CNNNC --read2-in=$fq2 --stdout=${sampleID}.R1.umi.fq.gz --read2-out=${sampleID}.R2.umi.fq.gz

does the deafault value of --bc-pattern2 --bc-pattern is NONE?

worker000000 · 2021-06-17T02:44:20Z

1 should I trim adapter before I deal with umi(like tools trimmomatic or fastp or cutadpters), I guess it is a must, but I want to confirm it with you,

2 tools like umi_tools just remove the reads not needed intead of just adding duplicated tag , like tools samtools markdup or picard markdup?

worker000000 · 2021-06-17T15:37:01Z

because I also use ngs data for cnv calling, do you think umi data is needed for cnv(for example the dedup function in umitools), I think I just need to extract the umi data, but not needed to do consensus reads, how do you think ?

IanSudbery · 2021-06-21T12:06:06Z

I agree

…

On Thu, 17 Jun 2021 at 16:37, worker000000 ***@***.***> wrote: because I also use ngs data for cnv calling, do you think umi data is needed for cnv(for example the dedup function in umitools), I think I just need to extract the umi data, but not needed to do consensus reads, how do you think ? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#477 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABJELDQHHIYTZFMV6SUUBYLTTIJCRANCNFSM46ZQGF4A> .

worker000000 · 2021-06-21T14:25:02Z

@IanSudbery
Thanks a lot
why you agree that cnv is not needed to do consensus reads

can you share your opion about my question

1 should I trim adapter before I deal with umi(like tools trimmomatic or fastp or cutadpters), I guess it is a must, but I want to confirm it with you,

2 tools like umi_tools just remove the reads not needed intead of just adding duplicated tag , like tools samtools markdup or picard markdup?

worker000000 mentioned this issue Jun 21, 2021

can umicollapse be used for single umi and duplex umi data Daniel-Liu-c0deb0t/UMICollapse#10

Open

TomSmithCGAT closed this as completed Feb 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

umi in 5 end of both R1 and R2 #477

umi in 5 end of both R1 and R2 #477

worker000000 commented Jun 16, 2021 •

edited

Loading

TomSmithCGAT commented Jun 16, 2021 •

edited

Loading

IanSudbery commented Jun 16, 2021

TomSmithCGAT commented Jun 16, 2021

worker000000 commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021 via email

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021 via email

worker000000 commented Jun 17, 2021 •

edited

Loading

worker000000 commented Jun 17, 2021 •

edited

Loading

worker000000 commented Jun 17, 2021

IanSudbery commented Jun 21, 2021 via email

worker000000 commented Jun 21, 2021 •

edited

Loading

umi in 5 end of both R1 and R2 #477

umi in 5 end of both R1 and R2 #477

Comments

worker000000 commented Jun 16, 2021 • edited Loading

TomSmithCGAT commented Jun 16, 2021 • edited Loading

IanSudbery commented Jun 16, 2021

TomSmithCGAT commented Jun 16, 2021

worker000000 commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021 via email

worker000000 commented Jun 16, 2021

IanSudbery commented Jun 16, 2021 via email

worker000000 commented Jun 17, 2021 • edited Loading

worker000000 commented Jun 17, 2021 • edited Loading

worker000000 commented Jun 17, 2021

IanSudbery commented Jun 21, 2021 via email

worker000000 commented Jun 21, 2021 • edited Loading

worker000000 commented Jun 16, 2021 •

edited

Loading

TomSmithCGAT commented Jun 16, 2021 •

edited

Loading

worker000000 commented Jun 17, 2021 •

edited

Loading

worker000000 commented Jun 17, 2021 •

edited

Loading

worker000000 commented Jun 21, 2021 •

edited

Loading