Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

umi in 5 end of both R1 and R2 #477

Closed
worker000000 opened this issue Jun 16, 2021 · 21 comments
Closed

umi in 5 end of both R1 and R2 #477

worker000000 opened this issue Jun 16, 2021 · 21 comments

Comments

@worker000000
Copy link

worker000000 commented Jun 16, 2021

Dear professor.
Thanks a lot for your powerful tool
1 In paired-end mode, it will ignore the UMI of the second read. so will it affect the accuracy of data, such as false positive
variants just in one strand, why not use both umi, is there any inner reason,

<>
<>
2 my umi is 5 base umi, it is in the 5 end of reads1 and reads2, the first base of umi is always low quality, so it needs to be removed, the last base of umi is a constant base(which is for T/A ligation)

I tried to use umi_tools extract like this
umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz
but the header for the mate read in reads1 and reads is like such
<>
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT
<>
which is not what I expected, in umi_tools, where C = cell barcode, N = umi, P = plate, X=read sequence, is there any error of my command
<>
<>
image

<>
<>
3 I want to know if umi_tools is suitable for umi in both reads, because tools like gencore , UMICollapse can not handle umi in both reads, which may arise false postive variant in one strand?

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Jun 16, 2021

Hello.

In future, please be clear when you are cross posting. Especially if you dive straight mid way into another issue. @IanSudbery - This is a cross-post from: Daniel-Liu-c0deb0t/UMICollapse#10

As per @Daniel-Liu-c0deb0t's replies, no professors here either...

I think the confusion above stems from the format of the header post extraction. For paired end reads, the barcodes (Cell or UMI) are concatenated. So in the above example, the cell is TT (read1) + CT (read2) and the UMI is ACC (read1) + TAA (read2).

However, I'm not sure that's what you want.You say:

my umi is 5 base umi, it is in the 5 end of reads1 and reads2, the first base of umi is always low quality, so it needs to be removed, the last base of umi is a constant base(which is for T/A ligation)

It sounds like you don't want to treat the first and last bases in the pattern as 'cell barcodes', as you have done, but rather bases to be skipped. In which case, your patterns should be XNNNX. See here for more details on barcode patterns.

@IanSudbery
Copy link
Member

UMI-tools should work perfectly fine with your read design, and your command is correct as far as I'm aware. The only thing I would probably do is discard your first and last bases rather than make them the CB. So, both barcodes would be "XNNNX".

What would you expect from your command if it was doing what you thought it would?

@TomSmithCGAT
Copy link
Member

+1 to Ian for succinctness. -1 for tardiness 😉

@worker000000
Copy link
Author

@IanSudbery @TomSmithCGAT
Thanks a lot, I am sorry for post in 2 repo.
my data is for tumor ngs dna sequence, not single cell
after using tools like fastp, I should have remove adapter in my reads, so next step is to extract umi, I want to exttract the first 5 base(or to to remove from reads first), then the inner three base is treated as umi to be added into the header, this is what I want.
<>

<>
origin reads1
@A00582:632:H7F23DSX2:3:1101:4399:1251 1:N:0:GCAGCTGT+GCTCTAGT
TACCTCGCCATCCTGTCCAATGAGCATGGCTCCTACAGGTACACGGAGTTCCTGACGGGCCTGGGCCGGCTCATCGAGCTGAAGGACTGCCAGCCGGACAAGGTGTACCTGGGAGGCCTGGACGTGTGTGGTGAGGACGGCCAGTTCACC

origin reads2
@A00582:632:H7F23DSX2:3:1101:4399:1251 2:N:0:GCAGCTGT+GCTCTAGT
CTAATGGGTAGGCGCCAGGCCGTACCTTGCATGATGTCATCGTGCCAGCAGTAGGTGAACTGGCCGTCCTCACCACACACGTCCAGGCCTCCCAGGTACACCTTGTCCGGCTGGCAGTCCTTCAGCTCGATGAGCCGGCCCAGGCCCGTC

I tried to use umi_tools extract like this
umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz
but the header for the mate read in reads1 and reads is like such
<>
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT

@worker000000
Copy link
Author

I want to know if umi_tools is suitable for umi in both reads, because tools like gencore , UMICollapse can not handle umi in both reads, which may arise false postive variant in one strand?
Thanks a lot

@IanSudbery
Copy link
Member

Yes UMI-tools can handle UMIs in both reads.

I'm sorry, I still not quite sure what the problem is:

The command you have run HAS taken the 3 base UMI from each read and added it to the read header.
image

@worker000000
Copy link
Author

Thanks a lot
because X is sequence, whci maybe T(template) like command ExtractUmisFromBam, so I am a little confused abot C and X
what is TTCT, I used to treat TTCT as umi from reads1, and base after seperator _ as umi from reads2,
when umi_tools do umi consensus, does it just use base after seperator _, does the bases before seperator _ has any useage?

@IanSudbery
Copy link
Member

Bases identified by "C" in the barcode are treated as Cell Barcode. Bases identified as N are treated as UMI, bases identified as X are treated as neither and are left on the read. If a cell barcode is not specified the header will be:

barcode = XNNNX
@header
XNNNXATGGATAGAGATACAG

becomes

@header_NNN
XXATGGATAGAGATACAG

With Cs in the pattern, the following happens:

barcode = CNNNC
@header
CNNNCATGGATAGAGATACAG

becomes

@header_CC_NNN
ATGGATAGAGATACAG

I'm only showing one read, but the same logic applies to pairs.

UMI-tools only uses the last part of the header (NNN) in the UMI-deduplication process. The CC will be ignored in deduplication unless UMI-tools is run with the --per-cell flag.

@worker000000
Copy link
Author

Thanks a lot, so both CNNNC or XNNNX works the same for me, am I right?

https://github.com/CGATOxford/UMI-tools/blob/master/doc/QUICK_START.md

1 I have another important question, since error corection is an important part of umi, so function group in umi_tools is a seperated part for this? or it is wrapped in function dedup?

or should do as following
extrract , aligin, sort, group, dedup?

2 in the link, you used bowtie, is it a better choice than bowtie2 or bwa for dna sequence?

@IanSudbery
Copy link
Member

  1. If you run dedup, you don't need to run group - dedup does everything group does, and then selects one UMI from each UMI group.
  2. The aligner makes no difference to UMI-tools, use whichever works best for your application - bowtie2 and bwa should both work fine.

@worker000000
Copy link
Author

Thanks a lot
1 so command group is just for the stastic of read groups and read identifiers?
2 I am here still want to confrim with you, both CNNNC or XNNNX works the same for me, am I right?

@IanSudbery
Copy link
Member

  1. People use the group command when they want to do other things with the reads downstream than just deduplicate them. For example, perhaps they want to calculate consensus sequences of the reads, or look at allelic ratios. The dedup command just outputs one read per UMI group, the group command outputs all reads in the orignal BAM file, with corrected UMI sequences, so that the user can do whatever they want.
  2. XNNNX will return the XX to the ends of your reads. If you use that you will either have to configure your aligner to allow that, or trim them using a trimmer.

@worker000000
Copy link
Author

Thanks a lot,
If I do group, and sent the output bam to dedup, will this get unexpected result?

@IanSudbery
Copy link
Member

IanSudbery commented Jun 16, 2021 via email

@worker000000
Copy link
Author

thanks a lot.
have you ever compare single umi and pair end umi result, becasue tools like gencore or umicollapse does not hanle pair-end data, but it is fast enough, umi-tools can do pair-end sequence, but not fast as them, people sometimes come with trade-offs, hard to choose

@IanSudbery
Copy link
Member

IanSudbery commented Jun 16, 2021 via email

@worker000000
Copy link
Author

worker000000 commented Jun 17, 2021

Thanks a lot for your kind and fast reply.
I am here have another question about the arfument。 because --read2-in just described as Filename for read pairs, but not saying specific for reads2.

there are 6 arguments
--bc-pattern, --bc-pattern2, --read2-in, --read2-out, -I, -S

does it has a One to one match?
--bc-pattern, -I, -S for reads1
--bc-pattern2, --read2-in, --read2-out for reads2

2 if there is only single umi in reads1 or reads2, does it mean I just need to clarify one of the --bc-pattern,
for example,
if umi just in reads1,

umi_tools extract -I $fq1 --bc-pattern=CNNNC --read2-in=$fq2 --stdout=${sampleID}.R1.umi.fq.gz --read2-out=${sampleID}.R2.umi.fq.gz

if umi just in reads2,

umi_tools extract -I $fq1 --bc-pattern2=CNNNC --read2-in=$fq2 --stdout=${sampleID}.R1.umi.fq.gz --read2-out=${sampleID}.R2.umi.fq.gz

does the deafault value of --bc-pattern2 --bc-pattern is NONE?

@worker000000
Copy link
Author

worker000000 commented Jun 17, 2021

1 should I trim adapter before I deal with umi(like tools trimmomatic or fastp or cutadpters), I guess it is a must, but I want to confirm it with you,

2 tools like umi_tools just remove the reads not needed intead of just adding duplicated tag , like tools samtools markdup or picard markdup?

@worker000000
Copy link
Author

because I also use ngs data for cnv calling, do you think umi data is needed for cnv(for example the dedup function in umitools), I think I just need to extract the umi data, but not needed to do consensus reads, how do you think ?

@IanSudbery
Copy link
Member

IanSudbery commented Jun 21, 2021 via email

@worker000000
Copy link
Author

worker000000 commented Jun 21, 2021

@IanSudbery
Thanks a lot
why you agree that cnv is not needed to do consensus reads

can you share your opion about my question

1 should I trim adapter before I deal with umi(like tools trimmomatic or fastp or cutadpters), I guess it is a must, but I want to confirm it with you,

2 tools like umi_tools just remove the reads not needed intead of just adding duplicated tag , like tools samtools markdup or picard markdup?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants