umi in 5 end of both R1 and R2 #477
Hello. In future, please be clear when you are cross-posting, especially if you dive straight midway into another issue.
@IanSudbery - This is a cross-post from: Daniel-Liu-c0deb0t/UMICollapse#10
As per @Daniel-Liu-c0deb0t's replies, no professors here either...
I think the confusion above stems from the format of the header post extraction. For paired-end reads, the barcodes (Cell or UMI) are concatenated. So in the above example, the cell is TT (read1) + CT (read2) and the UMI is ACC (read1) + TAA (read2). However, I'm not sure that's what you want. You say:
It sounds like you don't want to treat the first and last bases in the pattern as 'cell barcodes', as you have done, but rather bases to be skipped. In which case, your patterns should be |
UMI-tools should work perfectly fine with your read design, and your command is correct as far as I'm aware. The only thing I would probably do is discard your first and last bases rather than make them the CB. So, both barcodes would be "XNNNX". What would you expect from your command if it was doing what you thought it would? |
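The concatenation described in this thread can be illustrated with a small Python sketch. This is a toy model, not UMI-tools code: the function names and the read sequences are hypothetical (only the first five bases of each read are chosen to match the header in the question), and the `name_UMI` layout used when no cell barcode is present is an assumption about the header format.

```python
def apply_pattern(seq, pattern):
    """Split a read using a string pattern such as CNNNC or XNNNX.

    C bases go to the cell barcode, N bases to the UMI, and X bases
    belong to neither and stay on the read.
    Returns (cell, umi, remaining_read).
    """
    prefix, rest = seq[:len(pattern)], seq[len(pattern):]
    cell = "".join(b for b, p in zip(prefix, pattern) if p == "C")
    umi = "".join(b for b, p in zip(prefix, pattern) if p == "N")
    kept = "".join(b for b, p in zip(prefix, pattern) if p == "X")
    return cell, umi, kept + rest


def paired_header(name, read1, read2, pattern1, pattern2):
    """Build the read name for a pair by concatenating read1 + read2 barcodes."""
    cell1, umi1, _ = apply_pattern(read1, pattern1)
    cell2, umi2, _ = apply_pattern(read2, pattern2)
    cell, umi = cell1 + cell2, umi1 + umi2
    # Assumed layout: name_CELL_UMI with a cell barcode, name_UMI without.
    return f"{name}_{cell}_{umi}" if cell else f"{name}_{umi}"


name = "@A00582:632:H7F23DSX2:3:1101:4399:1251"
read1 = "TACCT" + "GATTACA"   # hypothetical insert sequence after the 5 barcode bases
read2 = "CTAAT" + "CCGGAAT"   # hypothetical insert sequence after the 5 barcode bases

print(paired_header(name, read1, read2, "CNNNC", "CNNNC"))
print(paired_header(name, read1, read2, "XNNNX", "XNNNX"))
print(apply_pattern(read1, "XNNNX"))
```

With CNNNC on both reads, the sketch reproduces the `_TTCT_ACCTAA` suffix from the question. With XNNNX, only the six UMI bases end up in the name, and the two flanking bases remain at the start of each read.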
+1 to Ian for succinctness. -1 for tardiness 😉 |
@IanSudbery @TomSmithCGAT <> origin reads2 I tried to use umi_tools extract like this |
I want to know whether umi_tools is suitable for UMIs in both reads, because tools like gencore and UMICollapse cannot handle UMIs in both reads, which may give rise to false-positive variants on one strand. |
Thanks a lot |
Bases identified by "C" in the barcode are treated as Cell Barcode. Bases identified as N are treated as UMI, bases identified as X are treated as neither and are left on the read. If a cell barcode is not specified the header will be:
becomes
With Cs in the pattern, the following happens:
becomes
I'm only showing one read, but the same logic applies to pairs. UMI-tools only uses the last part of the header (NNN) in the UMI-deduplication process. The CC will be ignored in deduplication unless UMI-tools is run with the |
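As a rough illustration of which part of the read name matters for deduplication, the sketch below splits a name on underscores and builds a grouping key. It is not UMI-tools code: `split_read_name`, `dedup_key` and the `use_cell` switch are hypothetical names, with `use_cell` standing in for whatever option makes deduplication cell-aware. It also assumes the original read name contains no underscores of its own.

```python
def split_read_name(read_name, separator="_"):
    """Return (cell, umi) parsed from the end of a read name.

    The UMI is taken as the last separator-delimited field; the cell
    barcode, if present, is the field before it.
    """
    fields = read_name.split(separator)
    umi = fields[-1]
    cell = fields[-2] if len(fields) > 2 else None
    return cell, umi


def dedup_key(read_name, use_cell=False):
    """Key used to decide whether two reads at one position share a barcode."""
    cell, umi = split_read_name(read_name)
    # By default only the UMI counts; the cell barcode is ignored.
    return (cell, umi) if use_cell else (umi,)


a = "A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA"
b = "A00582:632:H7F23DSX2:3:1101:9999:2000_GGAA_ACCTAA"  # hypothetical second read

print(dedup_key(a) == dedup_key(b))
print(dedup_key(a, use_cell=True) == dedup_key(b, use_cell=True))
```

In this toy model the two reads share a key when only the UMI is considered, and get different keys once the cell field is included, which is why CNNNC and XNNNX give the same UMI for deduplication by default.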
Thanks a lot. So both CNNNC and XNNNX work the same for me, am I right?
Regarding https://github.com/CGATOxford/UMI-tools/blob/master/doc/QUICK_START.md:
1. I have another important question. Error correction is an important part of UMI processing, so is the group function in umi_tools a separate step for this, or is it wrapped into the dedup function? Or should I do as follows?
2. In the link you used bowtie. Is it a better choice than bowtie2 or bwa for DNA sequencing data? |
Thanks a lot |
Thanks a lot, |
If you send the output of group to dedup, you will get exactly the same
result as if you sent the input directly to dedup without using group.
…On Wed, 16 Jun 2021 at 17:22, worker000000 ***@***.***> wrote:
Thanks a lot,
If I do group, and sent the output bam to dedup, will this get unexpected
result?
thanks a lot. |
We've never done any rigorous benchmarking, but it stands to reason that
longer UMI sequences = more accuracy. I'm not entirely sure what the
reasoning is for one 3bp UMI on each read being different from a 6nt UMI on
just one of them though.
BTW, I wouldn't have thought you'll have much trouble, run-time wise with a
6nt UMI - I suspect that even if all UMIs are used, UMI-tools still
shouldn't struggle because there are only 4096 possible UMIs.
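The 4096 figure is simply 4^6. A short Python check, using a hypothetical helper `all_umis`, also shows that concatenating a 3-nt UMI from each read covers exactly the same sequence space as a single 6-nt UMI:

```python
from itertools import product


def all_umis(length, alphabet="ACGT"):
    """Enumerate every possible UMI of the given length."""
    return {"".join(p) for p in product(alphabet, repeat=length)}


single_6nt = all_umis(6)
three_nt = all_umis(3)
# One 3-nt UMI per read, concatenated read1 + read2.
paired_3nt = {r1 + r2 for r1 in three_nt for r2 in three_nt}

print(len(single_6nt))           # 4096
print(paired_3nt == single_6nt)  # True
```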
Thanks a lot for your kind and fast reply.
1. There are 6 arguments; do they have a one-to-one match?
2. If there is only a single UMI, in reads1 or in reads2, does it mean I just need to specify one of the --bc-pattern options? If the UMI is only in reads1:
umi_tools extract -I $fq1 --bc-pattern=CNNNC --read2-in=$fq2 --stdout=${sampleID}.R1.umi.fq.gz --read2-out=${sampleID}.R2.umi.fq.gz
If the UMI is only in reads2:
umi_tools extract -I $fq1 --bc-pattern2=CNNNC --read2-in=$fq2 --stdout=${sampleID}.R1.umi.fq.gz --read2-out=${sampleID}.R2.umi.fq.gz
Is the default value of --bc-pattern and --bc-pattern2 NONE? |
1. Should I trim adapters before I deal with the UMIs (with tools like trimmomatic, fastp or cutadapt)? I guess it is a must, but I want to confirm it with you.
2. Do tools like umi_tools remove the unneeded reads instead of just adding a duplicate tag, as samtools markdup or picard markdup do? |
Because I also use NGS data for CNV calling: do you think UMI data is needed for CNV (for example, the dedup function in umi_tools)? I think I just need to extract the UMI data and do not need to build consensus reads. What do you think? |
I agree
…On Thu, 17 Jun 2021 at 16:37, worker000000 ***@***.***> wrote:
because I also use ngs data for cnv calling, do you think umi data is
needed for cnv(for example the dedup function in umitools), I think I just
need to extract the umi data, but not needed to do consensus reads, how do
you think ?
@IanSudbery can you share your opinion about my questions?
1. Should I trim adapters before I deal with the UMIs (with tools like trimmomatic, fastp or cutadapt)? I guess it is a must, but I want to confirm it with you.
2. Do tools like umi_tools remove the unneeded reads instead of just adding a duplicate tag, as samtools markdup or picard markdup do? |
Dear professor,
Thanks a lot for your powerful tool.
1. In paired-end mode, it will ignore the UMI of the second read. Will this affect the accuracy of the data, for example by producing false-positive variants on just one strand? Why not use both UMIs? Is there an underlying reason?
2. My UMI is a 5-base UMI at the 5' end of both reads1 and reads2. The first base of the UMI is always low quality, so it needs to be removed; the last base is a constant base (used for T/A ligation).
I tried to use umi_tools extract like this:
umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz
but the headers of the mate reads in reads1 and reads2 look like this:
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT
This is not what I expected. In umi_tools, C = cell barcode, N = UMI, P = plate, X = read sequence. Is there an error in my command?
3. I want to know whether umi_tools is suitable for UMIs in both reads, because tools like gencore and UMICollapse cannot handle UMIs in both reads, which may give rise to false-positive variants on one strand.