can umicollapse be used for single umi and duplex umi data #10

Open
worker000000 opened this issue Jun 15, 2021 · 16 comments

Comments

@worker000000

Dear professor,
Thanks a lot for making such a great tool. Can it be used for single-UMI and duplex-UMI data?

@worker000000
Author

We always have FASTQ data, and I saw that this tool provides both FASTQ and BAM modes.

--merge: method for identifying which UMI to keep out of every two UMIs. Either any, avgqual, or mapqual. Default: mapqual for SAM/BAM mode, avgqual for FASTQ mode.

Can you share a full example command for this?

--paired: use paired-end mode, which deduplicates pairs of reads from a SAM/BAM file. The template length of each read pair, along with the alignment coordinate and UMI of the forwards read, are used to deduplicate read pairs. This is very memory intensive, and the input SAM/BAM files should be sorted. Default: false (single-end).

How should I prepare this BAM? Should I first extract the UMIs from the FASTQ and then run bwa mem and sort, or just run bwa mem and sort?

@Daniel-Liu-c0deb0t
Owner

Daniel-Liu-c0deb0t commented Jun 15, 2021

FASTQ and BAM data are processed differently.

FASTQ data is deduplicated based on the entire read. This mode does not support paired-end reads. This mode is used to deduplicate data without having to align to a reference. Aligning first is time-consuming, but it may give better results.

Here is an example of the merge flag:

./umicollapse fastq -i input.fastq -o output.fastq --merge avgqual

However, this is redundant because by default avgqual is used for FASTQ mode. This means that the read with the highest average quality score is the only one that is output when collapsing a group of multiple reads.
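
As a rough illustration of what "highest average quality score" means, here is a minimal Python sketch. It assumes Phred+33 quality strings and a group of reads that has already been judged to be duplicates; the function names are made up for this example and this is not UMICollapse's actual code.

```python
def average_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 is an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def pick_by_avgqual(reads):
    """Return the (name, seq, qual) tuple with the highest mean quality."""
    return max(reads, key=lambda read: average_quality(read[2]))


# Two reads that were grouped together as duplicates (toy data).
group = [
    ("read_a", "ACGT", "IIII"),  # Phred 40 at every base
    ("read_b", "ACGT", "II#I"),  # one base at Phred 2
]
best = pick_by_avgqual(group)
print(best[0])
```

Only the selected read would be written to the output; the others in the group are dropped.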

For BAM mode, two major steps have to be done before running UMICollapse:

  1. Extract UMIs from the reads in FASTQ format and add them to the read headers
  2. Align reads to get a BAM file

UMICollapse does not do these two steps. You should follow the instructions here: https://umi-tools.readthedocs.io/en/latest/QUICK_START.html
The only difference is using UMICollapse instead of UMI-tools. UMICollapse should be much faster than UMI-tools, but it should produce very similar results.
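
To make step 1 concrete, here is a minimal Python sketch of what "extract the UMI and add it to the header" does to one FASTQ record. In a real pipeline a tool such as UMI-tools performs this step; the function name, the fixed UMI length, and the `_` separator are assumptions for illustration.

```python
def move_umi_to_header(name: str, seq: str, qual: str, umi_len: int, sep: str = "_"):
    """Cut the first umi_len bases off the read and append them to the read ID."""
    umi = seq[:umi_len]
    # The read ID is everything before the first space; the rest is a comment.
    read_id, _, comment = name.partition(" ")
    new_name = f"{read_id}{sep}{umi}"
    if comment:
        new_name += " " + comment
    # Sequence and quality must be trimmed by the same amount.
    return new_name, seq[umi_len:], qual[umi_len:]


record = move_umi_to_header("@r1 1:N:0:ACGT", "AACCGGTTAC", "IIIIIIIIII", umi_len=4)
print(record)
```

After this step the UMI no longer takes part in alignment, but it is still available in the read name of every BAM record.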

Let me know if you have any other questions. Also I'm not a professor.

@worker000000
Author

worker000000 commented Jun 16, 2021

Thanks a lot for your quick and helpful reply. I have some other questions.
1. Can UMICollapse be used for single-UMI and duplex-UMI data?
2. How does the autodetect mode work, and is it accurate enough?
3. My UMI design is duplex (both reads carry a UMI). The first base of the UMI has poor sequencing quality, so we ignore it; the next three bases are the actual UMI, followed by a constant T (for T/A ligation). Can I use the autodetect mode?
4. The UMI-tools guide uses bowtie. Is it better than bowtie2 and bwa?

@Daniel-Liu-c0deb0t
Owner

For the first 3 questions:
UMICollapse can handle single UMIs. In paired-end mode, it will ignore the UMI of the second read.

How the UMI is preprocessed is not handled by UMICollapse. You will have to extract the UMIs from the reads, remove the first base, and put this cleaned up UMI in the read header before alignment. The UMI in the header is what is used by UMICollapse. The only thing that is autodetected is the length of this UMI in the header. UMI-tools provides a way to extract UMIs, ignore bases, and put them in the header, based on a certain pattern.
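
As a sketch of what "the UMI in the header" means in practice: assuming the UMI is the text after the last separator in the read name (an assumption for illustration, not a statement about UMICollapse's internal parsing), its length does not need to be configured because it can be read off the name itself. The function name is hypothetical.

```python
def umi_from_read_name(read_name: str, sep: str = "_") -> str:
    """Return the text after the last separator in the read name."""
    if sep not in read_name:
        raise ValueError(f"no {sep!r} separator found in read name: {read_name}")
    return read_name.rsplit(sep, 1)[1]


# Read name in the style produced by UMI extraction tools (toy example).
umi = umi_from_read_name("A00582:632:H7F23DSX2:3:1101:4399:1251_ACCTAA")
print(umi, len(umi))
```

Any cleanup, such as dropping a low-quality first base, has to happen before the UMI is written into the name.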

For the fourth question, newer tools are probably better, but I'm not sure.

@worker000000
Author

Thanks a lot for your kind and fast reply.
1. You said that in paired-end mode it will ignore the UMI of the second read. Will that affect the accuracy of the data, for example by producing false-positive variants supported by only one strand? Why not use both UMIs? Is there an underlying reason?

2. My UMI is a 5-base UMI at the 5' end of both read 1 and read 2. The first base of the UMI is always low quality, so it needs to be removed; the last base of the UMI is a constant base (for T/A ligation).

I tried to use umi_tools extract like this:
umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I t_1.fq.gz -S out.R1_TMP_umitools.fq.gz --read2-in=t_2.fq.gz --read2-out=out.R2_TMP_umitools.fq.gz
but the headers of the mate reads in read 1 and read 2 look like this:
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 1:N:0:GCAGCTGT+GCTCTAGT
@A00582:632:H7F23DSX2:3:1101:4399:1251_TTCT_ACCTAA 2:N:0:GCAGCTGT+GCTCTAGT
This is not what I expected. In umi_tools, C = cell barcode, N = UMI, P = plate, and X = read sequence. Is there an error in my command?

@Daniel-Liu-c0deb0t
Owner

It seems that 2 is answered by the excellent maintainers behind UMI-tools.

For 1, UMICollapse was originally created as a proof of concept for better algorithms for deduplicating very large numbers of UMIs. This meant that not all features were implemented, only the most important ones for single-end sequences. Later, more features were added in response to user requests, but it is still not as feature-complete as UMI-tools, which has existed for a long time. I would recommend UMICollapse only if you run into issues with other tools on massive datasets. I agree with the UMI-tools maintainers that with only 6bp UMIs, there wouldn't be a lot of UMIs to deduplicate.

For your case, if you really wanted to use UMICollapse, there is a workaround where you extract the UMIs from both reads and place them in the header of the first read, then deduplicate using paired end mode.

@worker000000
Author

Thanks a lot. So I need to remove the UMI from read 2, is that right?

@Daniel-Liu-c0deb0t
Owner

Ideally, you would remove the UMI from read2 and concatenate it to the UMI of read1 (to form a 6bp UMI) and place this UMI in the read1 FASTQ headers.

@worker000000
Author

Is there any tool that can remove this efficiently? Thanks a lot.

@Daniel-Liu-c0deb0t
Owner

Perhaps you can do it with UMI-tools? They have a way of extracting UMIs from read1 and read2 and putting them in the respective headers.

If you want to concatenate the UMIs and put them in the read1 header, then you may have to write a simple script to do it. I don't think UMI-tools can handle that.
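
Such a script could look roughly like the Python sketch below. It assumes the layout described earlier in this thread (one low-quality base, three UMI bases, one constant base at the start of each read) and works on in-memory (name, seq, qual) tuples; reading and writing the gzipped FASTQ files is left out. All function names are hypothetical. The combined UMI is written to both headers here so that the mates keep identical read IDs, which many aligners expect; only the read1 copy matters for deduplication.

```python
def split_umi(seq, qual, skip_before=1, umi_len=3, skip_after=1):
    """Return (umi, trimmed_seq, trimmed_qual) for one read."""
    start = skip_before
    end = start + umi_len
    trimmed_from = end + skip_after  # also drop the constant base
    return seq[start:end], seq[trimmed_from:], qual[trimmed_from:]


def add_umi(name, umi, sep="_"):
    """Append the UMI to the read ID, keeping any comment after the space."""
    read_id, _, comment = name.partition(" ")
    new_name = f"{read_id}{sep}{umi}"
    return f"{new_name} {comment}" if comment else new_name


def tag_pair(rec1, rec2, sep="_"):
    """Concatenate the UMIs of both mates and place the result in the headers."""
    name1, seq1, qual1 = rec1
    name2, seq2, qual2 = rec2
    umi1, seq1, qual1 = split_umi(seq1, qual1)
    umi2, seq2, qual2 = split_umi(seq2, qual2)
    combined = umi1 + umi2  # 3 + 3 = 6 bases
    return (
        (add_umi(name1, combined, sep), seq1, qual1),
        (add_umi(name2, combined, sep), seq2, qual2),
    )


# Toy read pair: G|ACT|T|GGGCCC and C|TGA|T|AAATTT
r1 = ("@r1 1:N:0", "GACTTGGGCCC", "IIIIIIIIIII")
r2 = ("@r1 2:N:0", "CTGATAAATTT", "IIIIIIIIIII")
out1, out2 = tag_pair(r1, r2)
print(out1)
print(out2)
```

The trimmed reads would then be aligned and sorted as usual before running UMICollapse with --paired.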

@worker000000
Author

Thanks a lot. Do you mean doing the following?

umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I 28_1.fq.gz -S R1_TMP_umitools.fq.gz --read2-in=28_2.fq.gz --read2-out=R2_TMP_umitools.fq.gz

Then delete the cell barcode and UMI from the read 2 headers:

zless R2_TMP_umitools.fq.gz | sed -r 's#(@[^_]+)_[^ ]+( 2:N:0)#\1\2#' | pigz - > n2.fq.gz

Then align with bwa and sort with samtools, and finally run:

./umicollapse bam -i paired_example.bam -o dedup_paired_example.bam --umi-sep _ --paired --two-pass

@worker000000
Author

Could you have a look at my issue CGATOxford/UMI-tools#477? Thanks a lot.

@Daniel-Liu-c0deb0t
Owner

Are you removing the UMIs in the FASTQ headers for read2? You do not need to do that. You only need to extract the UMI from the read sequences so it does not interfere with alignment (this was what I meant by "remove" in my previous comments; the UMIs need to be removed from the sequences, but not the headers). UMICollapse simply ignores the UMIs that are in the headers of the read2 FASTQ files, so there is no need to remove them. Sorry, I was not very clear on this.

I hate to say this but I can't write your pipeline for you. I can only provide help related to this tool, so if you have more general concerns I suggest asking on biostars or something.

@worker000000
Author

Are you removing the UMIs in the FASTQ headers for read2? You do not need to do that. You only need to extract the UMI from the read sequences so it does not interfere with alignment (this was what I meant by "remove" in my previous comments; the UMIs need to be removed from the sequences, but not the headers). UMICollapse simply ignores the UMIs that are in the headers of the read2 FASTQ files, so there is no need to remove them. Sorry, I was not very clear on this.

I hate to say this but I can't write your pipeline for you. I can only provide help related to this tool, so if you have more general concerns I suggest asking on biostars or something.

Thanks a lot. So you mean I should run it in paired-end mode.
You said that only the UMI of read 1 is used. After the duplicate reads in read 1 are removed, how are their mates in read 2 handled? I am curious about this.

@Daniel-Liu-c0deb0t
Owner

Daniel-Liu-c0deb0t commented Jun 21, 2021

When you pass in the --paired flag, any read1 that is removed will cause its corresponding read2 to be removed too. (Same behavior as UMI-tools)
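
A simplified Python sketch of that behavior is shown below. It assumes pairs are grouped by the forward read's alignment coordinate, template length, and UMI, that the pair with the highest mapping quality is kept, and that UMIs must match exactly. The real tool also merges similar UMIs, so this is an illustration rather than its algorithm; the names and toy data are made up.

```python
from collections import defaultdict


def dedup_pairs(pairs):
    """Return the names of the read pairs to keep.

    Each item describes the forward read of a pair with the keys
    name, pos, tlen, umi, and mapq.
    """
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair["pos"], pair["tlen"], pair["umi"])].append(pair)
    kept = set()
    for members in groups.values():
        best = max(members, key=lambda p: p["mapq"])
        kept.add(best["name"])
    return kept


pairs = [
    {"name": "p1", "pos": 100, "tlen": 250, "umi": "ACTTGA", "mapq": 60},
    {"name": "p2", "pos": 100, "tlen": 250, "umi": "ACTTGA", "mapq": 30},
    {"name": "p3", "pos": 100, "tlen": 300, "umi": "ACTTGA", "mapq": 60},
]
kept = dedup_pairs(pairs)

# Both mates share the pair name, so dropping a read1 drops its read2 too.
records = [(name, mate) for name in ("p1", "p2", "p3") for mate in (1, 2)]
survivors = [rec for rec in records if rec[0] in kept]
print(survivors)
```

Here p2 duplicates p1 and is removed together with its mate, while p3 survives because its template length differs.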

@worker000000
Author

When you pass in the --paired flag, any read1 that is removed will cause its corresponding read2 to be removed too. (Same behavior as UMI-tools)

Thanks a lot for your helpful answer.
