Can UMICollapse be used for single UMI and duplex UMI data? #10
Comments
We always have FASTQ data, and I saw that this tool supports both FASTQ and BAM modes. Can you share a full command for this? How should I prepare the BAM? Should I first extract the UMI from the FASTQ, then run bwa mem and sort? Or just run bwa mem and sort?
FASTQ and BAM data are processed differently. FASTQ data is deduplicated based on the entire read. This mode does not support paired-end reads. This mode is used to deduplicate data without having to align to a reference. Aligning first is time-consuming, but it may give better results. Here is an example of the merge flag:
However, this is redundant because by default avgqual is used for FASTQ mode. This means that the read with the highest average quality score is the only one that is output when collapsing a group of multiple reads. For BAM mode, two major steps have to be done before running UMICollapse:
UMICollapse does not do these two steps. You should follow the instructions here: https://umi-tools.readthedocs.io/en/latest/QUICK_START.html Let me know if you have any other questions. Also I'm not a professor.
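To illustrate the avgqual behaviour mentioned above, here is a minimal Python sketch that picks the read with the highest average quality from a group of reads already judged to be duplicates. It is not UMICollapse's implementation: the `Read` type, the `best_by_avgqual` helper, and the Phred+33 quality encoding are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Read:
    name: str
    seq: str
    qual: str  # quality string, assumed to be Phred+33 encoded


def average_quality(read: Read) -> float:
    """Mean Phred score of a read (assumes Phred+33 and a non-empty quality string)."""
    return sum(ord(c) - 33 for c in read.qual) / len(read.qual)


def best_by_avgqual(group: Iterable[Read]) -> Read:
    """Return the read with the highest average quality from a duplicate group."""
    return max(group, key=average_quality)


group = [
    Read("r1", "ACGT", "IIII"),  # Phred 40 at every base
    Read("r2", "ACGT", "I#I#"),  # alternating Phred 40 and Phred 2
]
print(best_by_avgqual(group).name)
```

Only the selected read would be written out for the group; how reads are grouped in the first place is outside the scope of this sketch.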
Thanks a lot for your quick and helpful reply. I have some other questions.
For the first 3 questions: How the UMI is preprocessed is not handled by UMICollapse. You will have to extract the UMIs from the reads, remove the first base, and put this cleaned up UMI in the read header before alignment. The UMI in the header is what is used by UMICollapse. The only thing that is autodetected is the length of this UMI in the header. UMI-tools provides a way to extract UMIs, ignore bases, and put them in the header, based on a certain pattern. For the fourth question, newer tools are probably better, but I'm not sure.
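The preprocessing described above could look roughly like the following Python sketch. It assumes the raw UMI occupies the first `raw_umi_len` bases of the read, that the first of those bases is the one to discard, and that the cleaned UMI is appended to the read name with `_` as the separator. The function name `move_umi_to_header` and all of its parameters are hypothetical; in practice `umi_tools extract` covers this step.

```python
from typing import Tuple


def move_umi_to_header(header: str, seq: str, qual: str,
                       raw_umi_len: int, sep: str = "_") -> Tuple[str, str, str]:
    """Move a leading UMI from the read sequence into the FASTQ header.

    Assumed layout: the first `raw_umi_len` bases are the raw UMI, and the
    first base of that raw UMI is dropped to give the cleaned UMI.
    """
    if len(seq) < raw_umi_len or len(seq) != len(qual):
        raise ValueError("read is too short or seq/qual lengths differ")
    umi = seq[1:raw_umi_len]  # cleaned UMI: raw UMI without its first base
    # Append the UMI to the read name (the first whitespace-delimited token),
    # keeping any description that follows it.
    name, _, description = header.partition(" ")
    new_header = f"{name}{sep}{umi}" + (f" {description}" if description else "")
    # Trim the UMI bases (and their qualities) so they do not affect alignment.
    return new_header, seq[raw_umi_len:], qual[raw_umi_len:]


record = move_umi_to_header("@read1 1:N:0", "TACGTAGGGTTT", "IIIIIIIIIIII", raw_umi_len=7)
print(record)
```

The trimmed reads are what would then be aligned, and the UMI in the header is what UMICollapse would read in BAM mode.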
Thanks a lot for your kind and fast reply. I tried to use umi_tools extract like this
It seems that 2 is answered by the excellent maintainers behind UMI-tools. For 1, UMICollapse was originally created as a proof of concept for better algorithms for deduplicating many, many UMIs. This meant that not all features were implemented, only the most important ones with single-end sequences. Later, more features were added due to user request, but it is still not as feature-complete as UMI-tools, which has existed for a long time. I would recommend UMICollapse only for cases where users encounter issues with other tools on massive datasets. I agree with the UMI-tools maintainers that with only 6bp UMIs, there wouldn't be a lot of UMIs to deduplicate. For your case, if you really wanted to use UMICollapse, there is a workaround where you extract the UMIs from both reads and place them in the header of the first read, then deduplicate using paired-end mode.
Thanks a lot. So I need to remove the UMI from read2, is that right?
Ideally, you would remove the UMI from read2 and concatenate it to the UMI of read1 (to form a 6bp UMI) and place this UMI in the read1 FASTQ headers.
Is there any tool to remove this efficiently? Thanks a lot.
Perhaps you can do it with UMI-tools? They have a way of extracting UMIs from read1 and read2 and putting them in the respective headers. If you want to concatenate the UMIs and put them in the read1 header, then you may have to write a simple script to do it. I don't think UMI-tools can handle that.
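A simple script of the kind mentioned above might be built around a helper like the one in this Python sketch. It assumes each read's own UMI has already been appended to its read name as `_UMI` (for example `@read1_ACG` for read1 and `@read1_TTA` for read2) and that the two FASTQ files are in the same order. The function names are hypothetical, and real headers may carry extra fields such as a cell barcode that this sketch does not handle.

```python
from typing import Tuple


def split_name_and_umi(header: str, sep: str = "_") -> Tuple[str, str]:
    """Split '@name_UMI description' into ('@name', 'UMI')."""
    name_token = header.split(" ", 1)[0]
    name, found, umi = name_token.rpartition(sep)
    if not found:
        raise ValueError(f"no UMI separator {sep!r} in header: {header}")
    return name, umi


def merge_umis_into_read1(header1: str, header2: str, sep: str = "_") -> str:
    """Build a read1 header carrying the read1 UMI followed by the read2 UMI."""
    name1, umi1 = split_name_and_umi(header1, sep)
    name2, umi2 = split_name_and_umi(header2, sep)
    if name1 != name2:
        raise ValueError("read1 and read2 records are out of sync")
    description = header1.partition(" ")[2]
    merged = f"{name1}{sep}{umi1}{umi2}"
    return merged + (f" {description}" if description else "")


print(merge_umis_into_read1("@read1_ACG 1:N:0", "@read1_TTA 2:N:0"))
```

With two 3bp UMIs this gives a 6bp UMI in the read1 header; the read2 header can be left unchanged.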
Thanks a lot. Do you mean doing the following?
umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I 28_1.fq.gz -S R1_TMP_umitools.fq.gz --read2-in=28_2.fq.gz --read2-out=R2_TMP_umitools.fq.gz
Delete cell_code and UMI from fq2:
zless R2_TMP_umitools.fq.gz | sed -r 's#(@[^_]+)_[^ ]+( 2:N:0)#\1\2#' | pigz - > n2.fq.gz
bwa and samtools
./umicollapse bam -i paired_example.bam -o dedup_paired_example.bam --umi-sep _ --paired --two-pass
Can you have a look at my issue in CGATOxford/UMI-tools#477? Thanks a lot.
Are you removing the UMIs in the FASTQ headers for read2? You do not need to do that. You only need to extract the UMI from the read sequences so it does not interfere with alignment (this was what I meant by "remove" in my previous comments; the UMIs need to be removed from the sequences, but not the headers). UMICollapse simply ignores the UMIs that are in the headers of the read2 FASTQ files, so there is no need to remove them. Sorry, I was not very clear on this. I hate to say this but I can't write your pipeline for you. I can only provide help related to this tool, so if you have more general concerns I suggest asking on biostars or something.
Thanks a lot. So you mean I should run it in paired-end mode,
When you pass in the |
Thanks a lot for your helpful answer.
Dear professor,
Thanks a lot for making such a super tool. Can it be used for single UMI and duplex UMI data?