detecting doublets whose cells came from the same sample #1
Comments
ok, after analyzing the results of the simulation, I've come to the conclusion that this issue is actually something we need to be worried about
From my understanding, we are only concerned with UMI collisions in simulated doublets, right? If so, can we go in and check the UMIs for the doublets only, and change them if necessary?
@zrcjessica you're right! Now all I have to do is find a good hash function... It needs to take the cellular and molecular barcodes as input and return a new UMI consisting of only A, G, C, T.
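For illustration, here is a minimal sketch of what a hash with those properties could look like. It is not the function mentioned later in this thread; the name rehash_umi, the choice of SHA-256, and the decision to keep the UMI length unchanged are all assumptions made for the example.

```python
import hashlib

BASES = "ACGT"


def rehash_umi(cell_barcode: str, umi: str) -> str:
    """Map a (cell barcode, UMI) pair to a new UMI of the same length.

    Hypothetical helper: the output uses only A, C, G and T, and the same
    input pair always yields the same output, so every read from one
    molecule keeps a consistent UMI.
    """
    # SHA-256 yields 256 bits, i.e. at most 128 two-bit bases.
    if len(umi) > 128:
        raise ValueError("UMI too long for a single SHA-256 digest")
    digest = hashlib.sha256(f"{cell_barcode}:{umi}".encode("ascii")).digest()
    value = int.from_bytes(digest, "big")
    new_umi = []
    for _ in range(len(umi)):
        # Consume the digest two bits at a time; each pair picks one base.
        new_umi.append(BASES[value & 0b11])
        value >>= 2
    return "".join(new_umi)
```

Because the cell barcode is part of the input, identical UMIs from two different cells will usually map to different outputs. A hash like this only makes collisions as unlikely as random chance; it cannot rule them out, since the output space is still limited by the UMI length.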
Another potential solution - how many cases of redundant UMIs do we have across all of the original .bam files? If it's not very many, we could simply drop those cells from this simulation.
It's hard to know because there are too many molecular barcodes to check. Every time I tried to get a unique list of the molecular barcodes (in order to know how many duplicates there are), my job used up too much memory on the cluster and got killed. But my intuition tells me there will probably be a non-negligible number of conflicts. Every bead in every droplet was probably created identically save for the cellular barcode, which means every bead probably carries the same set of unique molecular identifiers. Granted, only some (not all) of every bead's molecular identifiers actually get used and sequenced, but still, I'm gonna guess that there will be quite a few collisions.
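One way to sidestep the memory problem might be to skip the global unique list and only collect UMIs for the cells that actually get paired into doublets. The sketch below assumes the reads are available as (cell barcode, UMI) pairs, for example the CB and UB tags of each BAM record, and that barcodes are unique across the inputs; the function names are hypothetical.

```python
from collections import defaultdict


def umis_for_barcodes(reads, barcodes):
    """Collect the set of UMIs seen for each barcode of interest.

    reads: iterable of (cell_barcode, umi) pairs (assumed input format).
    barcodes: the cell barcodes that take part in a simulated doublet.
    """
    wanted = set(barcodes)
    umis = defaultdict(set)
    for cb, ub in reads:
        # Reads from cells outside the doublets are skipped, so memory
        # grows with the number of doublet cells, not with the whole file.
        if cb in wanted:
            umis[cb].add(ub)
    return dict(umis)


def count_collisions(umis, doublet_pairs):
    """Count UMIs shared by the two cells of each simulated doublet."""
    return {
        (a, b): len(umis.get(a, set()) & umis.get(b, set()))
        for a, b in doublet_pairs
    }
```

Since reads is consumed as a stream, the BAM files only need to be read once, and the result says directly how many UMIs would conflict inside each doublet.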
I came up with a hash function in case we want to do it this way:
In |
ok, Graham acknowledged that this might be a problem, but that it doesn't matter because we're going to be throwing out the doublets when we use demuxlet anyway
Ok, I'm actually going to reopen this issue to keep track of work for our new idea: discovering doublets whose cells come from the same sample. The first order of business is to create a histogram of the number of UMIs in each droplet, colored by doublets vs singlets.
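As a rough sketch of the counting behind that histogram: the code below tallies distinct UMIs per droplet and bins the counts separately for singlets and doublets. It assumes reads arrive as (cell barcode, UMI) pairs and that the set of doublet barcodes is already known from the simulation; the function names and the default bin width are placeholders.

```python
from collections import Counter, defaultdict


def umis_per_droplet(reads):
    """Return the number of distinct UMIs observed for each cell barcode."""
    seen = defaultdict(set)
    for cb, ub in reads:
        seen[cb].add(ub)
    return {cb: len(umis) for cb, umis in seen.items()}


def histogram_by_label(counts, doublets, bin_width=1000):
    """Bin per-droplet UMI counts, keeping singlets and doublets apart.

    counts: mapping of cell barcode to UMI count.
    doublets: set of barcodes known to be simulated doublets (assumed input).
    Each bin is keyed by its lower edge.
    """
    hist = {"singlet": Counter(), "doublet": Counter()}
    for cb, n in counts.items():
        label = "doublet" if cb in doublets else "singlet"
        hist[label][(n // bin_width) * bin_width] += 1
    return hist
```

The two counters can then be drawn as overlaid bars in any plotting library, one color per label.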
I like to use GitHub issues as TODO lists. First item of business: conflicting UMIs.
The problem
UMIs are used by demuxlet and appear in the UB tags within the BAM file. When we create doublets, there is the possibility of duplicate UMIs appearing within the same doublet.
Is this a cause for concern?
Probably. demuxlet might discard duplicate UMIs or otherwise treat duplicates differently. But it's hard to say how prevalent this problem will be until we actually encounter it.
The solution
Not sure about this yet. Ideally, the new_bam.py script that changes the CB tags could change the UMIs in a way that prevents them from conflicting. But I'm not sure how we would do this given our current workflow design: new_bam.py is called once for each sample's BAM file; they aren't considered in tandem.
Another idea would be to merge the BAM files together and then fix the duplicate UMIs after that. That might be fastest, but it would probably involve extra steps, like sorting the merged BAM by UB tag (ugh).