Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detecting doublets whose cells came from the same sample #1

Open
aryarm opened this issue Aug 4, 2020 · 8 comments
Open

detecting doublets whose cells came from the same sample #1

aryarm opened this issue Aug 4, 2020 · 8 comments

Comments

@aryarm
Copy link
Collaborator

aryarm commented Aug 4, 2020

I like to use Github issues as TODO lists. First item of business: conflicting UMIs.

The problem

UMIs are used by demuxlet and appear in the UB tags within the BAM file. When we create doublets, there is the possibility of duplicate UMIs appearing within the same doublet.

Is this a cause for concern?

Probably. demuxlet might discard duplicate UMIs or otherwise treat duplicates differently.
But it's hard to say how prevalent this problem will be until we actually encounter it.

The solution

Not sure about this yet. Ideally, the new_bam.py script that changes the CB tags could change the UMIs in a way that prevents them from conflicting. But I'm not sure how we would do this given our current workflow design: new_bam.py is called once for each sample's BAM file; they aren't considered in tandem.
Another idea would be to merge the BAM files together and then fix the duplicate UMIs after that. That might be fastest, but it would probably involve extra steps, like sorting the merged BAM by UB tag (ugh).

@aryarm
Copy link
Collaborator Author

aryarm commented Aug 22, 2020

ok, after analyzing the results of the simulation, I've come to the conclusion that this issue is actually something we need to be worried about

# how many doublets did we create with the same sample?
len(list(filter(lambda samp: samp[0] == samp[1], truth[truth['type'] == 'DBL']['sample'])))
979
# and how many doublets were predicted to have the same sample?
len(list(filter(lambda samp: samp[0] == samp[1], predicts[predicts['type'] == 'DBL']['sample'])))
0

@zrcjessica
Copy link
Owner

From my understanding, are we only concerned with UMI collisions in simulated doublets? If so can we go in and check the UMIs for the doublets only, and change them if necessary.

@aryarm
Copy link
Collaborator Author

aryarm commented Aug 23, 2020

@zrcjessica you're right!
The bam file is too big to load into memory all at once and check for conflicting barcodes among the doublets, but I can try to hash the old cellular barcode with the UMI, and since at least one of the two is guaranteed to be unique at any given time, I should get a new unique UMI that works!

Now all I have to do is find a good hash function... It needs to take the cellular and molecular barcodes as input and return a new UMI consisting of only A, G, C, T.
hmmm...

@zrcjessica
Copy link
Owner

Another potential solution - how many cases of redundant UMIs do we have across all of the original .bam files? If it's not very many, we could simply drop those cells from this simulation.

@aryarm
Copy link
Collaborator Author

aryarm commented Aug 23, 2020

It's hard to know because there are too many molecular barcodes to check, but my gut tells me there will probably be conflicts.

Every time I tried to get a unique list of the molecular barcodes (in order to know how many duplicates there are), my job used up too much memory on the cluster and got killed. So I think there are just too many barcodes to check.

But my intuition tells me there will probably be a non-negligible number of conflicts. Every bead in every droplet was probably created identically save for the cellular barcode. This means the unique molecular identifiers of every bead are probably the same. Granted, only some (not all) of every bead's molecular identifiers actually get used and sequenced, but still, I'm gonna guess that there will be quite a few collisions.

@aryarm
Copy link
Collaborator Author

aryarm commented Aug 24, 2020

I came up with a hash function in case we want to do it this way:

def hash_ub(ub, cb):
    """
        return a hash of the molecular barcode (ub)
        using the cellular barcode (cb) as salt
    """
    # Convert bases to integers
    ub = [('A','C','G','T').index(ch) for ch in ub]
    cb = [('A','C','G','T').index(ch) for ch in cb]

    # The cb is usually longer, so we'll just ignore the extra
    # To retain the uniqueness of the cb we add the end chars
    # to every nth character in the new, shortened cb (where
    # n is the number of extra characters at the end of cb)
    extra_len = len(cb)-len(ub)
    for i in range(extra_len):
        cb[i*int(len(ub)/extra_len)] += cb[:-(i+1)]

    # Finally, we can add the salt (the cb characters)
    for i in range(len(ub)):
        # 1) add the cb[i] value
        # 2) mod by 4 to ensure 0 <= ub[i] <= 3
        # 3) convert back from integer to ACGT
        ub[i] = ('A','C','G','T')[(ub[i] + cb[i]) % 4]
    # Join the bases together into a string
    return "".join(ub)

In new_bam.py, we can apply this function to every read that's going into a simulated doublet in order to get new, unique, molecular barcodes.
The function as it is written cannot guarantee 100% uniqueness. I'm not smart enough to figure out how to do that. (It might even be impossible given the problem as stated?) But I think this is good enough. It will at least reduce the number of conflicts substantially, I think hope.

@aryarm
Copy link
Collaborator Author

aryarm commented Aug 24, 2020

ok, Graham acknowledged that this might be a problem, but that it doesn't matter because we're going to be throwing out the doublets when we use demuxlet anyway
also, he doesn't think demuxlet can identify doublets whose cells come from the same sample, but he's interested in the idea of us doing this ourselves after running demuxlet (perhaps using a statistical test to check for an outsized number of UMIs)

@aryarm aryarm closed this as completed Aug 24, 2020
@aryarm
Copy link
Collaborator Author

aryarm commented Aug 24, 2020

Ok, I'm actually going to reopen this issue to keep track of work for our new idea: discovering doublets whose cells come from the same sample.

The first order of business is to create some sort of a histogram for the number of UMIs across each droplet, colored by doublets vs singlets.

@aryarm aryarm reopened this Aug 24, 2020
@aryarm aryarm changed the title conflicting UMIs detecting doublets whose cells came from the same sample Aug 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants