Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: NextGenSeqUtils.deMux(): Seperate error by gene specific region of primer and barcode region of primer #5

Open
dnk8n opened this issue Dec 5, 2018 · 1 comment

Comments

@dnk8n
Copy link

dnk8n commented Dec 5, 2018

Terminology:

Since I am new to bioinformatics, I am going to go over the terminology as I understand it.

example (dummy data), forward primer:

GATAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
*****------------------------------

where * is the barcode (differentiates samples)
where - is the gene-specific region (selects which region gets amplified)

Currently error is calculated over the entire primer (barcode + gene-specific region).

It would be nice to be able to have the option to enter forward/reverse primers seperated into regions, e.g. instead of:

forwardPrimers = [
    "GATAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT",
    "TCTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT",
    "CATCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTT",
    "TGCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"
]
reversePrimers = [
    "CTGTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "CTGTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "CTGTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "CTGTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
]

we could pass in:

forwardPrimers = [
    ["GATAG", "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"],
    ["TCTAC", "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"],
    ["CATCA", "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"],
    ["TGCAT", "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"]
]

reversePrimers = [
    ["CTGTC", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"],
    ["CTGTC", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"],
    ["CTGTC", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"],
    ["CTGTC", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"]
]

forwardNames = ["1", "3", "4", "5"]

reverseNames = ["A", "A", "A", "A"]

sequences = [
    "TGCATTTTTTTTTTTTTTTTGGGGGGTTTTTTTTTAAAAATTGGGGGGTTTTTTTTTTTTTTTTTTTTTTGACAG",
    "CTGTCAAAAAAAAAAAAAAAAAAAAAACCCCCCAATTTTTAAAAAAAAACCCCCCAAAAAAAAAAAAAAAATGCA"
]

and get back an output like this:

[
    ["5_A", "TGCATTTTTTTTTTTTTTTTGGGGGGTTTTTTTTTAAAAATTGGGGGGTTTTTTTTTTTTTTTTTTTTTTGACAG", false, [0, -6.0]],
    ["5_A", "TGCATTTTTTTTTTTTTTTTGGGGGGTTTTTTTTTAAAAATTGGGGGGTTTTTTTTTTTTTTTTTTTTTTGACAG", true, [0, -6.0]]
]

This way we can accept a higher error in the gene-specific region, but can still demultiplex according to the barcode error (with a lower tolerance for error).

One step further, it might be nice to just return all error and let the user make decisions based on it, e.g.:

[
    <forward barcode error>,
    <forward gene-specific region error>,
    <reverse barcode error>,
    <reverse gene-specific region error>
]

As I said, pretty new to this. Something I think would be useful. Happy to have a conversation and be put onto alternate solutions.

I am happy to dig into some Julia and work on this if it is decided that this would be a useful feature.

@dnk8n dnk8n changed the title [Feature Request]: Seperate error by gene specific region of primer and barcode region of primer [Feature Request]: NextGenSeqUtils.deMux(): Seperate error by gene specific region of primer and barcode region of primer Dec 5, 2018
@murrellb
Copy link
Member

murrellb commented Dec 5, 2018

I agree that allowing different thresholds for barcode vs rest-of-primer is useful, because it lets you tolerate errors in the long barcode region. Note that how I currently work around this issue is I don't include all of the non-barcode part of the primer - I include just enough bases to distinguish FWD and REV, which also lowers the computational cost of the demux step, because it requires shorter alignments.

There are other benefits of matching the entire primer though. The main one is that we often want to trim off the primer itself, which is something I haven't done robustly. I'd recommend considering this together with the multiple-region primer match score, because you need info from the primer match to get the trimming right.

Please jump in. You'll likely want to build it around the IUPAC_nuc_nw_align in demux.jl. - you could do a single alignment, and then compute the scores for the different bits based on that alignment. You'll have to make an arbitrary decision about where to place an insertion that happens between the separated segments of the primer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants