ENH: Differentiate between clade defining mutations and optional mutations #15

Open
corneliusroemer opened this issue Mar 23, 2022 · 3 comments

Comments

@corneliusroemer

If I understand your script correctly, you treat all mutations that are above the user-specified threshold identically.

There's room for improvement there.

It would make sense to distinguish two kinds of mutations for each clade:

  1. Defining mutations that should be present in (almost) all sequences of a clade, for example all mutations present in more than 95% of its sequences. If one of these is absent, there is a problem, either with sequence quality or something else. Their absence should weigh heavily against assigning the sequence to the clade.
  2. Common mutations that sometimes occur, but whose absence does not mean much. Rather, the presence of these mutations increases the probability of a sequence belonging to the clade.

Do you know what I mean? One threshold does not suffice for both concepts.
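
A minimal sketch of the two-tier idea, assuming per-clade mutation frequencies are available as a mapping from mutation name to the fraction of clade sequences that carry it. The function names, the 25% lower cutoff and the penalty and bonus weights are hypothetical and not taken from the script; only the >95% cutoff comes from the suggestion above.

```python
from typing import Dict, Set, Tuple


def split_mutations(
    freqs: Dict[str, float],
    defining_min: float = 0.95,  # cutoff suggested in this issue
    common_min: float = 0.25,    # hypothetical lower cutoff
) -> Tuple[Set[str], Set[str]]:
    """Split a clade's mutations into (defining, common) by frequency."""
    defining = {m for m, f in freqs.items() if f > defining_min}
    common = {m for m, f in freqs.items() if common_min <= f <= defining_min}
    return defining, common


def score_sequence(
    seq_muts: Set[str],
    defining: Set[str],
    common: Set[str],
    missing_penalty: float = 3.0,  # hypothetical weight
    common_bonus: float = 1.0,     # hypothetical weight
) -> float:
    """Penalise missing defining mutations, reward present common ones.

    An absent common mutation contributes nothing, matching the idea
    that its absence "does not mean much".
    """
    missing_defining = defining - seq_muts
    present_common = common & seq_muts
    return common_bonus * len(present_common) - missing_penalty * len(missing_defining)


# Illustrative frequencies only, not real clade data.
clade_freqs = {"A23403G": 0.99, "C241T": 0.97, "G28881A": 0.60, "T1000C": 0.05}
defining, common = split_mutations(clade_freqs)
score = score_sequence({"A23403G", "G28881A"}, defining, common)
```

With two separate sets, a missing defining mutation can be penalised much harder than a missing common one, which a single threshold cannot express.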

I'll think a bit more about recombinant detection myself - maybe there are further improvements possible. This is an amazing tool already, though!

@lenaschimmel
Owner

lenaschimmel commented Mar 23, 2022

Yeah, I absolutely get what you mean. I know this is not ideal, and having two thresholds would already be a big improvement. Maybe I will change it that way soon.

On the other hand, I have a (still very vague) concept of probability computations in my head that would be even more powerful and need no hard thresholds at all. It would also affect the way that breakpoints (and intermissions, if they still exist) are handled and the way the output is displayed. Maybe that's more like version 2.0 of this tool, nothing for the near future.

I'll keep thinking about it!

PS: That probability stuff might be a lot of hard work, but since working on similar probability computations in my previous project, Dystonse, that doesn't scare me any more.

@corneliusroemer
Author

I think I know what you mean, something like maximum likelihood and/or naive Bayes could be applicable here.
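
A rough sketch of what a naive Bayes style assignment could look like, assuming each clade is described by per-mutation frequencies and that mutations are treated as independent. The function names, the `eps` clamp and the toy clades are hypothetical; this is not how the tool currently works.

```python
import math
from typing import Dict, Iterable, Set, Tuple


def log_likelihood(
    seq_muts: Set[str],
    clade_freqs: Dict[str, float],
    sites: Iterable[str],
    eps: float = 0.01,  # hypothetical clamp, avoids log(0)
) -> float:
    """Log-likelihood of a sequence's mutations under one clade."""
    total = 0.0
    for m in sites:
        # Mutations unknown to the clade get frequency 0, clamped to eps.
        p = min(max(clade_freqs.get(m, 0.0), eps), 1.0 - eps)
        total += math.log(p if m in seq_muts else 1.0 - p)
    return total


def most_likely_clade(
    seq_muts: Set[str],
    clades: Dict[str, Dict[str, float]],
    eps: float = 0.01,
) -> Tuple[str, Dict[str, float]]:
    """Return the maximum-likelihood clade and all per-clade scores."""
    # Score every clade over the same set of sites so scores are comparable.
    sites = set().union(*(f.keys() for f in clades.values()))
    scores = {
        name: log_likelihood(seq_muts, freqs, sites, eps)
        for name, freqs in clades.items()
    }
    return max(scores, key=scores.get), scores


# Toy clades with made-up frequencies.
clades = {
    "X": {"m1": 0.99, "m2": 0.50},
    "Y": {"m3": 0.98},
}
best, scores = most_likely_clade({"m1"}, clades)
```

No threshold is needed here: a missing 99% mutation costs log(0.01), while a missing 50% mutation costs only log(0.5), so the defining/common distinction falls out of the frequencies themselves.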

@maciekboni

Hi both - I can walk you through, over a Zoom call, what the Delta_(m,n,2) statistic gets you and how it's constructed. It's non-parametric and you won't need to set thresholds. Also, a table of the p-values is pre-built (this is the computationally expensive part), so you just look them up as you need them.
