ENH: Differentiate between clade defining mutations and optional mutations #15

corneliusroemer · 2022-03-23T20:11:28Z

If I understand your script correctly, you treat all mutations above the user-specified threshold identically.

There's room for improvement there.

It would make sense to use two kinds of mutation types for each clade:

Defining mutations that should be present in (almost) all sequences of a clade, so maybe all those mutations present in >95% of its sequences. If these are absent, it means there's a problem either with sequence quality or something else. Absence should count heavily against assigning the sequence to the clade.
Common mutations that sometimes occur, but whose absence does not mean much. Rather, the presence of these mutations increases the probability of a sequence belonging to the clade.

Do you know what I mean? One threshold does not suffice for both concepts.
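To make the idea concrete, here is a minimal sketch of the two-threshold split. It is not taken from the script: the function names, the `freqs` mapping (mutation to its frequency within a clade) and the lower cut-off of 0.20 are all hypothetical; only the 0.95 echoes the ">95%" above.

```python
from typing import Dict, Set, Tuple

# Hypothetical cut-offs: 0.95 mirrors the ">95%" suggestion,
# 0.20 is an arbitrary illustrative lower bound for "common" mutations.
DEFINING_MIN = 0.95
COMMON_MIN = 0.20


def split_mutations(
    freqs: Dict[str, float],
    defining_min: float = DEFINING_MIN,
    common_min: float = COMMON_MIN,
) -> Tuple[Set[str], Set[str]]:
    """Split a clade's mutations into (defining, common) by frequency."""
    defining = {m for m, f in freqs.items() if f > defining_min}
    common = {m for m, f in freqs.items() if common_min <= f <= defining_min}
    return defining, common


def compare(
    sample: Set[str], defining: Set[str], common: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (defining mutations missing from the sample,
    common mutations present in the sample)."""
    return defining - sample, common & sample
```

A missing defining mutation would then count as strong evidence against the clade (or as a quality flag), while each matching common mutation would only add support.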

I'll think a bit more about recombinant detection myself - maybe there are further improvements possible. This is an amazing tool already, though!

lenaschimmel · 2022-03-23T20:23:38Z

Yeah, I absolutely get what you mean. I know this is not ideal, and having two thresholds would already be a big improvement. Maybe I will change it that way soon.

On the other hand, I have a (still very vague) concept of probability computations in my head that would be even more powerful and need no hard thresholds at all. It would also affect the way that breakpoints (and intermissions, if they still exist) are handled and the way the output is displayed. Maybe that's more like version 2.0 of this tool, nothing for the near future.

I'll keep thinking about it!

PS: That probability stuff might be a lot of hard work, but after working on these kinds of probability computations in my previous project, Dystonse, that doesn't scare me any more.

corneliusroemer · 2022-03-23T21:37:31Z

I think I know what you mean, something like max likelihood and/or naive Bayes could be applicable here.
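As a rough illustration of the naive Bayes idea, here is a sketch that scores a sample against each clade using per-mutation frequencies. Everything in it is an assumption for illustration: the names `log_likelihood` and `best_clade`, the `clades` mapping (clade name to mutation frequencies), independence between mutations, a uniform prior over clades, and the clamping value `eps`.

```python
import math
from typing import Dict, Set


def log_likelihood(
    sample: Set[str],
    clade_freqs: Dict[str, float],
    sites: Set[str],
    eps: float = 0.01,
) -> float:
    """Log-probability of the sample's presence/absence pattern over `sites`,
    treating mutations as independent given the clade (naive Bayes)."""
    total = 0.0
    for m in sites:
        # Clamp frequencies away from 0 and 1 so one mismatch is never fatal.
        p = min(max(clade_freqs.get(m, 0.0), eps), 1.0 - eps)
        total += math.log(p if m in sample else 1.0 - p)
    return total


def best_clade(sample: Set[str], clades: Dict[str, Dict[str, float]]) -> str:
    """Pick the clade with the highest log-likelihood (uniform prior assumed)."""
    sites = set().union(*(freqs.keys() for freqs in clades.values()))
    return max(clades, key=lambda c: log_likelihood(sample, clades[c], sites))
```

With a score like this, no hard threshold is needed: a missing mutation that is near-universal in a clade costs a lot, while the presence of a moderately frequent one adds a little support.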

maciekboni · 2022-03-26T19:17:07Z

Hi both, I can walk you through, over a Zoom call, what the Delta_(m,n,2) statistic gets you and how it's constructed. It's non-parametric and you won't need to set thresholds. A table of the p-values is also pre-built (this is the computationally expensive part), so you just look them up as you need them.

lenaschimmel mentioned this issue Mar 26, 2022

Have a look at the Δ_(m,n,2) statistic for breakpoint detection #21

Open
