-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
decision_tree_rules.txt rules seem wrong #13
Comments
We observed the same while trying to determine which mutations define that lineage. Regardless of the issue, is |
My view on this is that no, it's not (but note I'm not a member of the pangolin team) for two reasons: |
I would definitely not use the decision tree for finding which mutations define each lineage, the decision tree is fragile and as @danielmoney said includes imputed data. Its purpose is for pangolin to use for assignments- outbreak.info is a really great resource that lists the defining snps for a given lineage so would recommend that! |
Thank you both @danielmoney @aineniamh for your answers! |
I ran into the same thing when attempting to manually assign a lineage based on the decision tree for a sample that Pangolin assigns to B.1.1.222. Taking the decision tree rules in |
I think the issue here might be that you're expecting the genome positions to be 0-index when they're in fact 1-indexed (which is normal for genomes, but not normal for if statements, which I know makes things confusing!). Could you let me know if interpreting the positions as 1-indexed fixes your issue? |
Unless there's been an update to how this file is produced since my original report then, no, 0/1-indexing of the genome is not the problem. When I made the initial report, and as I say in it, the problem was the fact that the bases are effectively out by one, e.g. it has A instead of C, C instead of G etc. I think that does lead to the genomic position being off by one for decisions that are currently "-" (and only in that case) but that is not the underlying cause. Given the time that has past it would take me a little while to be absolutely certain that this is still the case but from @niemasd comment it seems like it is. |
Unfortunately I'm still not able to recreate this. Could you possibly post your code? |
Right lets first get the first feature from decision_tree_rules.txt:
which clearly suggests the rule is on whether or not 16175 is equal to A. Now lets do it in code using the joblib files and code I've cobbled together from the pangolin source code:
This clearly suggests the feature is actually whether or not 16175 is equal to C (not A). Looking at a few of these you see the pattern emerge that I mention in my first post. I've not yet been able to track down the actual bug in the code as I don't know the code base well enough. |
@emilyscher Are you sure the incides in https://github.com/niemasd/CovidQuant/blob/main/CovidQuant.py All throughout, I'm using 0-based indices for genome positions, but I have a mapping where, if https://github.com/niemasd/CovidQuant/blob/main/CovidQuant.py#L25-L31 Here's where I actually use this dictionary: https://github.com/niemasd/CovidQuant/blob/main/CovidQuant.py#L117 Here's the consensus sequence of the sample I'm playing with (my script is attempting to do some things purely from trimmed BAMs, but the issue can be reproduced with the consensus sequence): B.1.1.222.txt (Pangolin assigns it to B.1.1.222) Note that the BAM I'm using in the examples below are mapped to the reference genome that Pangolin uses: https://github.com/cov-lineages/pangolin/blob/master/pangolin/data/reference.fasta In the examples below, each line
If I assume 0-based indexing and take the
|
Based on @niemasd comments it would seem I may well be wrong in my assumption that "-" would also have the wrong genome position (although I did say I'd never tested it). |
@danielmoney In this specific sample I'm playing with, there happen to not be any gaps at any of the positions in the decision tree conditions that are reached, nor +/- 1 position over (to account for off-by-one in indices), so I think the |
I've been investigating why pangolin was calling some B.1.617 sequences incorrectly as other lineages and as part of this have been looking at the decision_tree_rules.txt file. I ended up comparing the actual decision tree (loaded using joblib) with what was in decision_tree_rules.txt and there seems to be a mismatch.
I'm pretty certain that there is an off-by-one error somewhere. decision_tree_rules has the following as the first few decisions for the first B.1.617.2 rule:
B.1.617.2 24505=='G',18423=='-',26800!='C',28882!='A',16888=='-',25504!='C',24075!='A'
where as I believe it should be:
B.1.617.2 24505=='T',18423=='A',26800!='G',28882!='C',16888=='A',25504!='G',24075!='C'
Given that the bases are ordered -,A,C,G,T in the one hot encoding it seems that every base should be one "higher". Presumably this also means the position is wrong when the actual test is on - as this would, I believe, show in the decision tree as a T at the previous position. It does appear that this affects all the rules not just that B.1.617.2 rule.
Given this is an off by one error I'm wondering if this is related to "lineage" being the first entry in the header array?
I realise that pangolin doesn't actually use decision_tree_rules.txt so this doesn't affect the calls made by pangolin but it will affect anyone trying to work out why pangolin made the call it did.
(As an aside the incorrect calling was due to the presence of an unknown base at a key location which pangolin assigned to be the reference base and which in turn led the assigner into the wrong part of the decision tree).
The text was updated successfully, but these errors were encountered: