Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decision_tree_rules.txt rules seem wrong #13

Open
danielmoney opened this issue May 26, 2021 · 12 comments
Open

decision_tree_rules.txt rules seem wrong #13

danielmoney opened this issue May 26, 2021 · 12 comments

Comments

@danielmoney
Copy link

I've been investigating why pangolin was calling some B.1.617 sequences incorrectly as other lineages and as part of this have been looking at the decision_tree_rules.txt file. I ended up comparing the actual decision tree (loaded using joblib) with what was in decision_tree_rules.txt and there seems to be a mismatch.

I'm pretty certain that there is an off-by-one error somewhere. decision_tree_rules has the following as the first few decisions for the first B.1.617.2 rule:
B.1.617.2 24505=='G',18423=='-',26800!='C',28882!='A',16888=='-',25504!='C',24075!='A'
where as I believe it should be:
B.1.617.2 24505=='T',18423=='A',26800!='G',28882!='C',16888=='A',25504!='G',24075!='C'

Given that the bases are ordered -,A,C,G,T in the one hot encoding it seems that every base should be one "higher". Presumably this also means the position is wrong when the actual test is on - as this would, I believe, show in the decision tree as a T at the previous position. It does appear that this affects all the rules not just that B.1.617.2 rule.

Given this is an off by one error I'm wondering if this is related to "lineage" being the first entry in the header array?

I realise that pangolin doesn't actually use decision_tree_rules.txt so this doesn't affect the calls made by pangolin but it will affect anyone trying to work out why pangolin made the call it did.

(As an aside the incorrect calling was due to the presence of an unknown base at a key location which pangolin assigned to be the reference base and which in turn led the assigner into the wrong part of the decision tree).

@jaquol
Copy link

jaquol commented May 31, 2021

We observed the same while trying to determine which mutations define that lineage.

Regardless of the issue, is decision_tree_rules.txt the best place to systematically determine which mutations define each lineage?

@danielmoney
Copy link
Author

My view on this is that no, it's not (but note I'm not a member of the pangolin team) for two reasons:
a) I believe you're likely to miss multiple mutations that occur on the same branch as, at least in theory, the decision tree would only have to use one mutation to make it's decision. (Although also see b!)
b) There's some odd artifacts in the tree which I suspect come about from the replacement of unknown bases with the reference base. What this means is that if, for enough of the samples that is used to train the tree, one of the defining mutations is missing and is replaced with the reference bases then the decision tree could have rules where one of the defining mutations for that lineage is not present (and it calls the lineage based on one of the other defining mutations). I've definitely seen an example where it looks like that is what is happening but I haven't gone and seen what is going on with the learning dataset.

@aineniamh
Copy link
Member

I would definitely not use the decision tree for finding which mutations define each lineage, the decision tree is fragile and as @danielmoney said includes imputed data. Its purpose is for pangolin to use for assignments- outbreak.info is a really great resource that lists the defining snps for a given lineage so would recommend that!

@jaquol
Copy link

jaquol commented Jun 9, 2021

Thank you both @danielmoney @aineniamh for your answers!

@niemasd
Copy link

niemasd commented Aug 25, 2021

I ran into the same thing when attempting to manually assign a lineage based on the decision tree for a sample that Pangolin assigns to B.1.1.222. Taking the decision tree rules in decision_tree_rules.zip at face value incorrectly assigns the sample to B.1.596, but then applying the proposed fix in the first post in this Issue of interpreting A as C, C as G, etc. now correctly results in B.1.1.222. I concur that it seems as though the symbols in decision_tree_rules.zip are off-by-one

@emilyscher
Copy link
Contributor

I think the issue here might be that you're expecting the genome positions to be 0-index when they're in fact 1-indexed (which is normal for genomes, but not normal for if statements, which I know makes things confusing!). Could you let me know if interpreting the positions as 1-indexed fixes your issue?

@danielmoney
Copy link
Author

danielmoney commented Aug 25, 2021

Unless there's been an update to how this file is produced since my original report then, no, 0/1-indexing of the genome is not the problem. When I made the initial report, and as I say in it, the problem was the fact that the bases are effectively out by one, e.g. it has A instead of C, C instead of G etc. I think that does lead to the genomic position being off by one for decisions that are currently "-" (and only in that case) but that is not the underlying cause. Given the time that has past it would take me a little while to be absolutely certain that this is still the case but from @niemasd comment it seems like it is.

@emilyscher
Copy link
Contributor

Unfortunately I'm still not able to recreate this. Could you possibly post your code?

@danielmoney
Copy link
Author

Right lets first get the first feature from decision_tree_rules.txt:
tail -n +4 tree_rules.txt | cut -f2 | cut -f1 -d, | uniq
(strips the three header lines, then gets the decision rules, then gets the first rule and then returns the uniq values of the first rule). For pangoLEARN commit 34f420f this gives us:

16175!='A'
16175=='A'

which clearly suggests the rule is on whether or not 16175 is equal to A.

Now lets do it in code using the joblib files and code I've cobbled together from the pangolin source code:

import joblib

# Load the header and model
header_file = "decisionTreeHeaders_v1.joblib"
model_file = "decisionTree_v1.joblib"

model_headers = joblib.load(header_file)
loaded_model = joblib.load(model_file)

# Load the tree and get the feature id of the first node
tree = loaded_model.tree_
f = tree.feature[0]

# Taken from pangolin/pangolearn/pangolearn.py lines 135, 139-140
# Unless I'm much mistaken this takes a feature id and converts it into
# position_base (seems to be used as column headings for the dataframe that
# is passed to the model prediction)
indiciesToKeep = model_headers[1:]
categories = ['-','A', 'C', 'G', 'T']
columns = [f"{i}_{c}" for i in indiciesToKeep for c in categories]

# Prints the "column" id of the first feature
# This is 16175_C for pangloLEARN commit
# 34f420f44cc03ae059518516c508185bf2b67d91
print(columns[f])

This clearly suggests the feature is actually whether or not 16175 is equal to C (not A).

Looking at a few of these you see the pattern emerge that I mention in my first post. I've not yet been able to track down the actual bug in the code as I don't know the code base well enough.

@niemasd
Copy link

niemasd commented Aug 25, 2021

@emilyscher Are you sure the incides in tree_rules.txt within decision_tree_rules.zip are 1-based? The script I wrote can be found here:

https://github.com/niemasd/CovidQuant/blob/main/CovidQuant.py

All throughout, I'm using 0-based indices for genome positions, but I have a mapping where, if tree_rules.txt says A, I interpret that as C; if tree_rules.txt says C, I interpret that as G; etc.:

https://github.com/niemasd/CovidQuant/blob/main/CovidQuant.py#L25-L31

Here's where I actually use this dictionary:

https://github.com/niemasd/CovidQuant/blob/main/CovidQuant.py#L117

Here's the consensus sequence of the sample I'm playing with (my script is attempting to do some things purely from trimmed BAMs, but the issue can be reproduced with the consensus sequence):

B.1.1.222.txt (Pangolin assigns it to B.1.1.222)

Note that the BAM I'm using in the examples below are mapped to the reference genome that Pangolin uses:

https://github.com/cov-lineages/pangolin/blob/master/pangolin/data/reference.fasta

In the examples below, each line BEST CHOICE: (A, (B, C, D)) can be interpreted as follows:

  • A is the number of reads that support that decision
  • (B, C, D) is the PangoLEARN decision tree rule at that level
    • B is the index from the PangoLEARN decision tree rules
    • C is the comparison from the PangoLEARN decision tree rules (== or !=)
    • D is the symbol from the PangoLEARN decision tree rules
  • Example: BEST CHOICE: (6172, (16175, '!=', 'A')) means that the best choice out of the decisions at the current level was 16175!='A', which had 6172 reads supporting it

If I assume 0-based indexing and take the tree_rules.txt symbols at face value (incorrectly yields B.1.596)

Here's the modified script: CovidQuant_1.py

BEST CHOICE: (6170, (16175, '!=', 'A'))
BEST CHOICE: (1123, (18423, '!=', '-'))
BEST CHOICE: (3409, (23591, '==', 'A'))
[2021-08-25 07:12:30] Best lineage: B.1.596

If I assume 1-based indexing and take the tree_rules.txt symbols at face value (incorrectly yields B.1.596)

Here's the modified script: CovidQuant_2.py

BEST CHOICE: (6172, (16175, '!=', 'A'))
BEST CHOICE: (1123, (18423, '!=', '-'))
BEST CHOICE: (3411, (23591, '!=', 'A'))
BEST CHOICE: (1833, (26304, '!=', 'C'))
BEST CHOICE: (490, (25243, '!=', 'G'))
BEST CHOICE: (1244, (11112, '!=', 'G'))
BEST CHOICE: (4159, (25562, '!=', 'C'))
BEST CHOICE: (854, (14023, '==', 'G'))
[2021-08-25 07:18:54] Best lineage: B.1.596

If I assume 0-based indexing and rotate the symbols in tree_rules.txt (correctly yields B.1.1.222)

By "rotate the symbols", I mean the following:

  • If tree_rules.txt says -, I interpret that as A
  • If tree_rules.txt says A, I interpret that as C
  • If tree_rules.txt says C, I interpret that as G
  • If tree_rules.txt says G, I interpret that as T
  • If tree_rules.txt says T, I interpret that as -

Here's the modified script: CovidQuant_3.py (which is equivalent to CovidQuant.py, but with additional print statements to show the decisions)

BEST CHOICE: (6173, (16175, '!=', 'C'))
BEST CHOICE: (1123, (18423, '==', 'A'))
BEST CHOICE: (346, (21986, '!=', 'A'))
BEST CHOICE: (1908, (26800, '!=', 'G'))
BEST CHOICE: (5816, (28882, '==', 'C'))
BEST CHOICE: (530, (23400, '!=', 'A'))
BEST CHOICE: (4838, (28168, '!=', 'G'))
BEST CHOICE: (1059, (5621, '!=', 'T'))
BEST CHOICE: (109, (24087, '==', 'T'))
BEST CHOICE: (1744, (23730, '!=', 'T'))
BEST CHOICE: (571, (5699, '!=', 'A'))
BEST CHOICE: (1979, (9071, '!=', 'T'))
BEST CHOICE: (416, (2341, '!=', 'G'))
BEST CHOICE: (5279, (15971, '==', 'A'))
BEST CHOICE: (30, (27298, '!=', 'C'))
BEST CHOICE: (343, (23755, '==', 'G'))
BEST CHOICE: (1173, (11116, '!=', 'G'))
BEST CHOICE: (5826, (28876, '==', 'A'))
BEST CHOICE: (9765, (27993, '==', 'A'))
BEST CHOICE: (346, (24196, '!=', 'T'))
BEST CHOICE: (1144, (25349, '!=', '-'))
BEST CHOICE: (2212, (14768, '!=', 'C'))
BEST CHOICE: (144, (4257, '==', 'T'))
BEST CHOICE: (5962, (27505, '==', 'G'))
BEST CHOICE: (3759, (7524, '!=', 'G'))
BEST CHOICE: (108, (25905, '!=', 'T'))
BEST CHOICE: (23, (21573, '!=', 'C'))
[2021-08-25 07:26:26] Best lineage: B.1.1.222

TL;DR

I'm fairly certain that tree_rules.txt within decision_tree_rules.zip is indeed using 0-based indices for genome positions, but that the symbols in the rules are incorrectly rotated by 1

@danielmoney
Copy link
Author

Based on @niemasd comments it would seem I may well be wrong in my assumption that "-" would also have the wrong genome position (although I did say I'd never tested it).

@niemasd
Copy link

niemasd commented Aug 25, 2021

@danielmoney In this specific sample I'm playing with, there happen to not be any gaps at any of the positions in the decision tree conditions that are reached, nor +/- 1 position over (to account for off-by-one in indices), so I think the - corner case is simply not being hit in this specific dataset I'm using. So I'm not sure if your assumption is right or wrong based solely on my dataset (if I'm understanding what you mean)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants