Node synonymizer not a superset of SRI node normalizer #2408

Open
dkoslicki opened this issue Oct 30, 2024 · 7 comments

@dkoslicki
Member

As @stephwon has been investigating how to integrate Gwenlyn's microbiomeKG with our metagenomicsKG (derived from RTX-KG2), the following issue came up again: mismatches between the SRI node normalizer (NN) and our node synonymizer (NS). In particular, we have a number of examples of CURIEs that the NN recognizes but the NS does not. These include (thanks Steph):

NCBIGene:114081710
NCBIGene:125173501
EC:3.1.12.1
EC:1.6.1.1
PANTHER.FAMILY:PTHR10221
PANTHER.FAMILY:PTHR11157:SF19
RHEA:18397
FB:FBgn0002557
UNII:2BJ58BX78W
UNII:4796L35A70
PANTHER.PATHWAY:P02728
PANTHER.PATHWAY:P02772
GTOPDB:4998
GTOPDB:9606
EUPATH:0009259
DRUGBANK:DB14800
HGNC.FAMILY:1269

A number of these are for nodes that are not in KG2 proper (e.g., non-human orthologs). But others are just absent from the NS (compare what the NN returns for GTOPDB:4998 with what the NS returns at arax.ci.transltr.io and its /entity endpoint).

At the 10/30/2024 AHM, the suggestion was made to fail over to the NN if the NS doesn't know about a particular CURIE. This would definitely lead to more completeness, but might hide some opportunities for nuanced conflation (e.g., do we want to distinguish between non-human and human orthologs?).

Thoughts?

@chunyuma
Collaborator

Just pointing out a potential issue; I would like to hear @amykglen's professional insight on it:

Currently, KG2c's preferred node is determined by the node synonymizer. If we eventually plan to merge all contents of the NN into the NS, we need to make sure that the preferred node the NS returns for a bioentity is in KG2. For example, the node entities that the NN recognizes but the NS does not are probably missing because they are not in KG2. If they have synonyms in KG2, I think we should use those synonyms as the preferred curie in the NS.

@amykglen
Member

so I think these curies are not recognized by the NS because they are both 1) not in KG2 and 2) are not equivalent to a curie in KG2 (according to the NN). in other words, the NS includes all curies in the NN that are in clusters that intersect with KG2. or, in even different words, the NS is a superset of the NN, but only for clusters involving nodes in KG2.
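The coverage rule described above can be sketched with plain sets. This is only an illustration under stated assumptions: `ns_nn_coverage`, the toy clusters, and the toy KG2 contents are hypothetical, it models only the NN-derived part of the NS, and it is not the actual NS build code.

```python
from typing import Iterable, Set


def ns_nn_coverage(nn_clusters: Iterable[Set[str]], kg2_curies: Set[str]) -> Set[str]:
    """Return every curie from NN clusters that share at least one curie with KG2.

    Hypothetical helper: an NN cluster is modeled as a set of equivalent curies.
    """
    covered: Set[str] = set()
    for cluster in nn_clusters:
        if cluster & kg2_curies:  # cluster intersects KG2
            covered |= cluster    # so the whole cluster is carried into the NS
    return covered


# Toy data (assumed for illustration only).
nn_clusters = [
    {"CURIE:A", "CURIE:X"},  # intersects KG2 via CURIE:A
    {"GTOPDB:4998"},         # no member in KG2, so it is left out
]
kg2_curies = {"CURIE:A"}

covered = ns_nn_coverage(nn_clusters, kg2_curies)
print(sorted(covered))
```

Under this rule a curie such as GTOPDB:4998 is skipped whenever none of the curies in its NN cluster appear in KG2.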

we have always limited the NS to KG2's contents in some fashion, because the NN is kind of huge - 550 million curies last I checked (250 million of those are Proteins and 200 million are SmallMolecules). if we wanted to expand the NS to include all curies in the NN, it'd take some experimentation - this would greatly slow down the NS's build process and result in a much larger NS sqlite file. I'm thinking it'd probably be better to just query the NN API in realtime for curies that the NS doesn't recognize (in my experience their API is very fast and reliable).

to @chunyuma's point - we're required by Translator to use the NN's preferred curies, so the way we currently handle this situation is like this: say CURIE:A is present in KG2pre but the preferred curie for its cluster according to the NN is CURIE:X, which is not present in KG2pre. in this situation the NS still extracts CURIE:X from the NN and reports the preferred curie for this cluster as CURIE:X, which means that CURIE:X becomes the identifier for this node in KG2c (since the KG2c build uses the NS to determine the preferred ids and equivalent_curies), even though CURIE:X wasn't present in KG2pre. that make sense? so basically KG2c node ids are always the NN's preferred identifiers, for clusters overlapping with the NN.
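As a rough sketch of the CURIE:A / CURIE:X case above: the mapping `nn_preferred` and the function `kg2c_node_id` are hypothetical stand-ins, not the real NS or KG2c build interfaces. Returning the input curie unchanged for clusters the NN does not know is a simplifying assumption; the thread does not say how the NS picks a preferred curie in that case.

```python
from typing import Dict


def kg2c_node_id(kg2pre_curie: str, nn_preferred: Dict[str, str]) -> str:
    """Pick the KG2c identifier for a KG2pre curie.

    If the NN has a cluster for the curie, use the NN's preferred curie,
    even when that curie is itself absent from KG2pre.
    Otherwise (assumption) keep the original curie.
    """
    return nn_preferred.get(kg2pre_curie, kg2pre_curie)


# Toy NN view: CURIE:A and CURIE:X are in one cluster whose preferred curie is CURIE:X.
nn_preferred = {"CURIE:A": "CURIE:X", "CURIE:X": "CURIE:X"}

print(kg2c_node_id("CURIE:A", nn_preferred))  # the NN cluster's preferred curie
print(kg2c_node_id("CURIE:B", nn_preferred))  # unknown to the NN, kept as-is
```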

@dkoslicki
Member Author

> the NS is a superset of the NN, but only for clusters involving nodes in KG2.

Ah, that explains my confusion, thanks for clarifying!

> I'm thinking it'd probably be better to just query the NN API in realtime for curies that the NS doesn't recognize

Yes, that is what I was/we were thinking too: just use the NN as a fallback, and indicate somewhere that the node is not in KG2.
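A minimal sketch of that fallback, assuming each lookup is a callable that returns a preferred curie or `None`. The names `resolve_curie`, `ns_lookup`, `nn_lookup`, and the `cluster_in_kg2` flag are hypothetical; in practice `nn_lookup` would wrap a call to the NN API, which is stubbed here with a dictionary so the example runs offline, and the preferred curies in the toy data are made up.

```python
from typing import Callable, Dict, Optional

Lookup = Callable[[str], Optional[str]]


def resolve_curie(curie: str, ns_lookup: Lookup, nn_lookup: Lookup) -> Optional[Dict[str, object]]:
    """Try the NS first; fall back to the NN and flag that the node is outside KG2."""
    preferred = ns_lookup(curie)
    if preferred is not None:
        return {"preferred_curie": preferred, "source": "NS", "cluster_in_kg2": True}
    preferred = nn_lookup(curie)
    if preferred is not None:
        # The NS only covers clusters that intersect KG2, so an NN-only hit
        # is marked as not being in KG2.
        return {"preferred_curie": preferred, "source": "NN", "cluster_in_kg2": False}
    return None  # neither service recognizes the curie


# Stand-in lookups with assumed toy contents.
ns_table = {"CURIE:A": "CURIE:X"}
nn_table = {"CURIE:A": "CURIE:X", "GTOPDB:4998": "GTOPDB:4998"}

hit_ns = resolve_curie("CURIE:A", ns_table.get, nn_table.get)
hit_nn = resolve_curie("GTOPDB:4998", ns_table.get, nn_table.get)
miss = resolve_curie("CURIE:UNKNOWN", ns_table.get, nn_table.get)
print(hit_ns, hit_nn, miss)
```

Keeping the flag on the result lets downstream code decide whether to treat NN-only nodes differently, for example when human and non-human orthologs should stay distinct.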

@amykglen
Member

@dkoslicki - I'm a little unclear about this line of yours:

> A number of these are for nodes that are not in KG2 proper (eg. non-human orthologs). But others are just absent from the NS (compare GTOPDB:4998 with the NN and the NS at arax.ci.transltr.io and the /entity endpoint).

from what I can see, GTOPDB:4998 is neither in KG2 nor in the NS. see this page, which corresponds to what is in the NS: https://arax.ci.transltr.io/?term=GTOPDB:4998. (or also this: https://arax.ci.transltr.io/api/arax/v1.4/entity?q=GTOPDB%3A4998) I think this is true for all curies reported here, yeah? (i.e., they are absent from KG2 and the NS) please correct me if I'm wrong!
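For checking several curies against the NS, the /entity URL above can be built programmatically. This sketch only constructs the URL, using the base address quoted in this comment; the helper name `entity_query_url` is hypothetical, and the request itself and the shape of the response are deliberately left out since they are not described in this thread.

```python
from urllib.parse import urlencode

# Base address taken from the link in this comment.
ARAX_ENTITY_URL = "https://arax.ci.transltr.io/api/arax/v1.4/entity"


def entity_query_url(curie: str) -> str:
    """Build the /entity query URL for one curie, percent-encoding the colon."""
    return f"{ARAX_ENTITY_URL}?{urlencode({'q': curie})}"


for curie in ["GTOPDB:4998", "DRUGBANK:DB14800"]:
    print(entity_query_url(curie))
```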

@chunyuma
Collaborator

> we're required by Translator to use the NN's preferred curies, so the way we currently handle this situation is like this: say CURIE:A is present in KG2pre but the preferred curie for its cluster according to the NN is CURIE:X, which is not present in KG2pre. in this situation the NS still extracts CURIE:X from the NN and reports the preferred curie for this cluster as CURIE:X, which means that CURIE:X becomes the identifier for this node in KG2c (since the KG2c build uses the NS to determine the preferred ids and equivalent_curies), even though CURIE:X wasn't present in KG2pre. that make sense? so basically KG2c node ids are always the NN's preferred identifiers, for clusters overlapping with the NN.

Ah, that makes it easier to include the NN in the NS by calling the NN API. It will not affect the NS too much. Thank you for letting us know, Amy!

@dkoslicki
Member Author

> @dkoslicki - I'm a little unclear about this line of yours:
>
> > A number of these are for nodes that are not in KG2 proper (eg. non-human orthologs). But others are just absent from the NS (compare GTOPDB:4998 with the NN and the NS at arax.ci.transltr.io and the /entity endpoint).
>
> from what I can see, GTOPDB:4998 is neither in KG2 nor in the NS. see this page, which corresponds to what is in the NS: https://arax.ci.transltr.io/?term=GTOPDB:4998. (or also this: https://arax.ci.transltr.io/api/arax/v1.4/entity?q=GTOPDB%3A4998) I think this is true for all curies reported here, yeah? (i.e., they are absent from KG2 and the NS) please correct me if I'm wrong!

This came from my misunderstanding: the NS is a superset of the NN only for nodes in KG2. GTOPDB:4998 was an example of a node that is in the NN but not in KG2, which confused me because I thought the NS was a superset of the NN regardless of whether a node is in KG2.

And yes, without having checked all of them, I assume they are all absent from KG2. This, however, doesn't address the subsequent issue that GTOPDB:4998 corresponds to IL-6, which certainly is in KG2. But that seems to be a knowledge source issue, not so much a synonymizer issue.

@amykglen
Member

ah, ok, that makes sense, thanks. yeah, since the NN doesn't cluster GTOPDB:4998 with any other IL-6 curies that are in KG2 (see here), the NS doesn't include GTOPDB:4998.

cool, so is there anything else to do/discuss here, or is this issue good to close?

3 participants