Node synonymizer not a superset of SRI node normalizer #2408
Comments
Just to point out a potential issue, though I would like to hear @amykglen's professional insight on it: Currently, KG2c's preferred node is determined by the node synonymizer. If we eventually plan to merge all content from the NN into the NS, we need to make sure that the preferred node returned by the node synonymizer for a bioentity is in KG2. For example, node entities that the NN recognizes but the NS does not are probably missing because they are not in KG2. If they have synonyms in KG2, I think we should use those KG2 synonyms as the preferred curie in the NS.
so I think these curies are not recognized by the NS because they are both 1) not in KG2 and 2) not equivalent to a curie in KG2 (according to the NN). in other words, the NS includes all curies in the NN that are in clusters that intersect with KG2. or, in even different words, the NS is a superset of the NN, but only for clusters involving nodes in KG2.

we have always limited the NS to KG2's contents in some fashion, because the NN is kind of huge - 550 million curies last I checked (250 million of those are Proteins and 200 million are SmallMolecules). if we wanted to expand the NS to include all curies in the NN, it'd take some experimentation - this would greatly slow down the NS's build process and result in a much larger NS sqlite file. I'm thinking it'd probably be better to just query the NN API in realtime for curies that the NS doesn't recognize (in my experience their API is very fast and reliable).

to @chunyuma's point - we're required by Translator to use the NN's preferred curies, so the way we currently handle this situation is like this: say CURIE:A is present in KG2pre but the preferred curie for its cluster according to the NN is CURIE:X, which is not present in KG2pre. in this situation the NS still extracts CURIE:X from the NN and reports the preferred curie for this cluster as CURIE:X, which means that CURIE:X becomes the identifier for this node in KG2c (since the KG2c build uses the NS to determine the preferred ids and
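To make the cluster logic described above concrete, here is a minimal sketch using toy data. It is not the actual NS build code: the function names `build_ns_clusters` and `preferred_curie_map`, the dictionary layout of `nn_clusters`, and the `CURIE:Y`/`CURIE:Z` cluster are all hypothetical, chosen only to illustrate "keep NN clusters that intersect KG2, and report the NN's preferred curie for each".

```python
from typing import Dict, List, Set


def build_ns_clusters(
    nn_clusters: Dict[str, List[str]], kg2_curies: Set[str]
) -> Dict[str, List[str]]:
    """Keep only the NN clusters that share at least one curie with KG2.

    nn_clusters maps the NN's preferred curie to all members of its cluster
    (hypothetical layout, for illustration only).
    """
    kept = {}
    for preferred, members in nn_clusters.items():
        if kg2_curies.intersection(members):
            kept[preferred] = members
    return kept


def preferred_curie_map(ns_clusters: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every curie in a kept cluster to that cluster's NN-preferred curie."""
    mapping = {}
    for preferred, members in ns_clusters.items():
        mapping[preferred] = preferred
        for member in members:
            mapping[member] = preferred
    return mapping


# Toy data: CURIE:A is in KG2, CURIE:X is the NN-preferred curie of its cluster.
# The second cluster has no member in KG2, so it is dropped entirely.
nn_clusters = {
    "CURIE:X": ["CURIE:X", "CURIE:A"],
    "CURIE:Y": ["CURIE:Y", "CURIE:Z"],
}
kg2_curies = {"CURIE:A"}

ns_clusters = build_ns_clusters(nn_clusters, kg2_curies)
ns_map = preferred_curie_map(ns_clusters)

print(ns_map.get("CURIE:A"))  # CURIE:X (preferred curie need not be in KG2)
print(ns_map.get("CURIE:Z"))  # None (cluster does not intersect KG2)
```

In this toy setup, a curie such as `CURIE:Z` is unknown to the NS even though the NN recognizes it, which is the same situation as the curies reported in this issue.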
Ah, that explains my confusion, thanks for clarifying!
Yes, that is what I was/we were thinking too: just use the NN as a fallback, and indicate somewhere that this node is not in KG2.
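A rough sketch of what that fallback could look like, under several assumptions: the NS lookup is represented by a plain callable, the NN is queried through its public `get_normalized_nodes` endpoint (the URL and the response shape used in `query_nn` are assumptions and should be checked against the NN's own API docs), and `resolve_curie` and the `cluster_in_kg2` flag are hypothetical names for illustration, not existing ARAX code.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Optional

# Assumed NN endpoint; verify against the NN API documentation before use.
NN_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"

Lookup = Callable[[str], Optional[str]]


def query_nn(curie: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the NN for the preferred curie; return None if it is unrecognized.

    Assumes the response is a JSON object keyed by the queried curie, with the
    preferred identifier under ["id"]["identifier"].
    """
    url = NN_URL + "?" + urllib.parse.urlencode({"curie": curie})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    entry = payload.get(curie)
    if not entry:
        return None
    return entry.get("id", {}).get("identifier")


def resolve_curie(curie: str, ns_lookup: Lookup, nn_lookup: Lookup = query_nn) -> Dict:
    """Try the NS first; fall back to the NN and flag that the cluster is not in KG2."""
    preferred = ns_lookup(curie)
    if preferred is not None:
        return {"curie": curie, "preferred": preferred,
                "source": "NS", "cluster_in_kg2": True}
    preferred = nn_lookup(curie)
    if preferred is not None:
        return {"curie": curie, "preferred": preferred,
                "source": "NN", "cluster_in_kg2": False}
    return {"curie": curie, "preferred": None,
            "source": None, "cluster_in_kg2": False}


# Offline usage with stub lookups (placeholder values, not real NN output).
ns_stub = {"CURIE:A": "CURIE:X"}.get
nn_stub = {"GTOPDB:4998": "EXAMPLE:PREFERRED"}.get

print(resolve_curie("CURIE:A", ns_stub, nn_stub)["source"])      # NS
print(resolve_curie("GTOPDB:4998", ns_stub, nn_stub)["source"])  # NN
```

Passing the lookups in as callables keeps the fallback order testable without network access; the `cluster_in_kg2` flag is one possible way to mark nodes that were only resolved via the NN.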
@dkoslicki - I'm a little unclear about this line of yours:
from what I can see, GTOPDB:4998 is neither in KG2 nor in the NS. see this page, which corresponds to what is in the NS: https://arax.ci.transltr.io/?term=GTOPDB:4998. (or also this: https://arax.ci.transltr.io/api/arax/v1.4/entity?q=GTOPDB%3A4998) I think this is true for all curies reported here, yeah? (i.e., they are absent from KG2 and the NS) please correct me if I'm wrong!
Ah, that makes it easier to include the NN in the NS by calling its API. It will not affect the NS too much. Thank you for letting us know, Amy!
This was from my misunderstanding that the NS is only a superset of the NN for those nodes in KG2. And yes, without having checked all of them, I assume they are all absent from KG2. This, however, doesn't address the subsequent issue that |
ah, ok, that makes sense, thanks. yeah, since the NN doesn't cluster cool, so is there anything else to do/discuss here, or is this issue good to close? |
As @stephwon has been investigating how to integrate Gwenlyn's microbiomeKG with our metagenomicsKG (derived from RTX-KG2), the following issue was encountered again: mismatches between the SRI node normalizer (NN) and our node synonymizer (NS). In particular, we have a number of examples of CURIEs that the NN recognizes but the NS does not. These include (thanks Steph):
A number of these are for nodes that are not in KG2 proper (e.g., non-human orthologs). But others are just absent from the NS (compare GTOPDB:4998 with the NN and the NS at arax.ci.transltr.io and the /entity endpoint).
At the 10/30/2024 AHM, the suggestion was made to fail over to the NN if the NS doesn't know about a particular CURIE. This would definitely lead to more completeness, but might hide some opportunities for nuanced conflation (e.g., do we want to distinguish between non-human and human orthologs?).
Thoughts?