duplicate human entries in the GO database #2021
Comments
From @ValWood on June 29, 2018 11:30 FYI @Antonialock |
I'm not quite sure what the question here is. For example, "GABARAP" is a symbol coming in from UniProt, MGI, and RGD. Symbols are quite often duplicates, which is why many services use namespaced identifiers or filtering mechanisms to isolate the actual entity they want. |
From @ValWood on July 2, 2018 20:7 These are duplicate human entries. |
@ValWood You'd consider the above to be the same entity? |
From @ValWood on July 2, 2018 20:56 Yes, https://www.uniprot.org/uniprot/O95166 Some might be exact duplicates at different loci (calmodulin, histones, and elongation factors), but these should be distinguished by having different names. I think it's a question for UniProt... |
From @ValWood on July 2, 2018 20:58 So MED17, there is only one copy in the human genome, but we have http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A1W2PRB8 http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9NVC6 |
From @ValWood on July 2, 2018 21:2 I know it is only a small number (now), it is much improved, but we should find out what the problem is with the ingest that makes this possible. Presumably there is only one entry in reference proteomes. It's really really important to represent the human proteome uniquely and correctly for analysis. |
From @cmungall on July 2, 2018 21:58 I think you meant this as the 2nd URL http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9NVC6 (which is the correct one) This is how we get it from GOA:
e.g.
@tonysawfordebi and @alexsign can you take a look (I assigned you Alex, but Tony can reassign to himself if appropriate) @dougli1sqrd and @pgaudet should we have a soft check / warning for >1 ID with the same symbol in a species? It would be useful to have this kind of reporting information up-front. |
From @ValWood on July 2, 2018 22:4
yes please, that would be a useful QC check |
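A minimal sketch of the soft check proposed above (more than one ID with the same symbol in a species). This is only an illustration, not the pipeline's actual implementation: the function name `find_ambiguous_symbols` and the `(gene product ID, symbol, taxon)` record shape are hypothetical, the MGI and RGD IDs are placeholders, and symbols are compared case-sensitively as an assumption.

```python
from collections import defaultdict


def find_ambiguous_symbols(records):
    """Map (taxon, symbol) to its sorted IDs when a symbol has >1 ID in that taxon.

    `records` is an iterable of (gene_product_id, symbol, taxon) tuples.
    This record shape is an assumption made for illustration only.
    """
    ids_by_key = defaultdict(set)
    for gp_id, symbol, taxon in records:
        # Group per species, so a symbol shared across species is not flagged.
        ids_by_key[(taxon, symbol)].add(gp_id)
    return {key: sorted(ids) for key, ids in ids_by_key.items() if len(ids) > 1}


# Example records. The MED17 and human GABARAP IDs come from this thread;
# the MGI and RGD IDs are placeholders, not real accessions.
records = [
    ("UniProtKB:Q9NVC6", "MED17", "taxon:9606"),
    ("UniProtKB:A0A1W2PRB8", "MED17", "taxon:9606"),
    ("UniProtKB:O95166", "GABARAP", "taxon:9606"),
    ("MGI:placeholder", "GABARAP", "taxon:10090"),
    ("RGD:placeholder", "GABARAP", "taxon:10116"),
]

# Emit a warning rather than failing, since this is meant as a soft check.
for (taxon, symbol), ids in sorted(find_ambiguous_symbols(records).items()):
    print(f"WARNING: {symbol} ({taxon}) maps to {len(ids)} IDs: {', '.join(ids)}")
```

With these records only MED17 in human would be reported. GABARAP is not flagged because each of its IDs belongs to a different species.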
From @selewis on July 3, 2018 3:30 @chris Mungall
-S |
From @ValWood on July 3, 2018 6:5 found it |
From @tonysawfordebi on July 3, 2018 7:35 According to the data that we get from UniProt, both Q9NVC6 (Swiss-Prot) and A0A1W2PRB8 (TrEMBL) are canonical entries in the human GCRP, and they both have MED17 as the gene name, which doesn't seem right. I'll raise this with UniProt. |
From @ValWood on July 3, 2018 7:41 Proteins are nearly good; for RNAs, the same issues are looming. |
From @tonysawfordebi on July 3, 2018 7:57 I just checked the list of genes from the top of this thread, and this is what I found:
So, it appears that there's no ambiguity as far as CALM1, EIF3F, NSG1, and SUPT3H are concerned (there's only one canonical entry in the GCRP), but for the others there definitely appears to be something amiss (particularly MUC21). |
From @ValWood on July 3, 2018 8:1 For CALM1 in AmiGO I see http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P0DP23 |
From @ValWood on July 3, 2018 8:3 For NSG1 |
From @tonysawfordebi on July 3, 2018 8:7 This is what we have (taken from UniProt):
|
From @ValWood on July 3, 2018 8:8 actually these are different proteins, with the same name. How does that happen? http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P42857 |
From @ValWood on July 3, 2018 8:15 CALM1 https://www.uniprot.org/uniprot/P62158 is obsolete |
From @cmungall on July 3, 2018 16:27
In this case the problem is in the inputs to the pipeline rather than the pipeline itself, and Tony is already on it. In general I would say the go-annotations tracker is good for coordinating with any contributing group about their annotations. The helpdesk is always fine for triaging, and I suggest keeping this discussion here to avoid breaking history. |
From @dougli1sqrd on July 3, 2018 18:18 @cmungall Yeah we could do something like that. So, the check would have to find different gene product ids that have the same label? Do we have labels in the RDF for gene products? I can take a look. |
@cmungall As whole tickets are ported, including comments, there would be no break in history. @dougli1sqrd symbols are not globally unique; in fact, occasionally not locally either--it may be worth asking whether there should be a mechanism per-species. |
Looks like I won't be moving it: google/github-issue-mover#128 (comment) |
Looks like discussion has continued in original ticket, so closing this rather than forking discussion |
I think this is fixed? |
From @ValWood on June 29, 2018 11:29
We still have duplicate entries in the GO database, which makes analyses difficult.
These 15 identifiers were found to be ambiguous: ATP6AP2 CALM1 EIF3F GABARAP HIST1H2AI HOXD4 IDS KLK9 MED17 MUC21 NSG1 PI4K2B SUPT3H TMSB15B TRAPPC2L
ATP6AP2
and an unreviewed TrEMBL entry?
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A1C7CYW4
and the annotated Swiss-Prot entry
https://www.uniprot.org/uniprot/O75787
CALM1, the UniProt entry
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P0DP23
and an unannotated PR? entry
http://amigo.geneontology.org/amigo/term/PR:000004978
Can we get rid of the duplicates?
How do they get in?
Copied from original issue: geneontology/helpdesk#139