duplicate human entries in the GO database #2021

Closed
kltm opened this issue Jul 5, 2018 · 26 comments

@kltm
Member

kltm commented Jul 5, 2018

From @ValWood on June 29, 2018 11:29

We still have duplicate entries in the GO database, which makes analyses difficult.

These 15 identifiers were found to be ambiguous: ATP6AP2 CALM1 EIF3F GABARAP HIST1H2AI HOXD4 IDS KLK9 MED17 MUC21 NSG1 PI4K2B SUPT3H TMSB15B TRAPPC2L

ATP6AP2 has an unreviewed TrEMBL entry(?):
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A1C7CYW4
and the annotated Swiss-Prot entry:
https://www.uniprot.org/uniprot/O75787

CALM1 has the UniProt entry:
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P0DP23
and an unannotated PR(?) entry:
http://amigo.geneontology.org/amigo/term/PR:000004978

Can we get rid of the duplicates?
How do they get in?

Copied from original issue: geneontology/helpdesk#139

@kltm
Member Author

kltm commented Jul 5, 2018

From @ValWood on June 29, 2018 11:30

FYI @Antonialock

@kltm
Member Author

kltm commented Jul 5, 2018

I'm not quite sure what the question here is. For example, "GABARAP" is a symbol coming in from UniProt, MGI, and RGD. Symbols are quite often duplicates, which is why many services use namespaced identifiers or filtering mechanisms to isolate the actual entity they want.

@kltm
Member Author

kltm commented Jul 5, 2018

From @ValWood on July 2, 2018 20:7

These are duplicate human entries.
We should only have one entry per GP in GO.

@kltm
Member Author

kltm commented Jul 5, 2018

@ValWood You'd consider the above to be the same entity?
https://www.uniprot.org/uniprot/O95166
https://www.uniprot.org/uniprot/H6UMI1

@kltm
Member Author

kltm commented Jul 5, 2018

From @ValWood on July 2, 2018 20:56

Yes,

https://www.uniprot.org/uniprot/O95166
https://www.uniprot.org/uniprot/H6UMI1
are the same entity.
Now that I look more closely, one is unreviewed, so it shouldn't get into GO?

Some might be exact duplicates at different loci (calmodulin, histones, and elongation factors), but these should be distinguished by having different names.

I think it's a question for UniProt...

@kltm
Member Author

kltm commented Jul 5, 2018

From @ValWood on July 2, 2018 21:2

I know it is only a small number (now), it is much improved, but we should find out what the problem is with the ingest that makes this possible. Presumably there is only one entry in reference proteomes. It's really really important to represent the human proteome uniquely and correctly for analysis.

@kltm
Member Author

kltm commented Jul 5, 2018

From @cmungall on July 2, 2018 21:58

I think you meant this as the 2nd URL

http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9NVC6

(which is the correct one)

This is how we get it from GOA:

curl -s -L ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz | gzip -dc | grep MED17

e.g.

UniProtKB       A0A1W2PRB8      MED17           GO:0003712      GO_REF:0000002  IEA     InterPro:IPR019313      F       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17     protein taxon:9606      20180616        InterPro
UniProtKB       A0A1W2PRB8      MED17           GO:0006351      GO_REF:0000038  IEA     UniProtKB-KW:KW-0804    P       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17     protein taxon:9606      20180616        UniProt         
UniProtKB       A0A1W2PRB8      MED17           GO:0006357      GO_REF:0000002  IEA     InterPro:IPR019313      P       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17     protein taxon:9606      20180616        InterPro                
UniProtKB       A0A1W2PRB8      MED17           GO:0016592      GO_REF:0000002  IEA     InterPro:IPR019313      C       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17     protein taxon:9606      20180616        InterPro                
UniProtKB       Q9NVC6  MED17           GO:0003712      PMID:10198638   IDA             F       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80    protein taxon:9606      20030822        UniProt         
UniProtKB       Q9NVC6  MED17           GO:0003712      PMID:12218053   IDA             F       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80    protein taxon:9606      20030822        UniProt         
UniProtKB       Q9NVC6  MED17           GO:0003713      PMID:12037571   IDA             F       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80    protein taxon:9606      20101104        MGI             
[snip]

@tonysawfordebi and @alexsign can you take a look (I assigned you Alex, but Tony can reassign to himself if appropriate)

@dougli1sqrd and @pgaudet should we have a soft check / warning for >1 ID with the same symbol in a species? It would be useful to have this kind of reporting information up-front.
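
One way such a soft check could look, as a rough sketch rather than the actual pipeline code: group GAF rows by taxon and symbol, then report any symbol that maps to more than one `DB:ID`. The function name `ambiguous_symbols` is hypothetical. The column positions assume the GAF 2.x layout (DB, DB Object ID, and DB Object Symbol in columns 1 to 3, Taxon in column 13). The two sample rows are taken from the MED17 output above.

```python
from collections import defaultdict


def ambiguous_symbols(gaf_lines):
    """Map (taxon, symbol) to its sorted DB:ID list, for symbols with >1 ID in a taxon."""
    ids_by_key = defaultdict(set)
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines (they start with "!").
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # malformed row; a real check would probably report it
        db, obj_id, symbol = cols[0], cols[1], cols[2]
        # The taxon column may hold "taxon:A|taxon:B"; the first one is the organism.
        taxon = cols[12].split("|")[0]
        ids_by_key[(taxon, symbol)].add(f"{db}:{obj_id}")
    return {key: sorted(ids) for key, ids in ids_by_key.items() if len(ids) > 1}


# Two of the MED17 rows shown above, rebuilt as tab-separated GAF lines.
sample_rows = [
    "\t".join([
        "UniProtKB", "A0A1W2PRB8", "MED17", "", "GO:0003712", "GO_REF:0000002",
        "IEA", "InterPro:IPR019313", "F",
        "Mediator of RNA polymerase II transcription subunit 17",
        "MED17|MED17", "protein", "taxon:9606", "20180616", "InterPro",
    ]),
    "\t".join([
        "UniProtKB", "Q9NVC6", "MED17", "", "GO:0003712", "PMID:10198638",
        "IDA", "", "F",
        "Mediator of RNA polymerase II transcription subunit 17",
        "MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80",
        "protein", "taxon:9606", "20030822", "UniProt",
    ]),
]

for (taxon, symbol), ids in ambiguous_symbols(sample_rows).items():
    print(f"WARNING: {symbol} ({taxon}) has {len(ids)} IDs: {', '.join(ids)}")
```

Keying on the taxon as well as the symbol keeps the check per-species, so the same symbol coming from human, mouse, and rat sources would not be flagged.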

@kltm
Member Author

kltm commented Jul 5, 2018

From @ValWood on July 2, 2018 22:4

> should we have a soft check / warning for >1 ID with the same symbol in a species?

Yes please, that would be a useful QC check.

@kltm
Member Author

kltm commented Jul 5, 2018

From @selewis on July 3, 2018 3:30

@chris Mungall
Not sure if this is relevant, but I know in PANTHER there is a many2many relationship between genes and proteins (though never both at once, just >1 gene to 1 protein, or >1 protein to a gene)

-S

@kltm
Member Author

kltm commented Jul 5, 2018

From @ValWood on July 3, 2018 6:5

found it
pantherdb/db-PAINT#1

@kltm
Member Author

kltm commented Jul 5, 2018

From @tonysawfordebi on July 3, 2018 7:35

According to the data that we get from UniProt, both Q9NVC6 (Swiss-Prot) and A0A1W2PRB8 (TrEMBL) are canonical entries in the human GCRP, and they both have MED17 as the gene name, which doesn't seem right. I'll raise this with UniProt.

@kltm
Member Author

kltm commented Jul 5, 2018

From @ValWood on July 3, 2018 7:41

Proteins are nearly good; for RNAs, the same issues are looming:
geneontology/amigo#511
I filed this on the AmiGO tracker, because that's where I saw the problem. But it's clearly the wrong place.
Who would be the correct person for this part of the pipeline?

@kltm
Member Author

kltm commented Jul 5, 2018

From @tonysawfordebi on July 3, 2018 7:57

I just checked the list of genes from the top of this thread, and this is what I found:

| Gene Name | Entry | Type |
| --- | --- | --- |
| ATP6AP2 | O75787 | Swiss-Prot |
| ATP6AP2 | A0A1C7CYW4 | TrEMBL |
| CALM1 | P0DP23 | Swiss-Prot |
| EIF3F | O00303 | Swiss-Prot |
| GABARAP | O95166 | Swiss-Prot |
| GABARAP | H6UMI1 | TrEMBL |
| HOXD4 | P09016 | Swiss-Prot |
| HOXD4 | A0A087WSZ3 | TrEMBL |
| IDS | P22304 | Swiss-Prot |
| IDS | B3KWA1 | TrEMBL |
| KLK9 | Q9UKQ9 | Swiss-Prot |
| KLK9 | Q2XQG4 | TrEMBL |
| MED17 | Q9NVC6 | Swiss-Prot |
| MED17 | A0A1W2PRB8 | TrEMBL |
| MUC21 | Q5SSG8 | Swiss-Prot |
| MUC21 | A0A0G2JKD1 | TrEMBL |
| MUC21 | A0A140T8X8 | TrEMBL |
| NSG1 | P42857 | Swiss-Prot |
| PI4K2B | Q8TCG2 | Swiss-Prot |
| PI4K2B | G5E9Z4 | TrEMBL |
| SUPT3H | O75486 | Swiss-Prot |
| TMSB15B | P0CG35 | Swiss-Prot |
| TMSB15B | A0A087X1C1 | TrEMBL |
| TRAPPC2L | Q9UL33 | Swiss-Prot |
| TRAPPC2L | H3BP13 | TrEMBL |

So, it appears that there's no ambiguity as far as CALM1, EIF3F, NSG1, and SUPT3H are concerned (there's only one canonical entry in the GCRP), but for the others there definitely appears to be something amiss (particularly MUC21).
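
The split between unambiguous and ambiguous genes can be re-derived from the table with a short grouping script. This is only an illustrative sketch: the rows are copied from the table above, and the function name `split_by_ambiguity` is hypothetical, not part of any GO or UniProt tooling.

```python
from collections import defaultdict

# (gene name, entry, type) rows copied from the table in this comment.
ROWS = [
    ("ATP6AP2", "O75787", "Swiss-Prot"), ("ATP6AP2", "A0A1C7CYW4", "TrEMBL"),
    ("CALM1", "P0DP23", "Swiss-Prot"),
    ("EIF3F", "O00303", "Swiss-Prot"),
    ("GABARAP", "O95166", "Swiss-Prot"), ("GABARAP", "H6UMI1", "TrEMBL"),
    ("HOXD4", "P09016", "Swiss-Prot"), ("HOXD4", "A0A087WSZ3", "TrEMBL"),
    ("IDS", "P22304", "Swiss-Prot"), ("IDS", "B3KWA1", "TrEMBL"),
    ("KLK9", "Q9UKQ9", "Swiss-Prot"), ("KLK9", "Q2XQG4", "TrEMBL"),
    ("MED17", "Q9NVC6", "Swiss-Prot"), ("MED17", "A0A1W2PRB8", "TrEMBL"),
    ("MUC21", "Q5SSG8", "Swiss-Prot"), ("MUC21", "A0A0G2JKD1", "TrEMBL"),
    ("MUC21", "A0A140T8X8", "TrEMBL"),
    ("NSG1", "P42857", "Swiss-Prot"),
    ("PI4K2B", "Q8TCG2", "Swiss-Prot"), ("PI4K2B", "G5E9Z4", "TrEMBL"),
    ("SUPT3H", "O75486", "Swiss-Prot"),
    ("TMSB15B", "P0CG35", "Swiss-Prot"), ("TMSB15B", "A0A087X1C1", "TrEMBL"),
    ("TRAPPC2L", "Q9UL33", "Swiss-Prot"), ("TRAPPC2L", "H3BP13", "TrEMBL"),
]


def split_by_ambiguity(rows):
    """Return (genes with exactly one entry, {gene: entries} for genes with several)."""
    entries_by_gene = defaultdict(list)
    for gene, entry, _entry_type in rows:
        entries_by_gene[gene].append(entry)
    unique = sorted(g for g, e in entries_by_gene.items() if len(e) == 1)
    ambiguous = {g: e for g, e in entries_by_gene.items() if len(e) > 1}
    return unique, ambiguous


unique, ambiguous = split_by_ambiguity(ROWS)
print("single entry:", ", ".join(unique))
for gene, entries in sorted(ambiguous.items()):
    print(f"{gene}: {len(entries)} entries ({', '.join(entries)})")
```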

@kltm
Member Author

kltm commented Jul 5, 2018

From @tonysawfordebi on July 3, 2018 8:7

This is what we have (taken from UniProt):

| Gene | Entry | Type | Proteome | Canonical Entry |
| --- | --- | --- | --- | --- |
| CALM1 | P0DP23 | Swiss-Prot | Canonical | |
| CALM1 | B4DJ51 | TrEMBL | none | |
| CALM1 | G3V479 | TrEMBL | Isoform | P0DP23 |
| CALM1 | E7ETZ0 | TrEMBL | Isoform | P0DP23 |
| CALM1 | Q96HY3 | TrEMBL | Isoform | P0DP23 |
| CALM1 | M0QZ52 | TrEMBL | Isoform | P0DP23 |
| CALM1 | G3V226 | TrEMBL | Isoform | P0DP23 |
| CALM1 | G3V361 | TrEMBL | Isoform | P0DP23 |
| EIF3F | O00303 | Swiss-Prot | Canonical | |
| EIF3F | A0A1W2PP79 | TrEMBL | Isoform | O00303 |
| EIF3F | E9PQV8 | TrEMBL | Isoform | O00303 |
| EIF3F | B3KSH1 | TrEMBL | none | |
| EIF3F | B4DMT5 | TrEMBL | none | |
| EIF3F | H0YDT6 | TrEMBL | Isoform | O00303 |
| NSG1 | P42857 | Swiss-Prot | Canonical | |
| NSG1 | A0A0A6YYJ2 | TrEMBL | Isoform | P42857 |
| SUPT3H | O75486 | Swiss-Prot | Canonical | |
| SUPT3H | Q5VWT9 | TrEMBL | Isoform | O75486 |
| SUPT3H | B4E1H0 | TrEMBL | Isoform | O75486 |
| SUPT3H | Q5U608 | TrEMBL | none | |
| SUPT3H | A0A024RD67 | TrEMBL | none | |

@kltm
Member Author

kltm commented Jul 5, 2018

From @ValWood on July 3, 2018 8:8

Actually, these are different proteins with the same name. How does that happen?
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A0A6YYJ2
Neuron-specific protein family member 1

http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P42857
Neuronal vesicle trafficking-associated protein 1

@kltm
Member Author

kltm commented Jul 5, 2018

From @ValWood on July 3, 2018 8:15

CALM1 https://www.uniprot.org/uniprot/P62158 is obsolete

@kltm
Member Author

kltm commented Jul 5, 2018

From @cmungall on July 3, 2018 16:27

> I filed this on the AmiGO tracker, because that's where I saw the problem. But it's clearly the wrong place. Who would be the correct person for this part of the pipeline?

In this case it's the inputs to the pipeline rather than the pipeline itself, and Tony is already on it. In general, I would say the go-annotations tracker is good for coordinating with any contributing group about their annotations. The helpdesk is always fine for triaging, and I suggest keeping this discussion here to avoid breaking history.

@kltm
Member Author

kltm commented Jul 5, 2018

From @dougli1sqrd on July 3, 2018 18:18

@cmungall Yeah we could do something like that. So, the check would have to find different gene product ids that have the same label? Do we have labels in the RDF for gene products? I can take a look.

@kltm
Member Author

kltm commented Jul 5, 2018

@cmungall As whole tickets are ported, including comments, there would be no break in history.

@dougli1sqrd symbols are not globally unique; in fact, occasionally not locally either--it may be worth asking whether there should be a mechanism per-species.

@kltm
Member Author

kltm commented Jul 5, 2018

Looks like I won't be moving it: google/github-issue-mover#128 (comment)

@geneontology geneontology deleted a comment from kltm Jul 5, 2018
@geneontology geneontology deleted a comment from kltm Jul 5, 2018
@cmungall
Member

Looks like discussion has continued in original ticket, so closing this rather than forking discussion

@pgaudet
Contributor

pgaudet commented Feb 10, 2022

I think this is fixed?

@pgaudet pgaudet closed this as completed Feb 10, 2022