Filter duplicate modifications from UniProt #1223
We should filter UniProt annotations if everything is the same for a specific gene/publication except the evidence code, because these will all be duplicates. |
I have put this as high priority because once that is done I will share with UniProt (and can describe the extent of the overlap) |
Just to check: We should only filter modifications from UniProt if the gene, term ID and publication are the same. Have I got that right? Should we also filter UniProt annotations where there is a PomBase modification and the UniProt modification doesn't have a publication/reference? |
Sometimes there is a PomBase extension (like "present during cellular response to thiabendazole") but otherwise the annotation is identical from UniProt. In those cases the PomBase annotation is more specific so we can remove the UniProt annotation? Note to self: the summary is that we can ignore extensions when looking for duplicates. |
That's correct. If the evidence is different, but the paper and everything else is the same, we will filter it. These are cases where a different evidence code was selected for the same experiment, and I have queried this with UniProt (they have used a manual code for an HTP experiment, for example) |
Absolutely! |
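For reference, the rule agreed above can be sketched roughly as below. This is only an illustration, not the loader code: the `Modification` record, its field names and the `filter_uniprot_duplicates` helper are all hypothetical, and the gene and term values are dummies. It assumes the answer to the "no publication" question is yes (a UniProt modification without a reference is dropped when PomBase has the same gene and term). The thread only names gene, term ID and publication as the key, so nothing else (such as residue position) is compared here.

```python
from typing import List, NamedTuple, Optional


class Modification(NamedTuple):
    """Hypothetical flat record for one modification annotation."""
    gene: str
    term_id: str
    pmid: Optional[str]   # None when the source gives no publication
    evidence: str         # ignored when looking for duplicates
    extension: str        # ignored when looking for duplicates
    source: str           # "PomBase" or "UniProt"


def filter_uniprot_duplicates(mods: List[Modification]) -> List[Modification]:
    """Drop UniProt rows that duplicate a PomBase row; PomBase takes priority."""
    pombase_with_pub = {(m.gene, m.term_id, m.pmid)
                        for m in mods if m.source == "PomBase"}
    pombase_any_pub = {(m.gene, m.term_id)
                       for m in mods if m.source == "PomBase"}
    kept = []
    for m in mods:
        if m.source == "UniProt":
            if m.pmid is None:
                # Assumption: no reference + matching PomBase gene/term => duplicate
                if (m.gene, m.term_id) in pombase_any_pub:
                    continue
            elif (m.gene, m.term_id, m.pmid) in pombase_with_pub:
                continue
        kept.append(m)
    return kept


example = [
    Modification("geneA", "MOD:X", "PMID:18257517", "evidence_1",
                 "present during cellular response to thiabendazole", "PomBase"),
    Modification("geneA", "MOD:X", "PMID:18257517", "evidence_2", "", "UniProt"),
    Modification("geneA", "MOD:X", None, "evidence_2", "", "UniProt"),
    Modification("geneB", "MOD:X", "PMID:18257517", "evidence_2", "", "UniProt"),
]
for m in filter_uniprot_duplicates(example):
    print(m.source, m.gene, m.term_id, m.pmid)
```

Only the PomBase `geneA` row and the UniProt `geneB` row survive in this toy input, because evidence and extension are left out of the comparison.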
593 of the duplicates are modifications from PMID:18257517. There is one duplicate from PMID:12135491, which is the one with Unknown evidence in the previous comment. And the remaining 45 duplicates don't have a PMID in the UniProt data. I'm still checking to make sure that's all correct. |
I got that wrong. There are 24 that don't have a PMID. |
This adds the code to delete the redundant feature_cvterm rows. Refs #1223
Delete modifications from UniProt if there is an existing PomBase modification annotation. Refs pombase/pombase-chado#1223
It's weird that UniProt only have 593 from PMID:18257517. (we have 1010). @Antonialock can you think of a reason why UniProt might only import a subset of modifications from a publication? |
From the abstract I don't know why our input file has only 1194 proteins when there were 2887 unique sites with low FP rate. |
UniProt have 1640 in total from PMID:18257517 and 593 are duplicates. That's very odd because it would mean PomBase and UniProt have about 1000 unique annotations each from PMID:18257517. I'll dig into that because that sounds like my code is nonsense. :-) |
It's 2233 not 1640. I should go to bed. :-)
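A quick sanity check of the overlap arithmetic with the corrected total, as a sketch only. It assumes the 593 duplicates pair up one-to-one between the two sources, which the thread doesn't confirm, and `source_only_counts` is just a helper name made up for this example. The inputs are the figures quoted above: 2233 UniProt annotations, 1010 PomBase annotations and 593 duplicates for PMID:18257517.

```python
def source_only_counts(total_a: int, total_b: int, shared: int) -> tuple:
    """Return (a_only, b_only), assuming each shared annotation counts once per source."""
    if shared > min(total_a, total_b):
        raise ValueError("shared count cannot exceed either total")
    return total_a - shared, total_b - shared


# Figures quoted in the thread for PMID:18257517
uniprot_only, pombase_only = source_only_counts(2233, 1010, 593)
print("UniProt only:", uniprot_only)   # 2233 - 593
print("PomBase only:", pombase_only)   # 1010 - 593
```

Under that assumption the UniProt-only count is 1640 and the PomBase-only count is 417, so the split is less symmetric than "about 1000 each".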
That bit is correct (I think). |
I don't think the dataset can have been fully parsed for the Chado ingest |
Yes go to bed! |
|
I added the step to remove duplicate modifications to the load script for last night. |
I've had a look at the paper and the data table in the supplementary information. I can't work out how the UniProt annotations or the PomBase annotations were extracted from the data. |
I wrote a script to process the supplementary information table based on what I could understand from the paper. That gives 1711 modification annotations for 941 genes. The PomBase dataset has 1006 annotations for 557 genes. UniProt has 3239 for 1099 genes. Below is a Venn diagram of the number of genes with modifications from the three datasets. The diagram doesn't make things less confusing. :-) Meanwhile the publication says:
|
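The gene counts behind a three-way Venn diagram like this can be reproduced with plain set operations. A minimal sketch, assuming each dataset has already been reduced to a set of gene identifiers; `venn3_regions` and the toy gene names are made up for the example and are not the real data.

```python
def venn3_regions(script: set, pombase: set, uniprot: set) -> dict:
    """Split three gene sets into the seven regions of a Venn diagram."""
    return {
        "script_only": script - pombase - uniprot,
        "pombase_only": pombase - script - uniprot,
        "uniprot_only": uniprot - script - pombase,
        "script_and_pombase": (script & pombase) - uniprot,
        "script_and_uniprot": (script & uniprot) - pombase,
        "pombase_and_uniprot": (pombase & uniprot) - script,
        "all_three": script & pombase & uniprot,
    }


# Toy input only; real input would be the gene IDs from each dataset
regions = venn3_regions(
    script={"g1", "g2", "g3"},
    pombase={"g2", "g3", "g4"},
    uniprot={"g3", "g4", "g5"},
)
for name, genes in regions.items():
    print(name, len(genes), sorted(genes))
```

Listing the members of a region, rather than only its size, makes it easier to look up individual genes in the supplementary table.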
This is bizarre! |
I'm looking at the information with the supp data. It says: "All phosphopeptides listed are the most likely peptides reported by SEQUEST. The phosphorylation sites, shown as a (#) and the site number are those determined most likely by the Ascore algorithm. An (*) on methionine denotes oxidation. The Ascore was run for all peptides, and the values can be read from left to right in the case of multiple phosphorylation sites. Sites with Ascore values <19 are considered ambiguous, while sites with Ascore values >19 are considered localized and are presented in green. “N/A” in the Ascore means that there is only one possible phosphorylation site in the amino acid sequence. After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins. An active link to all MS/MS spectra is given on each peptide and a link to the Ascore is available on that page." So, possibly we should only take the ones with Ascore values >1, OR go by "After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins". |
What's in the list of 33 that are found by us and UniProt, but are not in your script? If we can figure out the differences we can decide which parts of the Venn diagram to include. |
The PomBase one seems more conservative. Midori may have spoken with the author. Unfortunately due to the EBI we no longer have that archive. |
The data file has Ascore1, Ascore2 and Ascore3 columns to make it more challenging. :-) My script looks at each Ascore separately. If any of the three Ascore values is > 19 that site is included in the output. If the Ascore columns are N/A the site is also included. The numbers from the script don't match the numbers reported in the manuscript so I think I must have that wrong. |
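To make the rule described above concrete, here is a rough sketch of one reading of it. It is not the actual script: the function names are invented, and it assumes the three Ascore cells arrive as strings and that a blank cell means the same as "N/A" (no score). A per-site reading, where each Ascore is matched to its own site from left to right, would need different logic.

```python
from typing import List, Optional

ASCORE_THRESHOLD = 19.0  # sites with Ascore > 19 are treated as localized


def parse_ascore(raw: str) -> Optional[float]:
    """Return the numeric Ascore, or None for 'N/A' or a blank cell (assumption)."""
    text = raw.strip()
    if text == "" or text.upper() == "N/A":
        return None
    return float(text)


def site_is_included(ascores: List[str]) -> bool:
    """Apply the rule: include if any Ascore > 19, or if no Ascore is given."""
    numeric = [v for v in (parse_ascore(a) for a in ascores) if v is not None]
    if not numeric:
        # All N/A: only one possible phosphorylation site in the sequence
        return True
    return any(v > ASCORE_THRESHOLD for v in numeric)


# Toy rows standing in for the Ascore1, Ascore2 and Ascore3 columns
print(site_is_included(["25.3", "N/A", "N/A"]))
print(site_is_included(["10.0", "5.0", ""]))
print(site_is_included(["N/A", "N/A", "N/A"]))
```

Because the test is "any of the three", a row with one localized and one ambiguous site is kept as a whole, which may be one reason the counts differ from the manuscript.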
I looked at those 33 genes. These are them: Confusingly, 32 of them aren't in the spreadsheet from the publication at all even though we have data from PomBase and UniProt. I don't know what that means. :-( The one gene from the 33 that is in the spreadsheet is SPAPB1A10.09 (mod: S537). I'm very confused. |
I don't know if it helps, but there is a second spreadsheet (EVIN) and most of the missing entries are in there. Except these, i.e In UniProt these might have all been mapped to a single protein entry at this point, and we would split them to both of our identifiers. this leaves |
Let's discuss tomorrow. |
Actions:
|
They'll be filtered in Thursday night's load. |
I've generated a new version of that file. I've put it in an
The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them. |
I'm back looking at this again. The easiest way to process the modifications with Manu's code is to include the new annotations in the nightly load and then Manu's pipeline will run automatically. The annotations will be wrong in Chado and on the website for a day so I'll do that this weekend. |
That's done for Friday night's load. I'll check things at the weekend. |
Manu's code has reported 86 position errors now that all the modifications from PMID:18257517 are in Chado. I've made a new issue so we can close this long issue: |
Now that we get modifications from UniProt, there are exact duplicates. We should filter them like we filter GO, with the PomBase annotations taking priority.