Filter duplicate modifications from UniProt #1223

Closed
kimrutherford opened this issue Sep 24, 2024 · 36 comments

Comments

@kimrutherford
Member

Now that we get modifications from UniProt, there are exact duplicates. We should filter them like we filter GO, with the PomBase annotations taking priority.

@ValWood
Member

ValWood commented Sep 30, 2024

We should filter UniProt annotations if everything is the same for a specific gene/publication except the evidence code, because these will all be duplicates.

@ValWood
Member

ValWood commented Oct 10, 2024

I have marked this as high priority because once it is done I will share with UniProt (and can describe the extent of the overlap).

@ValWood ValWood changed the title Filter duplicate modifications Filter duplicate modifications from UniProt Oct 10, 2024
@kimrutherford
Member Author

Just to check:

We should only filter modifications from UniProt if the gene, term ID and publication are the same. Have I got that right?

Should we also filter UniProt annotations where there is a PomBase modification and the UniProt modification doesn't have a publication/reference?

@kimrutherford
Member Author

Sometimes there is a PomBase extension (like "present during cellular response to thiabendazole") but otherwise the annotation is identical from UniProt. In those cases the PomBase annotation is more specific so we can remove the UniProt annotation?

Note to self: the summary is that we can ignore extensions when looking for duplicates.

@ValWood
Member

ValWood commented Oct 11, 2024

That's correct. If the evidence is different but the paper and everything else is the same, we will filter it. These are cases where a different evidence code was selected for the same experiment, and I have queried this with UniProt (for example, they have used a manual code for an HTP experiment).

@ValWood
Member

ValWood commented Oct 11, 2024

Sometimes there is a PomBase extension (like "present during cellular response to thiabendazole") but otherwise the annotation is identical from UniProt. In those cases the PomBase annotation is more specific so we can remove the UniProt annotation?

Absolutely!
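
For readers following the thread, here is a minimal sketch of the duplicate rule agreed above. It is not the code from the commits referenced below. The `Modification` record, its field names and the `match_missing_pmid` switch are assumptions made for illustration. Including the residue in the comparison key is also an assumption, taken from "everything else is the same"; evidence codes and extensions are deliberately left out of the key.

```python
from typing import NamedTuple, Optional


class Modification(NamedTuple):
    """Hypothetical record; field names are illustrative only."""
    gene: str                  # systematic ID, e.g. "SPBC32F12.09"
    term_id: str               # PSI-MOD term, e.g. "MOD:00046"
    residue: str               # modified position, e.g. "S13"
    pmid: Optional[str]        # None when the source gives no reference
    evidence: str = ""         # ignored when looking for duplicates
    extension: str = ""        # ignored when looking for duplicates


def filter_uniprot_duplicates(pombase, uniprot, match_missing_pmid=True):
    """Return (kept, removed) UniProt modifications; PomBase takes priority."""
    keys_with_pmid = {(m.gene, m.term_id, m.residue, m.pmid) for m in pombase}
    keys_any_pmid = {(m.gene, m.term_id, m.residue) for m in pombase}

    kept, removed = [], []
    for m in uniprot:
        if m.pmid is None:
            # Assumption: a UniProt row without a reference counts as a
            # duplicate if PomBase has the same modification from any paper.
            is_dup = match_missing_pmid and (m.gene, m.term_id, m.residue) in keys_any_pmid
        else:
            is_dup = (m.gene, m.term_id, m.residue, m.pmid) in keys_with_pmid
        (removed if is_dup else kept).append(m)
    return kept, removed


# Illustrative values only: the evidence strings and the "T5" row are made up.
pombase = [Modification("SPBC32F12.09", "MOD:00046", "S13", "PMID:12135491", "Unknown")]
uniprot = [
    Modification("SPBC32F12.09", "MOD:00046", "S13", "PMID:12135491", "manual"),
    Modification("SPBC32F12.09", "MOD:00046", "S13", None, "manual"),
    Modification("SPBC32F12.09", "MOD:00046", "T5", "PMID:12135491", "manual"),
]
kept, removed = filter_uniprot_duplicates(pombase, uniprot)
```

With these inputs the first two UniProt rows are removed and only the `T5` row is kept. Setting `match_missing_pmid=False` would keep the row that has no reference.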

@kimrutherford

This comment was marked as resolved.

@kimrutherford
Member Author

My first pass at the code finds only 618 duplicate modifications out of a total of 3983 from UniProt.

593 of the duplicates are modifications from PMID:18257517. There is one duplicate from PMID:12135491, which is the one with Unknown evidence in the previous comment. And the remaining 45 duplicates don't have a PMID in the UniProt data.

I'm still checking to make sure that's all correct.

@kimrutherford
Member Author

And the remaining 45 duplicates don't have a PMID in the UniProt data.

I got that wrong. There are 24 that don't have a PMID.

kimrutherford added a commit that referenced this issue Oct 11, 2024
This adds the code to delete the redundant feature_cvterm rows.

Refs #1223
kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 11, 2024
Delete modifications from UniProt if there is an existing PomBase
modification annotation.

Refs pombase/pombase-chado#1223
@ValWood
Member

ValWood commented Oct 11, 2024

It's weird that UniProt only have 593 from PMID:18257517. (we have 1010).
I just checked and we only have one extension.
I wonder why UniProt eliminated ~400. Maybe they used some threshold?

@Antonialock can you think of a reason why UniProt might only import a subset of modifications from a publication?

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 11, 2024
@ValWood
Member

ValWood commented Oct 11, 2024

From the abstract
In total, 2887 distinct phosphorylation sites were identified from 1194 proteins with an estimated false-discovery rate of <0.5% at the peptide level.

I don't know why our input file has only 1194 proteins when there were 2887 unique sites with a low FP rate.
But I always thought this dataset was larger than 1194...

@kimrutherford
Member Author

It's weird that UniProt only have 593 from PMID:18257517.

UniProt have 1640 in total from PMID:18257517 and 593 are duplicates. That's very odd because it would mean PomBase and UniProt have about 1000 unique annotations each from PMID:18257517. I'll dig into that because that sounds like my code is nonsense. :-)

@kimrutherford
Member Author

UniProt have 1640 in total from PMID:18257517

It's 2233 not 1640. I should go to bed. :-)

and 593 are duplicates

That bit is correct (I think).
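
As a cross-check on the figures quoted so far, the arithmetic can be written out as a short script. Every number below is copied from the comments above; only the variable names are invented.

```python
# Figures quoted in the comments above.
uniprot_total = 3983
duplicates_total = 618
duplicates_by_source = {"PMID:18257517": 593, "PMID:12135491": 1, "no PMID": 24}

# The corrected breakdown (24 without a PMID, not 45) adds up to the total.
assert sum(duplicates_by_source.values()) == duplicates_total

uniprot_18257517 = 2233   # corrected UniProt total for PMID:18257517
pombase_18257517 = 1010   # PomBase total quoted by ValWood
shared = duplicates_by_source["PMID:18257517"]

uniprot_only = uniprot_18257517 - shared   # 1640
pombase_only = pombase_18257517 - shared   # 417
print(uniprot_only, pombase_only)
```

The UniProt-only count comes out at 1640, the same value as the figure quoted first. That may be where that number came from, but the thread does not confirm it. The PomBase-only count of 417 is consistent with the "~400" mentioned earlier.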

@ValWood
Member

ValWood commented Oct 11, 2024

I don't think the dataset can have been fully parsed for Chado ingest.
There is a note on the session "This session has a message to curators: protein phosphorylation done in bulk format only other thing that might be curatable some day is some phosphorylation motifs"
but it does not mention any reason why the total dataset was not included.

@ValWood
Member

ValWood commented Oct 11, 2024

Yes go to bed!

@ValWood
Member

ValWood commented Oct 11, 2024

  • Could we delete the PomBase annotation? It's in pombe-embl/supporting_files/legacy_modifications_from_contigs.tsv:
SPBC32F12.09    rum1    MOD:00046       Unknown S13             PMID:12135491   4896    2009-02-13

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 12, 2024
@kimrutherford
Member Author

I added the step to remove duplicate modifications to the load script for last night.
The removed UniProt annotations are in this log file:
https://curation.pombase.org/dumps/builds/pombase-build-2024-10-12/logs/log.2024-10-12-04-39-42.modification-filter-duplicates

@Antonialock

This comment was marked as outdated.

@ValWood

This comment was marked as outdated.

@kimrutherford
Member Author

I've had a look at the paper and the data table in the supplementary information. I can't work out how the UniProt annotations or the PomBase annotations were extracted from the data.

@kimrutherford
Member Author

I've had a look at the paper and the data table in the supplementary information. I can't work out how the UniProt annotations or the PomBase annotations were extracted from the data.

I wrote a script to process the supplementary information table based on what I could understand from the paper. That gives 1711 modification annotations for 941 genes.

The PomBase dataset has 1006 annotations for 557 genes.

UniProt has 3239 for 1099 genes.

Below is a Venn diagram of the number of genes with modifications from the three datasets. The diagram doesn't make things less confusing. :-)

Meanwhile the publication says:

In total, 2887 distinct phosphorylation sites were identified from 1194 proteins

[Image: Venn diagram of the number of genes with modifications in the three datasets]
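
A three-way comparison like the one in the Venn diagram can be reproduced with plain set operations. This is only a sketch: the function name and region labels are made up, and the gene IDs in the example are placeholders rather than real data.

```python
def venn_regions(pombase, uniprot, script):
    """Split three gene sets into the seven regions of a Venn diagram."""
    p, u, s = set(pombase), set(uniprot), set(script)
    return {
        "pombase_only": p - u - s,
        "uniprot_only": u - p - s,
        "script_only": s - p - u,
        "pombase_uniprot": (p & u) - s,   # in both databases, not in the script output
        "pombase_script": (p & s) - u,
        "uniprot_script": (u & s) - p,
        "all_three": p & u & s,
    }


# Placeholder gene IDs, for illustration only.
regions = venn_regions(
    pombase={"g1", "g2", "g3"},
    uniprot={"g2", "g3", "g4"},
    script={"g3", "g5"},
)
for name, genes in regions.items():
    print(name, len(genes))
```

Listing the members of a region, rather than just its size, is what makes it possible to inspect individual genes that appear in one dataset but not another.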

@ValWood
Member

ValWood commented Oct 14, 2024

This is bizarre!

@ValWood
Member

ValWood commented Oct 14, 2024

I'm looking at the information with the supp data. It says

All phosphopeptides listed are the most likely peptides reported by SEQUEST. The phosphorylation sites, shown as a (#) and the site number are those determined most likely by the Ascore algorithm. An (*) on methionine denotes oxidation. The Ascore was run for all peptides, and the values can be read from left to right in the case of multiple phosphorylation sites. Sites with Ascore values <19 are considered ambiguous, while sites with Ascore values >19 are considered localized and are presented in green. “N/A” in the Ascore means that there is only one possible phosphorylation site in the amino acid sequence. After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins. An active link to all MS/MS spectra is given on each peptide and a link to the Ascore is available on that page.

So possibly we should only take the ones with Ascore values >19, OR
“N/A” in the Ascore means that there is only one possible phosphorylation site in the amino acid sequence.

After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins.
probably includes all of the phosphosites, even the ones that could not be unambiguously located.

@ValWood
Member

ValWood commented Oct 14, 2024

What's in the list of 33 that are found by us and UniProt, but are not in your script?

If we can figure out the differences we can decide which parts of the Venn to include.

@ValWood
Member

ValWood commented Oct 14, 2024

The PomBase one seems more conservative. Midori may have spoken with the author. Unfortunately, due to the EBI, we no longer have that archive.

@kimrutherford
Member Author

So possibly we should only take the ones with Ascore values >19, OR
“N/A” in the Ascore means that there is only one possible phosphorylation site in the amino acid sequence.

The data file has Ascore1, Ascore2 and Ascore3 columns to make it more challenging. :-)

My script looks at each Ascore separately. If any of the three Ascore values is > 19 that site is included in the output. If the Ascore columns are N/A the site is also included.

The numbers from the script don't match the numbers reported in the manuscript so I think I must have that wrong.
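
For reference, one reading of the filter described above can be sketched as follows. This is not the actual script. It assumes that site i on a peptide pairs with Ascore i, that an unlocalizable score is written as the text "N/A", and that the cut-off is strictly greater than 19. The function and parameter names are hypothetical.

```python
CUTOFF = 19.0  # from the supplementary notes: Ascore > 19 is "localized"


def parse_ascore(value):
    """Return None for an "N/A" cell, otherwise the score as a float."""
    text = value.strip()
    if text.upper() == "N/A":
        return None
    return float(text)


def localized_sites(sites, ascores, cutoff=CUTOFF):
    """Keep site i when Ascore i is N/A or above the cut-off.

    `sites` and `ascores` are parallel lists for one peptide row
    (up to three entries each, matching Ascore1..Ascore3).
    """
    kept = []
    for site, raw in zip(sites, ascores):
        score = parse_ascore(raw)
        if score is None or score > cutoff:
            kept.append(site)
    return kept


# Made-up scores: only the first site passes the cut-off.
print(localized_sites(["S10", "T12"], ["25.3", "7.0"]))
# N/A means there is only one possible site, so it is kept.
print(localized_sites(["S13"], ["N/A"]))
```

If the real script instead keeps every site on a row as soon as any one of the three scores passes, the output would be larger, so the pairing assumption is worth checking when comparing counts with the manuscript.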

@kimrutherford
Member Author

What's in the list of 33 that are found by us and UniProt, but are not in your script?

If we can figure out the differences we can decide which parts of the Venn to include.

I looked at those 33 genes. These are them:
https://www.pombase.org/results/from/id/6e05f643-cf3d-42eb-93d5-4cd620ccf7d7

Confusingly, 32 of them aren't in the spreadsheet from the publication at all even though we have data from PomBase and UniProt. I don't know what that means. :-(

The one gene from the 33 that is in the spreadsheet is: SPAPB1A10.09 mod: S537
It's excluded by my script because Ascore1 is 0.01
The S537 modification appears in three other datasets apart from PMID:18257517 so seems correct?

I'm very confused.

@ValWood
Member

ValWood commented Oct 14, 2024

I don't know if it helps but there is a second spreadsheet (EVIN) and most of the missing entries are in there.

Except these,
https://www.pombase.org/results/from/id/531ff02d-cc63-490e-98ff-c14494b68cf4
and these seem to be special because they are mainly exact (or close) duplicates of entries that are in the other set...

i.e.
rpn502 = rpn501
rps1501 = rps502
rps1602 = rps1601
rps1802 = rps1801
rpl2401 = rpl2402
rpl401 = rpl402
rpl502 = rpl501
ubi4 = ubi1 = ubi2 = ubi3 (at least the ubiquitin part will be identical, so mass spec would not be able to differentiate them)
tif512 = tif511

In UniProt these might all have been mapped to a single protein entry at this point, and we would split them to both of our identifiers.

this leaves
ssa1
spk1
SPAC750.01
as 'magic-ed from nowhere'
I will dig further into these...
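
The paralogue explanation above can be checked mechanically. The sketch below takes the groups exactly as listed in this comment and reports which "missing" genes have a partner present in the other set. The function name and the contents of `present` in the example are assumptions for illustration.

```python
# Groups copied from the list in the comment above, as written.
PARALOG_GROUPS = [
    {"rpn502", "rpn501"},
    {"rps1501", "rps502"},
    {"rps1602", "rps1601"},
    {"rps1802", "rps1801"},
    {"rpl2401", "rpl2402"},
    {"rpl401", "rpl402"},
    {"rpl502", "rpl501"},
    {"ubi4", "ubi1", "ubi2", "ubi3"},
    {"tif512", "tif511"},
]


def explain_by_paralog(missing, present, groups=PARALOG_GROUPS):
    """Split `missing` genes into those with a paralogue in `present` and the rest."""
    present = set(present)
    explained, unexplained = {}, []
    for gene in sorted(missing):
        partners = set()
        for group in groups:
            if gene in group:
                partners |= (group - {gene}) & present
        if partners:
            explained[gene] = sorted(partners)
        else:
            unexplained.append(gene)
    return explained, unexplained


# `present` here is a made-up stand-in for the genes found in the spreadsheets.
explained, unexplained = explain_by_paralog(
    missing=["rpn502", "ubi4", "ssa1", "spk1", "SPAC750.01"],
    present={"rpn501", "ubi1"},
)
print(explained)
print(unexplained)
```

Genes that end up in `unexplained` are the ones that still need a manual look.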

@ValWood
Member

ValWood commented Oct 14, 2024

This is how SPAC750.01 aligns to SPAC977.14C
so I am guessing these fragments would not map unambiguously

[Screenshot: alignment of SPAC750.01 to SPAC977.14C]

@ValWood
Member

ValWood commented Oct 14, 2024

Let's discuss tomorrow.

@kimrutherford
Member Author

kimrutherford commented Oct 15, 2024

Actions:

  • add annotations from the script to pombe-embl/external_data/modification_files/PMID_18257517_modifications.tsv
  • don't load annotations for PMID:18257517 from UniProt data file because we suspect they include modifications that are below the score cut-off from the paper (228 genes)
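
The second action amounts to skipping one publication when reading the UniProt file. A minimal sketch, assuming each UniProt record is a dict with a `"pmid"` key (a hypothetical layout, not the real loader's data structure):

```python
EXCLUDED_UNIPROT_PMIDS = {"PMID:18257517"}


def drop_excluded_publications(records, excluded=EXCLUDED_UNIPROT_PMIDS):
    """Yield UniProt records whose reference is not in the excluded set."""
    for record in records:
        if record.get("pmid") not in excluded:
            yield record


# Illustrative rows; gene IDs and residues are placeholders.
rows = [
    {"gene": "geneA", "residue": "S10", "pmid": "PMID:18257517"},
    {"gene": "geneB", "residue": "T22", "pmid": "PMID:12135491"},
    {"gene": "geneC", "residue": "S5", "pmid": None},
]
loaded = list(drop_excluded_publications(rows))
```

Records without a reference pass through unchanged, so this step is independent of the duplicate filter.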

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Oct 31, 2024
kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 31, 2024
@kimrutherford
Member Author

don't load annotations for PMID:18257517 from UniProt data file because we suspect they include modifications that are below the score cut-off from the paper (228 genes)

They'll be filtered in Thursday night's load.

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Oct 31, 2024
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Oct 31, 2024
kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 31, 2024
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Oct 31, 2024
@kimrutherford
Member Author

add annotations from the script to pombe-embl/external_data/modification_files/PMID_18257517_modifications.tsv

I've generated a new version of that file. I've put it in an in_progress directory for now in SVN:

external_data/modification_files/in_progress/PMID_18257517_modifications.tsv

The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them.

@kimrutherford
Member Author

I've generated a new version of that file. I've put it in an in_progress directory for now in SVN:
external_data/modification_files/in_progress/PMID_18257517_modifications.tsv
The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them.

I'm back looking at this again. The easiest way to process the modifications with Manu's code is to include the new annotations in the nightly load and then Manu's pipeline will run automatically.

The annotations will be wrong in Chado and on the website for a day so I'll do that this weekend.

@kimrutherford
Member Author

The easiest way to process the modifications with Manu's code is to include the new annotations in the nightly load and then Manu's pipeline will run automatically.

The annotations will be wrong in Chado and on the website for a day so I'll do that this weekend.

That's done for Friday night's load. I'll check things at the weekend.

@kimrutherford
Member Author

The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them.

Manu's code has reported 86 position errors now that all the modifications from PMID:18257517 are in Chado.
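
The thread does not say how Manu's code detects a position error. One plausible check, shown here purely as an assumption, is that the residue letter in a position such as "S13" must match the amino acid at that 1-based index in the protein sequence. The function name and the example sequence are invented.

```python
import re


def position_matches(position, protein_sequence):
    """True when e.g. "S13" names the residue found at 1-based index 13."""
    match = re.fullmatch(r"([A-Z])(\d+)", position)
    if match is None:
        raise ValueError(f"unrecognised position: {position!r}")
    residue, index = match.group(1), int(match.group(2))
    if not 1 <= index <= len(protein_sequence):
        return False
    return protein_sequence[index - 1] == residue


# Made-up five-residue sequence, for illustration only.
sequence = "MASTK"
print(position_matches("S3", sequence))   # S is at index 3
print(position_matches("S2", sequence))   # index 2 is A
```

A check like this would flag positions that were numbered against an older or different version of the protein sequence.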

I've made a new issue so we can close this long issue:
