Filter duplicate modifications from UniProt #1223

kimrutherford · 2024-09-24T21:15:29Z

Now that we get modifications from UniProt, there are exact duplicates. We should filter them the way we filter GO annotations, with the PomBase annotations taking priority.

ValWood · 2024-09-30T07:19:48Z

We should filter UniProt annotations if everything is the same for a specific gene/publication except the evidence code, because these will all be duplicates.

ValWood · 2024-10-10T07:51:17Z

I have put this as high priority because once that is done I will share with UniProt (and can describe the extent of the overlap).

kimrutherford · 2024-10-11T10:26:52Z

Just to check:

We should only filter modifications from UniProt if the gene, term ID and publication are the same. Have I got that right?

Should we also filter UniProt annotations where there is a PomBase modification and the UniProt modification doesn't have a publication/reference?

kimrutherford · 2024-10-11T10:37:16Z

Sometimes there is a PomBase extension (like "present during cellular response to thiabendazole") but otherwise the annotation is identical from UniProt. In those cases the PomBase annotation is more specific so we can remove the UniProt annotation?

Note to self: the summary is that we can ignore extensions when looking for duplicates.
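The rule discussed so far can be sketched roughly as below. This is only an illustration in Python, not the code that runs in the load: the `Modification` type, its field names, the gene and term IDs and the evidence strings are all made up for the example. It assumes the duplicate key is gene, term ID and publication, and that evidence and extensions are ignored. It doesn't handle the open question about UniProt annotations with no reference.

```python
from typing import NamedTuple, Optional


class Modification(NamedTuple):
    # Hypothetical record layout, for illustration only.
    gene: str
    term_id: str
    reference: Optional[str]  # e.g. "PMID:18257517", or None if missing
    evidence: str
    extension: str            # "" when there is no extension
    source: str               # "PomBase" or "UniProt"


def duplicate_key(mod):
    # Evidence and extension are deliberately left out of the key.
    return (mod.gene, mod.term_id, mod.reference)


def filter_uniprot_duplicates(mods):
    """Return (kept, removed); PomBase annotations always take priority."""
    pombase_keys = {duplicate_key(m) for m in mods if m.source == "PomBase"}
    kept, removed = [], []
    for m in mods:
        if m.source == "UniProt" and duplicate_key(m) in pombase_keys:
            removed.append(m)
        else:
            kept.append(m)
    return kept, removed


# Made-up example data: the second row differs from the first only in
# evidence and extension, so it counts as a duplicate.
mods = [
    Modification("geneA", "MOD:1", "PMID:1", "evidence X", "present during Y", "PomBase"),
    Modification("geneA", "MOD:1", "PMID:1", "evidence Z", "", "UniProt"),
    Modification("geneA", "MOD:2", "PMID:1", "evidence Z", "", "UniProt"),
]
kept, removed = filter_uniprot_duplicates(mods)
print(len(kept), len(removed))
```

Building the set of PomBase keys first means the UniProt rows can be checked in a single pass, and the removed rows are kept in a list so they can be logged.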

ValWood · 2024-10-11T10:48:31Z

That's correct. If the evidence is different, but the paper and everything else is the same, we will filter it. These are cases where a different evidence code was selected for the same experiment, and I have queried this with UniProt (for example, they have used a manual code for an HTP experiment).

ValWood · 2024-10-11T10:49:01Z

> Sometimes there is a PomBase extension (like "present during cellular response to thiabendazole") but otherwise the annotation is identical from UniProt. In those cases the PomBase annotation is more specific so we can remove the UniProt annotation?

Absolutely!

kimrutherford · 2024-10-11T12:15:19Z

My first pass at the code finds only 618 duplicate modifications from a total from UniProt of 3983.

593 of the duplicates are modifications from PMID:18257517. There is one duplicate from PMID:12135491, which is the one with Unknown evidence in the previous comment. And the remaining 45 duplicates don't have a PMID in the UniProt data.

I'm still checking to make sure that's all correct.

kimrutherford · 2024-10-11T12:16:36Z

> And the remaining 45 duplicates don't have a PMID in the UniProt data.

I got that wrong. There are 24 that don't have a PMID.
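As a quick sanity check, the corrected breakdown does add up to the 618 reported above. The snippet only re-uses the counts from this thread; the dictionary labels are just for illustration.

```python
from collections import Counter

# Duplicate counts per reference as reported in this thread (after the correction).
duplicates_by_reference = Counter({
    "PMID:18257517": 593,
    "PMID:12135491": 1,
    "no PMID": 24,
})

total_duplicates = sum(duplicates_by_reference.values())
total_uniprot = 3983  # total modifications from UniProt

print(total_duplicates)  # 618
print(f"{100 * total_duplicates / total_uniprot:.1f}% of the UniProt modifications")
```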


ValWood · 2024-10-11T12:42:05Z

It's weird that UniProt only have 593 from PMID:18257517 (we have 1010).
I just checked and we only have one extension.
I wonder why UniProt eliminated ~400. Maybe they used some threshold?

@Antonialock can you think of a reason why UniProt might only import a subset of modifications from a publication?


ValWood · 2024-10-11T12:44:06Z

From the abstract:

> In total, 2887 distinct phosphorylation sites were identified from 1194 proteins with an estimated false-discovery rate of <0.5% at the peptide level.

I don't know why our input file has only 1194 proteins when there were 2887 unique sites with a low FP rate.
But I always thought this dataset was larger than 1194...

kimrutherford · 2024-10-11T12:50:29Z

> It's weird that UniProt only have 593 from PMID:18257517.

UniProt have 1640 in total from PMID:18257517 and 593 are duplicates. That's very odd because it would mean PomBase and UniProt have about 1000 unique annotations each from PMID:18257517. I'll dig into that because that sounds like my code is nonsense. :-)

kimrutherford · 2024-10-11T12:54:09Z

> UniProt have 1640 in total from PMID:18257517

It's 2233 not 1640. I should go to bed. :-)

> and 593 are duplicates

That bit is correct (I think).

ValWood · 2024-10-11T12:55:48Z

I don't think the dataset can have been fully parsed for Chado ingest.
There is a note on the session: "This session has a message to curators: protein phosphorylation done in bulk format only other thing that might be curatable some day is some phosphorylation motifs"
but it does not mention any reason why the total dataset was not included.

ValWood · 2024-10-11T12:55:59Z

Yes go to bed!

ValWood · 2024-10-11T15:17:37Z

Could we delete the PomBase annotation? It's in pombe-embl/supporting_files/legacy_modifications_from_contigs.tsv:

SPBC32F12.09    rum1    MOD:00046       Unknown S13             PMID:12135491   4896    2009-02-13


kimrutherford · 2024-10-12T08:04:28Z

I added the step to remove duplicate modifications to the load script for last night.
The removed UniProt annotations are in this log file:
https://curation.pombase.org/dumps/builds/pombase-build-2024-10-12/logs/log.2024-10-12-04-39-42.modification-filter-duplicates

kimrutherford · 2024-10-13T20:46:37Z

I've had a look at the paper and the data table in the supplementary information. I can't work out how the UniProt annotations or the PomBase annotations were extracted from the data.

kimrutherford · 2024-10-14T09:31:32Z

> I've had a look at the paper and the data table in the supplementary information. I can't work out how the UniProt annotations or the PomBase annotations were extracted from the data.

I wrote a script to process the supplementary information table based on what I could understand from the paper. That gives 1711 modification annotations for 941 genes.

The PomBase dataset has 1006 annotations for 557 genes.

UniProt has 3239 for 1099 genes.

Below is a Venn diagram of the number of genes with modifications from the three datasets. The diagram doesn't make things less confusing. :-)
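For reference, the per-region gene counts of a three-way Venn diagram can be computed with plain set operations. This is a minimal sketch; the function name and the toy gene IDs are made up, and the real inputs would be the gene sets from the script output, the PomBase file and the UniProt file.

```python
def venn3_counts(script, pombase, uniprot):
    """Sizes of the seven regions of a three-set Venn diagram."""
    return {
        "script only": len(script - pombase - uniprot),
        "PomBase only": len(pombase - script - uniprot),
        "UniProt only": len(uniprot - script - pombase),
        "script + PomBase": len((script & pombase) - uniprot),
        "script + UniProt": len((script & uniprot) - pombase),
        "PomBase + UniProt": len((pombase & uniprot) - script),
        "all three": len(script & pombase & uniprot),
    }


# Toy gene sets, for illustration only.
script_genes = {"g1", "g2", "g3", "g4"}
pombase_genes = {"g2", "g3", "g5"}
uniprot_genes = {"g3", "g4", "g5", "g6"}

counts = venn3_counts(script_genes, pombase_genes, uniprot_genes)
for region, size in counts.items():
    print(f"{region}: {size}")
```

The "PomBase + UniProt" region is the one containing genes found by both databases but not by the script.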

Meanwhile the publication says:

> In total, 2887 distinct phosphorylation sites were identified from 1194 proteins

ValWood · 2024-10-14T10:12:17Z

This is bizarre!

ValWood · 2024-10-14T10:23:06Z

I'm looking at the information with the supp data. It says:

> All phosphopeptides listed are the most likely peptides reported by SEQUEST. The phosphorylation sites, shown as a (#) and the site number are those determined most likely by the Ascore algorithm. An (*) on methionine denotes oxidation. The Ascore was run for all peptides, and the values can be read from left to right in the case of multiple phosphorylation sites. Sites with Ascore values <19 are considered ambiguous, while sites with Ascore values >19 are considered localized and are presented in green. “N/A” in the Ascore means that there is only one possible phosphorylation site in the amino acid sequence. After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins. An active link to all MS/MS spectra is given on each peptide and a link to the Ascore is available on that page.

So possibly we should only take the ones with Ascore values >19, or those where the Ascore is “N/A” (meaning there is only one possible phosphorylation site in the amino acid sequence).

The statement "After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins" probably includes all of the phosphosites, even the ones that could not be unambiguously located.

ValWood · 2024-10-14T10:24:22Z

What's in the list of 33 that are found by us and UniProt, but are not in your script?

If we can figure out the differences we can decide which parts of the venn to include.

ValWood · 2024-10-14T10:25:37Z

The PomBase one seems more conservative. Midori may have spoken with the author. Unfortunately due to the EBI we no longer have that archive.

kimrutherford · 2024-10-14T10:45:45Z

> So possibly we should only take the ones with Ascore values >19, or those where the Ascore is “N/A” (meaning there is only one possible phosphorylation site in the amino acid sequence).

The data file has Ascore1, Ascore2 and Ascore3 columns to make it more challenging. :-)

My script looks at each Ascore separately. If any of the three Ascore values is > 19 that site is included in the output. If the Ascore columns are N/A the site is also included.
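The rule described above could look something like this in Python. It's a sketch of one reading, not the actual script: it assumes the i-th site of a peptide goes with the i-th Ascore column (read left to right), keeps a site when its own score is above 19 or is "N/A", and uses made-up site names. A different reading, keeping every site of a peptide if any one score passes, would give different counts.

```python
ASCORE_CUTOFF = 19.0  # sites scoring above this are considered localized


def localized_sites(sites, ascores):
    """Pair each site with its Ascore, left to right, and keep the confident ones.

    `sites` is a list like ["S10", "T12"]; `ascores` holds the raw strings from
    the Ascore1..Ascore3 columns, which may be numbers, "N/A" or empty.
    """
    kept = []
    for site, score in zip(sites, ascores):
        score = score.strip()
        if score == "N/A":
            # Only one possible phosphorylation site, so nothing to disambiguate.
            kept.append(site)
        elif score and float(score) > ASCORE_CUTOFF:
            kept.append(site)
    return kept


# Made-up rows, for illustration only.
print(localized_sites(["S10", "T12"], ["25.3", "4.1", ""]))  # ['S10']
print(localized_sites(["S5"], ["N/A", "", ""]))              # ['S5']
print(localized_sites(["S7"], ["0.01", "", ""]))             # []
```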

The numbers from the script don't match the numbers reported in the manuscript so I think I must have that wrong.

kimrutherford · 2024-10-14T11:21:48Z

> What's in the list of 33 that are found by us and UniProt, but are not in your script?
>
> If we can figure out the differences we can decide which parts of the venn to include.

I looked at those 33 genes. These are them:
https://www.pombase.org/results/from/id/6e05f643-cf3d-42eb-93d5-4cd620ccf7d7

Confusingly, 32 of them aren't in the spreadsheet from the publication at all even though we have data from PomBase and UniProt. I don't know what that means. :-(

The one gene from the 33 that is in the spreadsheet is: SPAPB1A10.09 mod: S537
It's excluded by my script because Ascore1 is 0.01
The S537 modification appears in three other datasets apart from PMID:18257517 so seems correct?

I'm very confused.

ValWood · 2024-10-14T12:26:49Z

I don't know if it helps but there is a second spreadsheet (EVIN) and most of the missing entries are in there.

Except these,
https://www.pombase.org/results/from/id/531ff02d-cc63-490e-98ff-c14494b68cf4
and these seem to be special because they are mainly exact (or close) duplicates of entries that are in the other set...

i.e.
rpn502 = rpn501
rps1501 = rps502
rps1602 = rps1601
rps1802 = rps1801
rpl2401 = rpl2402
rpl401 = rpl402
rpl502 = rpl501
ubi4 = ubi1 = ubi2 = ubi3 (at least, the ubiquitin part will be identical, so mass spec would not be able to differentiate them)
tif512 = tif511

In UniProt these might all have been mapped to a single protein entry at this point, whereas we would split them across both of our identifiers.

this leaves
ssa1
spk1
SPAC750.01
as 'magic-ed from nowhere'
I will dig further into these...

ValWood · 2024-10-14T12:27:19Z

This is how SPAC750.01 aligns to SPAC977.14C
so I am guessing these fragments would not map unambiguously

ValWood · 2024-10-14T12:28:59Z

Let's discuss tomorrow.

kimrutherford · 2024-10-15T15:25:07Z

Actions:

add annotations from the script to pombe-embl/external_data/modification_files/PMID_18257517_modifications.tsv
don't load annotations for PMID:18257517 from UniProt data file because we suspect they include modifications that are below the score cut-off from the paper (228 genes)
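The second action amounts to dropping rows by reference before loading. A minimal Python sketch of that idea is below; the function name, the row layout and the example rows are hypothetical and this is not the loader code.

```python
EXCLUDED_REFERENCES = {"PMID:18257517"}


def filter_references(rows, excluded=EXCLUDED_REFERENCES):
    """Drop rows whose reference is in the excluded set; keep everything else."""
    return [row for row in rows if row.get("reference") not in excluded]


# Made-up rows standing in for the UniProt data file.
uniprot_rows = [
    {"gene": "geneA", "term_id": "MOD:1", "reference": "PMID:18257517"},
    {"gene": "geneB", "term_id": "MOD:1", "reference": "PMID:1"},
    {"gene": "geneC", "term_id": "MOD:2", "reference": None},
]

to_load = filter_references(uniprot_rows)
print([row["gene"] for row in to_load])  # ['geneB', 'geneC']
```

Rows without a reference are kept here, since the action only mentions PMID:18257517.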


kimrutherford · 2024-10-31T00:09:10Z

> don't load annotations for PMID:18257517 from UniProt data file because we suspect they include modifications that are below the score cut-off from the paper (228 genes)

They'll be filtered in Thursday night's load.


kimrutherford · 2024-10-31T01:54:18Z

> add annotations from the script to pombe-embl/external_data/modification_files/PMID_18257517_modifications.tsv

I've generated a new version of that file. I've put it in an in_progress directory for now in SVN:

external_data/modification_files/in_progress/PMID_18257517_modifications.tsv

The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them.

kimrutherford · 2024-11-18T04:44:01Z

> I've generated a new version of that file. I've put it in an in_progress directory for now in SVN:
> external_data/modification_files/in_progress/PMID_18257517_modifications.tsv
> The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them.

I'm back looking at this again. The easiest way to process the modifications with Manu's code is to include the new annotations in the nightly load and then Manu's pipeline will run automatically.

The annotations will be wrong in Chado and on the website for a day so I'll do that this weekend.

kimrutherford · 2024-11-22T03:47:25Z

> The easiest way to process the modifications with Manu's code is to include the new annotations in the nightly load and then Manu's pipeline will run automatically.
>
> The annotations will be wrong in Chado and on the website for a day so I'll do that this weekend.

That's done for Friday night's load. I'll check things at the weekend.

kimrutherford · 2025-01-09T03:07:29Z

> The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them.

Manu's code has reported 86 position errors now that all the modifications from PMID:18257517 are in Chado.
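For anyone reading along, a position error presumably means that the annotated residue doesn't match the protein sequence at that position. The sketch below only illustrates that kind of check. It is not Manu's code, and the function name and the toy sequence are made up.

```python
import re


def position_matches(protein_sequence, residue_position):
    """Return True if e.g. "S13" names the residue actually found at position 13.

    Positions are 1-based, as in the modification files.
    """
    match = re.fullmatch(r"([A-Z])(\d+)", residue_position)
    if match is None:
        return False
    residue, position = match.group(1), int(match.group(2))
    if not 1 <= position <= len(protein_sequence):
        return False
    return protein_sequence[position - 1] == residue


toy_sequence = "MASTSK"  # made-up sequence: M1 A2 S3 T4 S5 K6
print(position_matches(toy_sequence, "S3"))   # True
print(position_matches(toy_sequence, "S4"))   # False, position 4 is T
print(position_matches(toy_sequence, "S99"))  # False, beyond the sequence end
```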

I've made a new issue so we can close this long issue:

Auto-fix the new modification position errors from Manu's allele QC code #1250

kimrutherford self-assigned this Sep 24, 2024

kimrutherford mentioned this issue Sep 29, 2024

Filter imported duplicates where only evidence code is different #1228

Closed

kimrutherford mentioned this issue Oct 6, 2024

Modifications, add assigned by column #1232

Closed

ValWood added the high priority label Oct 10, 2024

ValWood changed the title ~~Filter duplicate modifications~~ Filter duplicate modifications from UniProt Oct 10, 2024

ValWood added the filtering label Oct 10, 2024

kimrutherford mentioned this issue Oct 11, 2024

Add "assigned_by" property in Chado for all modifcations from PomBase #1233

Closed


kimrutherford added a commit that referenced this issue Oct 11, 2024

Add process for filtering redundant modifications

c060c55

Refs #1223

kimrutherford added a commit that referenced this issue Oct 11, 2024

Add missing deletion code to ModificationFilter

2cddccc

This adds the code to delete the redundant feature_cvterm rows. Refs #1223

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 11, 2024

Add process to filter duplicate modifications

1b9541d

Delete modifications from UniProt if there is an existing PomBase modification annotation. Refs pombase/pombase-chado#1223

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 11, 2024

Fix logging of duplicate modifications

50f77f5

Refs pombase/pombase-chado#1223

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 12, 2024

Fix typo in modification-filter process call

3fd0f19

Refs pombase/pombase-chado#1223


kimrutherford added the next label Oct 15, 2024

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Oct 31, 2024

Add --filter-references for annotation-create

12e387b

Refs pombase/pombase-chado#1223

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 31, 2024

Add --filter-references for annotation-create

88485b4

Refs pombase/pombase-chado#1223

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Oct 31, 2024

Add UniProt data reference filter flag

6149253

Refs pombase/pombase-chado#1223

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Oct 31, 2024

Fix test compilation failure

8b032a2

Refs pombase/pombase-chado#1223

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Oct 31, 2024

Add UniProt data reference filter flag

e4a749f

Refs pombase/pombase-chado#1223

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue Oct 31, 2024

Fix typo in filter flag name

00a4d1d

Refs pombase/pombase-chado#1223

kimrutherford mentioned this issue Nov 1, 2024

Ability to sort slim results on gene column pombase/website#2246

Closed

kimrutherford mentioned this issue Jan 9, 2025

Auto-fix the new modification position errors from Manu's allele QC code #1250

Open

kimrutherford closed this as completed Jan 9, 2025
