Improve disease/target evidence ingestion from uniprot #3459

DSuveges · 2024-09-12T12:46:33Z

Uniprot provides a curated set of naturally occurring, protein-coding variants that are involved in diseases. This dataset is already captured by the uniprot_variants dataset produced by the parser developed by the Uniprot team. However, recent developments in the platform and the upcoming integration of the genetics product make it necessary to reconsider the evidence generation process and the data model.

Considerations

Evidence generation is not integrated into our pipelines: the codebase is written in Java, which makes it harder to implement new features and maintain the code.
Disease label to EFO mapping is not efficient: we are losing a large amount of evidence (e.g. the BRAF page reports 38 disease-associated variants for 6 diseases, but we only get 7 pieces of evidence for 2 diseases).
Some of the mappings provided by the Java pipeline are inaccurate.
The current pipeline does not map rs identifiers to the conventional variant identifiers (especially relevant for multiallelic variants).
The current pipeline does not consider variant/disease annotation that would inform us about target modulation (e.g. "elevated kinase activity; efficiently induces cell transformation" is not parsed; all evidence is annotated with the constant "targetModulation":"up_or_down").
The current pipeline applies a very crude way to assess disease-to-target confidence (essentially a string match: e.g. "this entry may act as a disease modifier" would imply a weaker confidence).
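
For illustration, the string-match heuristic from the last point can be sketched as below. Only the quoted phrase comes from the annotation text; the function name and the "low"/"high" labels are assumptions made for this sketch and are not taken from the Java pipeline.

```python
# Phrase quoted above; the returned labels are illustrative assumptions.
MODIFIER_PHRASE = "this entry may act as a disease modifier"


def crude_confidence(disease_note: str) -> str:
    """Return a weaker confidence label when the modifier phrase occurs in the note."""
    return "low" if MODIFIER_PHRASE in disease_note.lower() else "high"
```

Any note that words the same caveat differently is silently treated as high confidence, which is why such a match is fragile.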

TODOs

Develop a SPARQL query to retrieve disease/variant/target triples from the Uniprot API.
Parse the relevant data from the response.
Develop a robust, incremental way to map rsids to variant identifiers (considering matching substituted amino acids!)
Integrate with the existing disease mapping pipeline (OnToma-based logic)
Parse and classify target modulation data.
Parse and classify target/disease confidence.
Change disease/target schema if required.
Integrate parser into pipeline.


DSuveges · 2024-09-12T13:11:47Z

SPARQL data retrieval

The developed query fetches 41,475 entries in ~20 seconds from the Uniprot API.
The data contains evidence for 3,934 unique proteins and 5,248 unique disease labels, established by 26,368 unique rs identifiers.
The number of unmapped associations is 5,440. This is the number of unique disease/target pairs in the dataset, and it is the theoretical maximum we can get from this dataset assuming a 1-to-1 disease to EFO mapping.
The SPARQL API can also provide the reference and alternative amino acid for each rsid/disease/protein pair. This is required for variant mapping.
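
As a rough sketch of how such counts can be derived, the snippet below flattens a response in the standard SPARQL 1.1 JSON results layout and counts unique values. The variable names `protein`, `disease` and `rsid` are assumptions (they depend on the query's SELECT clause, which is not shown here), and the payload is a toy example built from identifiers that appear elsewhere in this issue.

```python
from typing import Any, Dict, List


def flatten_bindings(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn a SPARQL 1.1 JSON result into a list of {variable: value} rows."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def summarise(rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Count entries and unique proteins, diseases, rsids and disease/target pairs."""
    return {
        "entries": len(rows),
        "proteins": len({r["protein"] for r in rows}),
        "diseases": len({r["disease"] for r in rows}),
        "rsids": len({r["rsid"] for r in rows}),
        "disease_target_pairs": len({(r["protein"], r["disease"]) for r in rows}),
    }


# Toy payload; the variable names are hypothetical.
toy_payload = {
    "head": {"vars": ["protein", "disease", "rsid"]},
    "results": {"bindings": [
        {"protein": {"type": "literal", "value": "P05067"},
         "disease": {"type": "literal", "value": "OMIM:605714"},
         "rsid": {"type": "literal", "value": "rs63750579"}},
        {"protein": {"type": "literal", "value": "P05067"},
         "disease": {"type": "literal", "value": "OMIM:605714"},
         "rsid": {"type": "literal", "value": "rs63750921"}},
    ]},
}

rows = flatten_bindings(toy_payload)
summary = summarise(rows)
```

With the toy payload, two entries collapse into a single disease/target pair, which is the quantity used as the theoretical maximum above.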

RsID to variant ID mapping

To get variant data, a pipeline was developed to extract VEP annotation for each overlapping transcript.
The VEP API endpoint allows extracting the canonical VCF string (vcf_string=1), which is required to normalise indel variants.
Even the POST endpoint won't accept more than 200 rsids at a time, so the pipeline needs to loop through the variants and pool the data together at the end.
Requesting VEP data for 200 variants takes ~3 minutes, so mapping the full dataset takes around 7-8 hours. I therefore added a cache, so that only missing variants are mapped. Once mapping completes, the new variants are added to the cache. The cache needs to be refreshed from time to time.
When mapping is done, the evidence data is joined to the mapping by the rsid. Then we need to select the right variant identifier.
To find the right ID, we follow this logic: the identifier of a biallelic variant is accepted automatically. A variant ID of a multiallelic variant is accepted if the substituted amino acid matches (so one rsid might have multiple variant IDs if they code for the same amino acid). If no amino acid matches for a variant, all options are accepted. This ensures that unusual substitutions, indels and stop gains are not lost in mapping.
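
The batching, caching and selection steps above can be sketched as follows. This is a minimal illustration, not the parser's code: `fetch` is a stand-in for whatever function wraps the VEP POST request, and the cache layout, the `(variant_id, amino_acid)` tuples and the variant ID strings used with it are assumptions.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

BATCH_SIZE = 200  # upper limit of rsids per POST request, as stated above

# One candidate per alternative allele: (variant_id, alternative_amino_acid).
Candidate = Tuple[str, Optional[str]]
Cache = Dict[str, List[Candidate]]


def batches(items: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def map_missing(rsids: Iterable[str], cache: Cache,
                fetch: Callable[[List[str]], Cache]) -> Cache:
    """Look up only the rsids absent from the cache, then add the results to it."""
    missing = sorted(set(rsids) - set(cache))
    for batch in batches(missing):
        cache.update(fetch(batch))
    return cache


def select_variant_ids(candidates: List[Candidate], alt_aa: str) -> List[str]:
    """Apply the acceptance rules to the candidates of a single rsid."""
    if len(candidates) <= 1:
        # Biallelic variant: the only identifier is accepted automatically.
        return [variant_id for variant_id, _ in candidates]
    matching = [variant_id for variant_id, aa in candidates if aa == alt_aa]
    # Multiallelic variant: keep the matching alleles, or all of them if none match.
    return matching or [variant_id for variant_id, _ in candidates]
```

Passing `fetch` in as an argument keeps the slow network call out of the selection logic, so the rules can be checked against hand-written candidates without contacting the API.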

Disease mapping

The same disease mapping pipeline that we use for other parsers is applied:

from common.ontology import add_efo_mapping

mapped_df = add_efo_mapping(unmapped_df, spark, '.')

DSuveges · 2024-09-12T13:39:16Z

Comparison with existing evidence set

These comparisons expect valid EFO mappings:

New evidence data: evidence: 43,890; associations: 5,133
Old evidence data: evidence: 41,922; associations: 4,380
New evidence data -> mapped disease/target pairs: 4,732
Old evidence data -> mapped disease/target pairs: 3,508
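
The share of source pairs recovered by each pipeline follows from these counts and the 5,440 unique disease/target pairs reported for the source data. A short check of that arithmetic (the variable names are only for this sketch):

```python
SOURCE_PAIRS = 5_440  # unique disease/target pairs in the source data

mapped_pairs = {"new": 4_732, "old": 3_508}

# Fraction of source pairs that ended up with a valid EFO mapping.
coverage = {name: count / SOURCE_PAIRS for name, count in mapped_pairs.items()}

for name, fraction in coverage.items():
    print(f"{name}: {fraction:.1%}")
# new: 87.0%
# old: 64.5%
```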

Conclusions:

Of the 5,440 disease/target pairs in the original dataset, we could map 4.7k (87%); in the previous pipeline this number was only 3.5k (64%).
The old pipeline heavily explodes the data.

Let's look at P05067 vs OMIM:605714. There are four rsids as evidence for this association on the Uniprot page. So, assuming perfect mapping, this would mean one association and 4 pieces of evidence. This is exactly what we see in the new pipeline:

+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+
|targetFromSourceId|diseaseFromSource                       |diseaseFromSourceId|variantRsId|diseaseFromSourceMappedId|name                                    |
+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |rs63750579 |MONDO_0011583            |cerebral amyloid angiopathy, APP-related|
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |rs63750921 |MONDO_0011583            |cerebral amyloid angiopathy, APP-related|
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |rs63749810 |MONDO_0011583            |cerebral amyloid angiopathy, APP-related|
+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+

However, there are 32 pieces of evidence in the old pipeline: an 8x explosion, because that pipeline maps the disease to 8 EFO terms:

+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
|targetFromSourceId|diseaseFromSource                       |diseaseFromSourceId|diseaseFromSourceMappedId|name                                                          |
+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324713          |Hereditary cerebral hemorrhage with amyloidosis, Italian type |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_100006          |Hereditary cerebral hemorrhage with amyloidosis, Dutch type   |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324718          |Hereditary cerebral hemorrhage with amyloidosis, Flemish type |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |MONDO_0011583            |cerebral amyloid angiopathy, APP-related                      |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324708          |Hereditary cerebral hemorrhage with amyloidosis, Iowa type    |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324703          |Hereditary cerebral hemorrhage with amyloidosis, Piedmont type|
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324723          |Hereditary cerebral hemorrhage with amyloidosis, Arctic type  |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_85458           |Hereditary cerebral hemorrhage with amyloidosis               |
+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
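
The arithmetic behind the explosion can be reproduced with a cross product of variants and mapped disease IDs. In this sketch the rsid labels are placeholders standing in for the four rsids of this association; the disease IDs are the ones listed in the table above.

```python
from itertools import product

# Placeholder labels for the four rsids reported for this association.
rsids = ["rsid_1", "rsid_2", "rsid_3", "rsid_4"]

# The eight disease IDs the old pipeline assigned to OMIM:605714.
old_mappings = [
    "Orphanet_324713", "Orphanet_100006", "Orphanet_324718", "MONDO_0011583",
    "Orphanet_324708", "Orphanet_324703", "Orphanet_324723", "Orphanet_85458",
]
# The single disease ID assigned by the new pipeline.
new_mappings = ["MONDO_0011583"]

# Every variant is repeated once per mapped disease ID.
old_evidence = list(product(rsids, old_mappings))
new_evidence = list(product(rsids, new_mappings))

print(len(old_evidence), len(new_evidence))  # 32 4
```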

DSuveges · 2024-09-12T14:08:36Z

Comparing mappings with the previous pipeline

Looking at disease/target pairs in the source, there are only 38 pairs that were not mapped to EFO by the new pipeline. Some of the mappings seem relevant; however, a number of them are not found in the EFO slim that our disease index is based on (name is null):

+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|diseaseFromSourceMappedId|diseaseFromSourceId|diseaseFromSource                                            |name                                                                   |
+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|MONDO_0011875            |OMIM:607628        |Epilepsy, idiopathic generalized 11                          |null                                                                   |
|MONDO_0008490            |OMIM:184840        |Otospondylomegaepiphyseal dysplasia, autosomal dominant      |otospondylomegaepiphyseal dysplasia, autosomal dominant                |
|Orphanet_166100          |OMIM:184840        |Otospondylomegaepiphyseal dysplasia, autosomal dominant      |Stickler syndrome type 3                                               |
|EFO_0009080              |OMIM:308100        |Ichthyosis, X-linked                                         |x-linked ichthyosis with steryl-sulfatase deficiency                   |
|MONDO_0010622            |OMIM:308100        |Ichthyosis, X-linked                                         |recessive X-linked ichthyosis                                          |
|MONDO_0013568            |OMIM:614090        |Sick sinus syndrome 3                                        |null                                                                   |
|MONDO_0012161            |OMIM:608957        |Immunodeficiency 116                                         |null                                                                   |
|MONDO_0011650            |OMIM:606217        |Atrioventricular septal defect 2                             |null                                                                   |
|MONDO_0011652            |OMIM:606232        |Phelan-McDermid syndrome                                     |Phelan-McDermid syndrome                                               |
|Orphanet_48652           |OMIM:606232        |Phelan-McDermid syndrome                                     |Monosomy 22q13                                                         |
|MONDO_0859376            |OMIM:620241        |Hydrocephalus, congenital, 5                                 |null                                                                   |
|MONDO_0013957            |OMIM:614893        |Immunodeficiency 32A                                         |null                                                                   |
|MONDO_0011875            |OMIM:607628        |Juvenile absence epilepsy 2                                  |null                                                                   |
|MONDO_0859316            |OMIM:620121        |Iron overload                                                |null                                                                   |
|MONDO_0012843            |OMIM:612269        |Epilepsy, childhood absence 5                                |null                                                                   |
|MONDO_0008633            |OMIM:191900        |Muckle-Wells syndrome                                        |Muckle-Wells syndrome                                                  |
|MONDO_0044315            |OMIM:617439        |Craniosynostosis 7                                           |null                                                                   |
|MONDO_0010389            |OMIM:300645        |Immunodeficiency 34                                          |null                                                                   |
|MONDO_0011776            |OMIM:607115        |Chronic infantile neurologic cutaneous and articular syndrome|CINCA syndrome                                                         |
|MONDO_0008693            |OMIM:200110        |Ablepharon-macrostomia syndrome                              |ablepharon macrostomia syndrome                                        |
|MONDO_0012670            |OMIM:611451        |Deafness, autosomal recessive, 63                            |autosomal recessive nonsyndromic hearing loss 63                       |
|MONDO_0009288            |OMIM:232240        |Glycogen storage disease 1C                                  |glycogen storage disease Ib                                            |
|Orphanet_79259           |OMIM:232240        |Glycogen storage disease 1C                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency type b|
|Orphanet_364             |OMIM:232240        |Glycogen storage disease 1C                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency       |
|MONDO_0011163            |OMIM:601887        |Malignant hyperthermia 5                                     |null                                                                   |
|MONDO_0009288            |OMIM:232220        |Glycogen storage disease 1B                                  |glycogen storage disease Ib                                            |
|Orphanet_364             |OMIM:232220        |Glycogen storage disease 1B                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency       |
|Orphanet_79259           |OMIM:232220        |Glycogen storage disease 1B                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency type b|
|MONDO_0011875            |OMIM:607628        |Juvenile myoclonic epilepsy 8                                |null                                                                   |
|MONDO_0008856            |OMIM:209950        |Immunodeficiency 27A                                         |null                                                                   |
|MONDO_0008853            |OMIM:209885        |Barber-Say syndrome                                          |Barber-Say syndrome                                                    |
|MONDO_0013955            |OMIM:614891        |Immunodeficiency 30                                          |null                                                                   |
|MONDO_0010576            |OMIM:304400        |Deafness, X-linked, 2                                        |X-linked mixed hearing loss with perilymphatic gusher                  |
|Orphanet_383             |OMIM:304400        |Deafness, X-linked, 2                                        |X-linked mixed deafness with perilymphatic gusher                      |
|MONDO_0013956            |OMIM:614892        |Immunodeficiency 31A                                         |null                                                                   |
|MONDO_0030334            |OMIM:619441        |Encephalitis, acute, infection (viral)-induced, 11           |null                                                                   |
|EFO_0004190              |OMIM:609887        |Glaucoma 1, open angle, G                                    |open-angle glaucoma                                                    |
|MONDO_0012141            |OMIM:608864        |Non-syndromic orofacial cleft 6                              |null                                                                   |
|MONDO_0030004            |OMIM:618830        |Autism 20                                                    |null                                                                   |
|MONDO_0013498            |OMIM:613950        |Schizophrenia 15                                             |null                                                                   |
|MONDO_0011159            |OMIM:601868        |Deafness, autosomal dominant, 13                             |null                                                                   |
|MONDO_0009335            |OMIM:235400        |Hemolytic uremic syndrome, atypical, 1                       |null                                                                   |
|MONDO_0007349            |OMIM:120100        |Familial cold autoinflammatory syndrome 1                    |familial cold autoinflammatory syndrome 1                              |
|Orphanet_47045           |OMIM:120100        |Familial cold autoinflammatory syndrome 1                    |Familial cold urticaria                                                |
|MONDO_0014710            |OMIM:616622        |Immunodeficiency 42                                          |null                                                                   |
|MONDO_0044206            |OMIM:215150        |Otospondylomegaepiphyseal dysplasia, autosomal recessive     |otospondylomegaepiphyseal dysplasia, autosomal recessive               |
|MONDO_0007849            |OMIM:148200        |Keratoendothelitis fugax hereditaria                         |keratitis fugax hereditaria                                            |
+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
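
The "name is null" check amounts to a left join of the mapped IDs against the disease index. A plain-Python sketch of that idea, using a few rows from the table above; the dictionary is a toy stand-in for the disease index, not its real structure.

```python
# Toy disease index (ID -> name), standing in for the EFO slim based index.
disease_index = {
    "MONDO_0008490": "otospondylomegaepiphyseal dysplasia, autosomal dominant",
    "EFO_0004190": "open-angle glaucoma",
}

mapped_ids = ["MONDO_0011875", "MONDO_0008490", "EFO_0004190", "MONDO_0013568"]

# dict.get returns None for IDs missing from the index, mirroring the null names.
names = {mapped_id: disease_index.get(mapped_id) for mapped_id in mapped_ids}
not_in_slim = [mapped_id for mapped_id, name in names.items() if name is None]

print(not_in_slim)  # ['MONDO_0011875', 'MONDO_0013568']
```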

Altogether, I'm happy with the performance compared to the existing pipeline.

prashantuniyal02 · 2024-09-20T13:15:31Z

@DSuveges, can we close this issue as completed?

DSuveges · 2024-09-20T14:32:53Z

No, we are ~75% of the way there. Unfortunately, in the upcoming weeks I won't have the capacity to push it to completion, but I'm aiming for the December release.

DSuveges · 2024-10-15T09:06:43Z

I'm closing this ticket because the new way of evidence generation works and is tested. The enrichment of the evidence with directionality will be sorted out as a separate effort.

DSuveges added the Data, Platform and New Genetics Project labels Sep 12, 2024

DSuveges self-assigned this Sep 12, 2024

DSuveges linked a pull request Sep 12, 2024 that will close this issue

feat: adding parser for uniprot_variants evidence opentargets/evidence_datasource_parsers#214

Open

prashantuniyal02 removed the New Genetics Project label Sep 16, 2024

DSuveges closed this as completed Oct 15, 2024

DSuveges mentioned this issue Oct 15, 2024

Sorting out Uniprot literature evidence #3568

Open

3 tasks
