Improve disease/target evidence ingestion from uniprot #3459

Closed
4 of 8 tasks
DSuveges opened this issue Sep 12, 2024 · 6 comments · May be fixed by opentargets/evidence_datasource_parsers#214

DSuveges commented Sep 12, 2024

UniProt provides a curated set of naturally occurring, protein-coding variants that are involved in diseases. This dataset is already captured by the uniprot_variants dataset, produced by a parser developed by the UniProt team. However, recent developments in the platform and the upcoming integration of the genetics product have made it necessary to reconsider the evidence generation process and the data model.

Considerations

  • Evidence generation is not integrated into our pipelines, and the codebase is written in Java, which makes it harder to implement new features and maintain the code.
  • Disease label to EFO mapping is not efficient: we are losing a large amount of evidence (eg. on the BRAF page there are 38 disease-associated variants reported for 6 diseases, however we only get 7 evidence for two diseases).
  • Some of the mappings provided by the Java pipeline are inaccurate.
  • The current pipeline does not map rs identifiers to the conventional variant identifiers (which matters especially for multiallelic variants).
  • The current pipeline does not consider variant/disease annotations that would inform us about target modulation (eg. "elevated kinase activity; efficiently induces cell transformation" is not parsed; all evidence is annotated with the constant "targetModulation":"up_or_down").
  • The current pipeline assesses disease-to-target confidence very crudely: it essentially does a string match (eg. "this entry may act as a disease modifier" would imply a weaker confidence).

TODOs

  • Develop a SPARQL query to retrieve disease/variant/target triples from the UniProt API.
  • Parse valuable data from response.
  • Develop a robust, incremental way to map rsids to variant identifiers (considering matching substituted amino acids!)
  • Integrate with the existing disease mapping pipeline (Ontoma based logic)
  • Parse and classify target modulation data.
  • Parse and classify target/disease confidence.
  • Change disease/target schema if required.
  • Integrate parser into pipeline.
DSuveges added the Data, Platform and New Genetics Project labels on Sep 12, 2024
DSuveges self-assigned this on Sep 12, 2024
@DSuveges

SPARQL data retrieval

  • The developed query fetches 41,475 entries in ~20 seconds from the UniProt API.
  • The data contains evidence for 3,934 unique proteins and 5,248 unique disease labels, established by 26,368 unique rs identifiers.
  • The number of associations before disease mapping is 5,440: this is the number of unique disease/target pairs in the dataset, and the theoretical maximum we can get from it assuming a one-to-one disease to EFO mapping.
  • The SPARQL API can also provide the reference and alternative amino acid for each rsid/disease/protein triple. This is required for variant mapping.
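
A minimal sketch of the retrieval step, assuming the endpoint returns the standard SPARQL 1.1 JSON result format. The query text itself is not reproduced in this issue, so it is passed in as a parameter; the function names and the variable names in the usage note are illustrative assumptions, not the parser's actual code.

```python
import json
import urllib.parse
import urllib.request

# Public UniProt SPARQL endpoint. The actual query used is not shown in this issue.
UNIPROT_SPARQL_URL = "https://sparql.uniprot.org/sparql"


def flatten_bindings(payload: dict) -> list[dict]:
    """Turn a SPARQL 1.1 JSON result into a list of {variable: value} rows."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # Variables left unbound (eg. from OPTIONAL clauses) are absent from a binding.
        rows.append(
            {var: binding[var]["value"] if var in binding else None for var in variables}
        )
    return rows


def run_query(query: str, url: str = UNIPROT_SPARQL_URL) -> list[dict]:
    """POST a SPARQL query and return the flattened rows (needs network access)."""
    data = urllib.parse.urlencode({"query": query}).encode()
    request = urllib.request.Request(
        url, data=data, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        return flatten_bindings(json.load(response))
```

With a query that selects, say, `?protein ?rsid ?disease ?refAA ?altAA` (hypothetical variable names), each returned row would be one rsid/disease/protein triple with its amino acid change, ready to be loaded into a dataframe.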

RsID to variant ID mapping

  • To get variant data, a pipeline was developed to extract the VEP annotation for each overlapping transcript.
  • The VEP API endpoint allows extracting the canonical VCF string (vcf_string=1), which is required to normalise indel variants.
  • Even the POST endpoint won't accept more than 200 rsids at a time, so the pipeline needs to loop through the variants in batches and pool the data together at the end.
  • Requesting VEP data for 200 variants takes ~3 minutes, so mapping the full dataset takes around 7-8 hours. I therefore developed a cache, so that only missing variants are mapped. Upon completing the mapping, the new variants are added to the cache. The cache needs to be refreshed from time to time.
  • When the mapping is done, the evidence data is joined to the mapping by rsid. Then we need to select the right variant identifier.
  • To find the right id, we follow this logic: the identifier of a biallelic variant is automatically accepted. A variant id of a multiallelic variant is accepted if the substituted amino acid matches (this means one rsid might have multiple variant ids if they code for the same amino acid). If there's no matching amino acid for a variant, all options are accepted. This ensures that unusual substitutions, indels and stop gains are not lost in mapping.
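
The batching, caching and identifier selection described above can be sketched roughly as follows. This is an illustration under assumptions, not the parser's code: the cache is a plain dict, the VEP request is hidden behind an injected `fetch_batch` callable, and the record keys `variant_id` and `alt_aa` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator

BATCH_SIZE = 200  # maximum number of rsids per POST request, as noted above


def chunked(items: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def map_with_cache(
    rsids: Iterable[str],
    cache: dict[str, list[dict]],
    fetch_batch: Callable[[list[str]], dict[str, list[dict]]],
) -> dict[str, list[dict]]:
    """Request only the rsids missing from `cache`, then add the results to it."""
    wanted = set(rsids)
    missing = sorted(rsid for rsid in wanted if rsid not in cache)
    for batch in chunked(missing):
        cache.update(fetch_batch(batch))
    # rsids that could not be mapped at all end up with an empty candidate list.
    return {rsid: cache.get(rsid, []) for rsid in wanted}


def select_variant_ids(candidates: list[dict], alt_amino_acid: str) -> list[str]:
    """Pick variant ids for one rsid, given one candidate per alternate allele."""
    all_ids = [candidate["variant_id"] for candidate in candidates]
    if len(candidates) <= 1:
        # Biallelic variant: the single identifier is accepted automatically.
        return all_ids
    # Multiallelic variant: keep the alleles coding for the reported amino acid.
    matching = [c["variant_id"] for c in candidates if c["alt_aa"] == alt_amino_acid]
    # No matching amino acid: accept all options so nothing is lost in mapping.
    return matching or all_ids
```

Injecting `fetch_batch` keeps the slow network call out of the selection logic, so the rules for multiallelic variants can be checked without contacting VEP.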

Disease mapping

The same disease mapping pipeline that we use for other parsers is applied:

from common.ontology import add_efo_mapping

mapped_df = add_efo_mapping(unmapped_df, spark, '.')


DSuveges commented Sep 12, 2024

Comparison with existing evidence set

These comparisons assume valid EFO mappings:

  • New evidence data: 43,890 evidence, 5,133 associations
  • Old evidence data: 41,922 evidence, 4,380 associations
  • New evidence data -> mapped disease/target pairs: 4,732
  • Old evidence data -> mapped disease/target pairs: 3,508

Conclusions:

  • From the original dataset of 5,440 disease/target pairs, the new pipeline mapped 4.7k (87%), whereas the previous pipeline mapped only 3.5k (64%).
  • The old pipeline heavily explodes the data: a single source disease can be mapped to several EFO terms, which multiplies the evidence.
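
The mapping rates follow directly from the counts reported in this thread. A short worked calculation (the helper name is just for illustration):

```python
# Counts reported above: unique disease/target pairs in the source,
# and the pairs that received a valid EFO mapping in each pipeline.
TOTAL_PAIRS = 5440
MAPPED_NEW = 4732
MAPPED_OLD = 3508


def mapped_percentage(mapped: int, total: int) -> float:
    """Share of source disease/target pairs that were mapped, in percent."""
    return round(100 * mapped / total, 1)


new_rate = mapped_percentage(MAPPED_NEW, TOTAL_PAIRS)  # 87.0
old_rate = mapped_percentage(MAPPED_OLD, TOTAL_PAIRS)  # 64.5
gained_pairs = MAPPED_NEW - MAPPED_OLD                 # 1224
```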

Let's look at P05067 vs OMIM:605714. There are four rsids as evidence for this association on the UniProt page. So, assuming perfect mapping, we expect one association and four evidence. This is exactly what we see in the new pipeline:

+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+
|targetFromSourceId|diseaseFromSource                       |diseaseFromSourceId|variantRsId|diseaseFromSourceMappedId|name                                    |
+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |rs63750579 |MONDO_0011583            |cerebral amyloid angiopathy, APP-related|
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |rs63750921 |MONDO_0011583            |cerebral amyloid angiopathy, APP-related|
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |rs63749810 |MONDO_0011583            |cerebral amyloid angiopathy, APP-related|
+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+

However, there are 32 evidence in the old pipeline: that pipeline maps the disease to 8 EFO terms, which causes an 8x explosion:

+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
|targetFromSourceId|diseaseFromSource                       |diseaseFromSourceId|diseaseFromSourceMappedId|name                                                          |
+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324713          |Hereditary cerebral hemorrhage with amyloidosis, Italian type |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_100006          |Hereditary cerebral hemorrhage with amyloidosis, Dutch type   |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324718          |Hereditary cerebral hemorrhage with amyloidosis, Flemish type |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |MONDO_0011583            |cerebral amyloid angiopathy, APP-related                      |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324708          |Hereditary cerebral hemorrhage with amyloidosis, Iowa type    |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324703          |Hereditary cerebral hemorrhage with amyloidosis, Piedmont type|
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324723          |Hereditary cerebral hemorrhage with amyloidosis, Arctic type  |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_85458           |Hereditary cerebral hemorrhage with amyloidosis               |
+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
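
The explosion is a plain cross product of variants and mapped terms. A small sketch of the counting, using the eight mapped identifiers from the table above and placeholder names for the four rsids (the placeholders are not real identifiers):

```python
from itertools import product

# Placeholder labels standing in for the four rsids of P05067 / OMIM:605714.
rsids = [f"rsid_{i}" for i in range(1, 5)]

# The eight terms the old pipeline assigned to OMIM:605714 (from the table above).
old_mapped_ids = [
    "Orphanet_324713", "Orphanet_100006", "Orphanet_324718", "MONDO_0011583",
    "Orphanet_324708", "Orphanet_324703", "Orphanet_324723", "Orphanet_85458",
]
# The single term assigned by the new pipeline.
new_mapped_ids = ["MONDO_0011583"]

# Every rsid is paired with every mapped term, one evidence per pair.
old_evidence = list(product(rsids, old_mapped_ids))
new_evidence = list(product(rsids, new_mapped_ids))

old_evidence_count = len(old_evidence)  # 4 rsids x 8 terms
new_evidence_count = len(new_evidence)  # 4 rsids x 1 term
```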

@DSuveges

Comparing mappings with the previous pipeline

When looking at disease/target pairs in the source, there are only 38 pairs that were mapped to EFO by the previous pipeline but not by the new one. Some of these mappings seem to be relevant; however, a number of them are not found in the EFO slim that our disease index is based on (name is null):

+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|diseaseFromSourceMappedId|diseaseFromSourceId|diseaseFromSource                                            |name                                                                   |
+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|MONDO_0011875            |OMIM:607628        |Epilepsy, idiopathic generalized 11                          |null                                                                   |
|MONDO_0008490            |OMIM:184840        |Otospondylomegaepiphyseal dysplasia, autosomal dominant      |otospondylomegaepiphyseal dysplasia, autosomal dominant                |
|Orphanet_166100          |OMIM:184840        |Otospondylomegaepiphyseal dysplasia, autosomal dominant      |Stickler syndrome type 3                                               |
|EFO_0009080              |OMIM:308100        |Ichthyosis, X-linked                                         |x-linked ichthyosis with steryl-sulfatase deficiency                   |
|MONDO_0010622            |OMIM:308100        |Ichthyosis, X-linked                                         |recessive X-linked ichthyosis                                          |
|MONDO_0013568            |OMIM:614090        |Sick sinus syndrome 3                                        |null                                                                   |
|MONDO_0012161            |OMIM:608957        |Immunodeficiency 116                                         |null                                                                   |
|MONDO_0011650            |OMIM:606217        |Atrioventricular septal defect 2                             |null                                                                   |
|MONDO_0011652            |OMIM:606232        |Phelan-McDermid syndrome                                     |Phelan-McDermid syndrome                                               |
|Orphanet_48652           |OMIM:606232        |Phelan-McDermid syndrome                                     |Monosomy 22q13                                                         |
|MONDO_0859376            |OMIM:620241        |Hydrocephalus, congenital, 5                                 |null                                                                   |
|MONDO_0013957            |OMIM:614893        |Immunodeficiency 32A                                         |null                                                                   |
|MONDO_0011875            |OMIM:607628        |Juvenile absence epilepsy 2                                  |null                                                                   |
|MONDO_0859316            |OMIM:620121        |Iron overload                                                |null                                                                   |
|MONDO_0012843            |OMIM:612269        |Epilepsy, childhood absence 5                                |null                                                                   |
|MONDO_0008633            |OMIM:191900        |Muckle-Wells syndrome                                        |Muckle-Wells syndrome                                                  |
|MONDO_0044315            |OMIM:617439        |Craniosynostosis 7                                           |null                                                                   |
|MONDO_0010389            |OMIM:300645        |Immunodeficiency 34                                          |null                                                                   |
|MONDO_0011776            |OMIM:607115        |Chronic infantile neurologic cutaneous and articular syndrome|CINCA syndrome                                                         |
|MONDO_0008693            |OMIM:200110        |Ablepharon-macrostomia syndrome                              |ablepharon macrostomia syndrome                                        |
|MONDO_0012670            |OMIM:611451        |Deafness, autosomal recessive, 63                            |autosomal recessive nonsyndromic hearing loss 63                       |
|MONDO_0009288            |OMIM:232240        |Glycogen storage disease 1C                                  |glycogen storage disease Ib                                            |
|Orphanet_79259           |OMIM:232240        |Glycogen storage disease 1C                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency type b|
|Orphanet_364             |OMIM:232240        |Glycogen storage disease 1C                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency       |
|MONDO_0011163            |OMIM:601887        |Malignant hyperthermia 5                                     |null                                                                   |
|MONDO_0009288            |OMIM:232220        |Glycogen storage disease 1B                                  |glycogen storage disease Ib                                            |
|Orphanet_364             |OMIM:232220        |Glycogen storage disease 1B                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency       |
|Orphanet_79259           |OMIM:232220        |Glycogen storage disease 1B                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency type b|
|MONDO_0011875            |OMIM:607628        |Juvenile myoclonic epilepsy 8                                |null                                                                   |
|MONDO_0008856            |OMIM:209950        |Immunodeficiency 27A                                         |null                                                                   |
|MONDO_0008853            |OMIM:209885        |Barber-Say syndrome                                          |Barber-Say syndrome                                                    |
|MONDO_0013955            |OMIM:614891        |Immunodeficiency 30                                          |null                                                                   |
|MONDO_0010576            |OMIM:304400        |Deafness, X-linked, 2                                        |X-linked mixed hearing loss with perilymphatic gusher                  |
|Orphanet_383             |OMIM:304400        |Deafness, X-linked, 2                                        |X-linked mixed deafness with perilymphatic gusher                      |
|MONDO_0013956            |OMIM:614892        |Immunodeficiency 31A                                         |null                                                                   |
|MONDO_0030334            |OMIM:619441        |Encephalitis, acute, infection (viral)-induced, 11           |null                                                                   |
|EFO_0004190              |OMIM:609887        |Glaucoma 1, open angle, G                                    |open-angle glaucoma                                                    |
|MONDO_0012141            |OMIM:608864        |Non-syndromic orofacial cleft 6                              |null                                                                   |
|MONDO_0030004            |OMIM:618830        |Autism 20                                                    |null                                                                   |
|MONDO_0013498            |OMIM:613950        |Schizophrenia 15                                             |null                                                                   |
|MONDO_0011159            |OMIM:601868        |Deafness, autosomal dominant, 13                             |null                                                                   |
|MONDO_0009335            |OMIM:235400        |Hemolytic uremic syndrome, atypical, 1                       |null                                                                   |
|MONDO_0007349            |OMIM:120100        |Familial cold autoinflammatory syndrome 1                    |familial cold autoinflammatory syndrome 1                              |
|Orphanet_47045           |OMIM:120100        |Familial cold autoinflammatory syndrome 1                    |Familial cold urticaria                                                |
|MONDO_0014710            |OMIM:616622        |Immunodeficiency 42                                          |null                                                                   |
|MONDO_0044206            |OMIM:215150        |Otospondylomegaepiphyseal dysplasia, autosomal recessive     |otospondylomegaepiphyseal dysplasia, autosomal recessive               |
|MONDO_0007849            |OMIM:148200        |Keratoendothelitis fugax hereditaria                         |keratitis fugax hereditaria                                            |
+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
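
A rough sketch of the check behind the `name` column, assuming the disease index is available as an id-to-name lookup. The lookup below is a tiny excerpt built from rows of the table above, and the function name is hypothetical.

```python
def split_by_disease_index(
    mappings: list[tuple[str, str]], index_names: dict[str, str]
) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    """Separate (source id, mapped id) pairs by presence in the disease index."""
    in_index, not_in_index = [], []
    for source_id, mapped_id in mappings:
        # A mapped id missing from the index corresponds to `name is null`.
        target = in_index if mapped_id in index_names else not_in_index
        target.append((source_id, mapped_id))
    return in_index, not_in_index


# Excerpt only: two terms that have a name in the table above.
index_names = {
    "MONDO_0008490": "otospondylomegaepiphyseal dysplasia, autosomal dominant",
    "Orphanet_166100": "Stickler syndrome type 3",
}
old_mappings = [
    ("OMIM:607628", "MONDO_0011875"),
    ("OMIM:184840", "MONDO_0008490"),
    ("OMIM:184840", "Orphanet_166100"),
    ("OMIM:614090", "MONDO_0013568"),
]
in_index, not_in_index = split_by_disease_index(old_mappings, index_names)
```

Pairs in `not_in_index` could not be displayed in the platform even under the old pipeline, so they do not count as real losses.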

Altogether, I'm happy with the performance compared to the existing pipeline.

@prashantuniyal02

@DSuveges, can we close this issue as completed?

@DSuveges

No, we are ~75% of the way there. Unfortunately, in the upcoming weeks I won't have the capacity to push it to completion, but I'm aiming for the December release.

@DSuveges

I'm closing this ticket because the new way of evidence generation works and is tested. However, the enrichment of the evidence with directionality will be handled as a separate effort.
