Improve disease/target evidence ingestion from uniprot #3459
SPARQL data retrieval
RsID to variant ID mapping
Disease mapping
The usual disease mapping pipeline is applied, the same one we use for other parsers:
from common.ontology import add_efo_mapping
mapped_df = add_efo_mapping(unmapped_df, spark, '.')
Comparison with existing evidence set
These comparisons expect valid EFO mappings.
Conclusions:
Take P05067 vs OMIM:605714 as an example. There are four rsIDs listed as evidence for this association on the uniprot page, so assuming perfect mapping we expect one association and 4 pieces of evidence. This is exactly what we see in the new pipeline.
However, there are 32 pieces of evidence in the old pipeline: that pipeline maps the disease to 8 EFOs, which causes an 8x explosion (4 x 8 = 32).
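The arithmetic behind this explosion can be sketched in a few lines of plain Python. This is only an illustration, not the pipeline code: the rsIDs, the EFO identifiers and the field names (`target`, `rsid`, `efo`) are hypothetical placeholders, since the actual values are not listed in this thread. It assumes one piece of evidence is generated per (variant, mapped disease) combination.

```python
from itertools import product

# Hypothetical placeholders: the real rsIDs and EFO terms are not given in this thread.
rsids = [f"rsid_{i}" for i in range(1, 5)]                    # 4 variants on the uniprot page
new_efo_terms = ["EFO_single_mapping"]                        # new pipeline: 1 mapped disease
old_efo_terms = [f"EFO_candidate_{i}" for i in range(1, 9)]   # old pipeline: 8 mapped diseases


def build_evidence(target, rsids, efo_terms):
    """Create one piece of evidence per (variant, mapped disease) combination."""
    return [
        {"target": target, "rsid": rsid, "efo": efo}
        for rsid, efo in product(rsids, efo_terms)
    ]


def count_associations(evidence):
    """An association is a unique (target, mapped disease) pair."""
    return len({(e["target"], e["efo"]) for e in evidence})


new_evidence = build_evidence("P05067", rsids, new_efo_terms)
old_evidence = build_evidence("P05067", rsids, old_efo_terms)

print(len(new_evidence), count_associations(new_evidence))  # 4 evidence, 1 association
print(len(old_evidence), count_associations(old_evidence))  # 32 evidence, 8 associations
```

The same one-to-many effect appears in any join between evidence rows and a disease mapping table, so counting mapped terms per source disease is a quick way to spot it.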
Comparing mappings with the previous pipeline
When looking at disease/target pairs in the source, there are only 38 pairs that were not mapped to EFO by the new pipeline. Some of the old mappings for these pairs seem relevant; however, a number of them are not found in the EFO slim that our disease index is based on (
Altogether, I'm happy with the performance compared to the existing pipeline.
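For reference, the comparison of mapped pairs can be sketched as a set difference followed by a check against the EFO slim. This is a minimal sketch under stated assumptions, not the code used for the analysis: the function names, the pair tuples and the identifiers below are hypothetical placeholders.

```python
def find_unmapped_pairs(source_pairs, mapped_pairs):
    """Return disease/target pairs present in the source but missing from the mapped set."""
    return sorted(set(source_pairs) - set(mapped_pairs))


def split_by_slim(candidate_mappings, efo_slim_ids):
    """Split candidate mappings by whether the proposed EFO term is in the EFO slim."""
    in_slim = {p: efo for p, efo in candidate_mappings.items() if efo in efo_slim_ids}
    not_in_slim = {p: efo for p, efo in candidate_mappings.items() if efo not in efo_slim_ids}
    return in_slim, not_in_slim


# Hypothetical placeholder data: (disease, target) pairs from the source.
source_pairs = [("disease_a", "target_1"), ("disease_b", "target_2"), ("disease_c", "target_3")]
# Pairs for which the new pipeline produced an EFO mapping.
mapped_pairs = [("disease_a", "target_1")]
# Mappings proposed by the previous pipeline for the same pairs.
old_mappings = {("disease_b", "target_2"): "EFO_x", ("disease_c", "target_3"): "EFO_y"}
# Terms available in the slim that the disease index is built from.
efo_slim_ids = {"EFO_x"}

unmapped = find_unmapped_pairs(source_pairs, mapped_pairs)
candidates = {pair: old_mappings[pair] for pair in unmapped if pair in old_mappings}
in_slim, not_in_slim = split_by_slim(candidates, efo_slim_ids)

print(len(unmapped), len(in_slim), len(not_in_slim))
```

Only the pairs in `in_slim` would be recoverable mappings; those in `not_in_slim` cannot appear in the disease index regardless of which pipeline produced them.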
@DSuveges, can we close this issue as completed?
No, we are ~75% of the way. Unfortunately, in the upcoming weeks I won't have the capacity to push it to completion, but I'm aiming for the December release.
I'm closing this ticket because the new way of evidence generation works and is tested. The enrichment of the evidence with directionality will be sorted out as a separate effort.
Uniprot provides a curated set of naturally occurring, protein-coding variants that are involved in diseases. This dataset is already captured by the uniprot_variants dataset, produced by the parser developed by the Uniprot team. However, recent developments in the platform and the upcoming integration of the genetics product made it necessary to reconsider the evidence generation process and the data model.
Consideration
elevated kinase activity; efficiently induces cell transformation
is not parsed; all evidence is annotated with the constant: "targetModulation": "up_or_down"
)this entry may act as a disease modifier
would imply a weaker confidence)
TODOs