Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ClinVar pipeline for new XML format #1652

Merged
merged 10 commits into from
Nov 19, 2024
Merged

Update ClinVar pipeline for new XML format #1652

merged 10 commits into from
Nov 19, 2024

Conversation

rileyhgrant
Copy link
Contributor

@rileyhgrant rileyhgrant commented Nov 5, 2024

Resolves #1569

  • Updates the ClinVar pipeline's XML parsing step to parse the new ClinVar XML format
  • Minor edits to accomodate ClinVar updates
    • On the frontend, renames ClinVar's 'Clinical significance" to "germline classification", following ClinVar's change in terminology
    • Adds entries to the Gold Star mapping dictionary to accomodate new terminology
  • Bumps Hail to 0.2.128 to allow for VEPing of GRCh37 varints using Hail's built in convenience function
  • Still includes the commented out Aliases for ClinVar indices along with a "TODO:" comment. Following the practice of several previous PRs, temporarily specifying an index, rather than an alias, in this PR will allow for an easy rollback if needed. Once this proves stable in production, another PR will go back to using the aliases

Screenshots of a local browser and API querying an indice in the production ES cluster that uses data from the output of this PR:

  • Screenshot 2024-11-08 at 16 00 52

  • Screenshot 2024-11-08 at 16 01 06

  • Screenshot 2024-11-08 at 16 01 09

  • Screenshot 2024-11-08 at 16 01 19

  • Screenshot 2024-11-08 at 16 02 31

  • Screenshot 2024-11-08 at 16 02 37

Copy link
Contributor

@phildarnowsky-broad phildarnowsky-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff overall, I did spot some things, mainly legacy stuff that we should take this opportunity to clean up.

data-pipeline/src/data_pipeline/datasets/clinvar.py Outdated Show resolved Hide resolved
data-pipeline/src/data_pipeline/datasets/clinvar.py Outdated Show resolved Hide resolved
data-pipeline/src/data_pipeline/datasets/clinvar.py Outdated Show resolved Hide resolved
Both our CI steps that involve python use 3.9, and the existing pyproject.toml file also uses this version, pin this in tool versions so it exists as plaintext and so tools like asdf and mise can pick this up
As of several months ago, support for multi region buckets that supplied the VEP specific data for Hail's built in GRCh37 VEP helper became prohibitively expensive. The Hail team moved towards only supporting specific regions, further usage of this feature requires a bump to at least Hail 0.2.128
@rileyhgrant
Copy link
Contributor Author

Heya @phildarnowsky-broad, thanks for the review.

I believe I've addressed all the comments, with one new fixup commit per comment. I've tagged you for re-review, at your convenience.

Copy link
Contributor

@phildarnowsky-broad phildarnowsky-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go, nice work

@rileyhgrant rileyhgrant force-pushed the update-clinvar branch 2 times, most recently from b1b00fb to d780598 Compare November 19, 2024 18:24
@rileyhgrant rileyhgrant merged commit 9503aaf into main Nov 19, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update ClinVar pipeline
2 participants