Re-train paper ranking model with data from curated_papers.tsv #1248

Merged
merged 16 commits into from
Nov 27, 2024

Conversation

Collaborator

@nagutm nagutm commented Nov 3, 2024

This pull request updates the paper ranking model so that re-training includes data from the curated_papers.tsv file and so that curated publications are excluded from re-appearing in the predictions.tsv file.

codecov bot commented Nov 3, 2024

Codecov Report

Attention: Patch coverage is 84.61538% with 4 lines in your changes missing coverage. Please review.

Project coverage is 46.62%. Comparing base (8950e70) to head (e94d7e9).
Report is 171 commits behind head on main.

Files with missing lines                     Patch %   Lines
src/bioregistry/analysis/paper_ranking.py    84.61%    2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1248      +/-   ##
==========================================
+ Coverage   42.51%   46.62%   +4.10%     
==========================================
  Files         117      118       +1     
  Lines        8327     8286      -41     
  Branches     1963     1362     -601     
==========================================
+ Hits         3540     3863     +323     
+ Misses       4582     4238     -344     
+ Partials      205      185      -20     



pmids_to_fetch = curated_df["pubmed"].tolist()
fetched_metadata = {}
for chunk in [pmids_to_fetch[i : i + 200] for i in range(0, len(pmids_to_fetch), 200)]:
Member

There's an itertools (or more_itertools?) function for chunking that would make this code more idiomatic.
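For illustration, here is a minimal sketch of the kind of chunking helper this comment points toward. It is not the code that was merged; the `chunked` helper and the sample PMID list are assumptions made for the example. On Python 3.12+ `itertools.batched` covers the same need (it yields tuples), and `more_itertools.chunked` is a third-party equivalent that yields lists.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most ``size`` items from ``iterable``.

    Hypothetical helper for illustration; not part of the pull request.
    """
    iterator = iter(iterable)
    # islice pulls up to ``size`` items; an empty list ends the loop.
    while chunk := list(islice(iterator, size)):
        yield chunk


# Hypothetical stand-in for curated_df["pubmed"].tolist()
pmids_to_fetch = [str(n) for n in range(1, 451)]

# Same batch size (200) as the snippet under review
chunk_sizes = [len(chunk) for chunk in chunked(pmids_to_fetch, 200)]
```

With a helper like this, the loop header in the reviewed snippet could read `for chunk in chunked(pmids_to_fetch, 200):` with no manual index arithmetic.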

Collaborator Author

nagutm commented Nov 22, 2024

The latest commit introduces a unit test for the paper_ranking model to serve as a smoke test. The test ensures that the pipeline runs successfully without errors and generates the expected output files when provided with valid inputs.

It executes the paper_ranking CLI pipeline via the CliRunner and checks that:

  • The pipeline runs successfully and exits with code 0.
  • The output files evaluation.tsv, importances.tsv, and predictions.tsv are generated in the expected output directory.
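A smoke test of this shape might look roughly like the sketch below. It assumes the CliRunner in question is `click.testing.CliRunner`; the `fake_pipeline` command and its `--output-dir` option are hypothetical stand-ins so the example is self-contained, and they do not reflect the real paper_ranking CLI or its options.

```python
import tempfile
from pathlib import Path

import click
from click.testing import CliRunner

EXPECTED_FILES = ("evaluation.tsv", "importances.tsv", "predictions.tsv")


@click.command()
@click.option("--output-dir", type=click.Path(file_okay=False), required=True)
def fake_pipeline(output_dir):
    """Hypothetical stand-in for the real CLI: only writes placeholder files."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    for name in EXPECTED_FILES:
        directory.joinpath(name).write_text("placeholder\n")


def run_smoke_test():
    """Invoke the command and return the names of any missing output files."""
    with tempfile.TemporaryDirectory() as tmp:
        output_dir = Path(tmp) / "output"
        result = CliRunner().invoke(fake_pipeline, ["--output-dir", str(output_dir)])
        # First check: the command must exit cleanly.
        assert result.exit_code == 0, result.output
        # Second check: every expected output file must exist.
        return [name for name in EXPECTED_FILES if not (output_dir / name).is_file()]
```

A test function would then assert that `run_smoke_test()` returns an empty list. Such a test only confirms that the pipeline runs end to end and writes its files; it does not validate their contents.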

@bgyori bgyori merged commit b5cffaf into biopragmatics:main Nov 27, 2024
15 checks passed
3 participants