Re-train paper ranking model with data from `curated_papers.tsv` #1248

nagutm · 2024-11-03T01:44:35Z

This pull request updates the paper ranking model to include data from the `curated_papers.tsv` file when re-training, and to exclude curated publications from re-appearing in the `predictions.tsv` file.

codecov · 2024-11-03T01:59:38Z

Codecov Report

Attention: Patch coverage is 84.61538% with 4 lines in your changes missing coverage. Please review.

Project coverage is 46.62%. Comparing base (8950e70) to head (e94d7e9).
Report is 171 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/bioregistry/analysis/paper_ranking.py | 84.61% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1248      +/-   ##
==========================================
+ Coverage   42.51%   46.62%   +4.10%     
==========================================
  Files         117      118       +1     
  Lines        8327     8286      -41     
  Branches     1963     1362     -601     
==========================================
+ Hits         3540     3863     +323     
+ Misses       4582     4238     -344     
+ Partials      205      185      -20


src/bioregistry/analysis/paper_ranking.py

cthoyt · 2024-11-03T11:44:36Z

+
+    pmids_to_fetch = curated_df["pubmed"].tolist()
+    fetched_metadata = {}
+    for chunk in [pmids_to_fetch[i : i + 200] for i in range(0, len(pmids_to_fetch), 200)]:


There's an itertools (or more-itertools?) function for chunking that would make this code more idiomatic.
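For reference, a minimal sketch of what the chunking suggestion could look like. The `chunked` helper below is a hypothetical, stdlib-only stand-in written for illustration; it is not code from this PR. `more_itertools.chunked` behaves similarly, and Python 3.12+ also ships `itertools.batched`, which yields tuples instead of lists. The PMID values are made up.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most ``size`` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls up to ``size`` items per pass; an empty list ends the loop.
    while chunk := list(islice(iterator, size)):
        yield chunk


# Made-up identifiers standing in for curated_df["pubmed"].tolist()
pmids_to_fetch = [str(n) for n in range(1, 451)]

# Same batch size of 200 as in the reviewed snippet
chunk_sizes = [len(chunk) for chunk in chunked(pmids_to_fetch, 200)]
```

With a helper like this, the loop header reduces to `for chunk in chunked(pmids_to_fetch, 200):`, which avoids the manual index arithmetic.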

nagutm · 2024-11-22T14:06:27Z

The latest commit introduces a unit test for the paper_ranking model to serve as a smoke test. The test ensures that the pipeline runs successfully without errors and generates the expected output files when provided with valid inputs.

It executes the paper_ranking CLI pipeline via the CliRunner to ensure that:

- The pipeline runs successfully with exit code 0.
- The output files evaluation.tsv, importances.tsv, and predictions.tsv are generated in the expected output directory.
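The shape of such a smoke test can be sketched roughly as follows, assuming the CLI is built with click. `toy_pipeline` and its `--output-dir` option are hypothetical stand-ins that only write placeholder files; the real paper_ranking command, its options, and the actual test in this PR may differ.

```python
import tempfile
from pathlib import Path

import click
from click.testing import CliRunner

# File names taken from the description above
EXPECTED_FILES = ("evaluation.tsv", "importances.tsv", "predictions.tsv")


@click.command()
@click.option("--output-dir", type=click.Path(file_okay=False), required=True)
def toy_pipeline(output_dir: str) -> None:
    """Hypothetical stand-in for the real CLI: writes placeholder outputs."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    for name in EXPECTED_FILES:
        directory.joinpath(name).write_text("placeholder\n")


def run_smoke_test() -> list:
    """Invoke the toy CLI and return the sorted names of the files it wrote."""
    runner = CliRunner()
    with tempfile.TemporaryDirectory() as tmp:
        output_dir = Path(tmp) / "output"
        result = runner.invoke(toy_pipeline, ["--output-dir", str(output_dir)])
        # A smoke test only checks that the run finished cleanly...
        assert result.exit_code == 0, result.output
        # ...and that the expected artifacts exist, not what they contain.
        produced = sorted(path.name for path in output_dir.iterdir())
    assert produced == sorted(EXPECTED_FILES)
    return produced
```

Checking only the exit code and the presence of output files keeps the test fast and independent of model quality, at the cost of not validating the contents of the files.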

Mufaddal Naguthanawala added 2 commits November 1, 2024 12:34

Filter curated papers

9732320

Combine all data sources

f9a4179

Simplify curated paper loading

aaf35c7

cthoyt reviewed Nov 3, 2024

cthoyt reviewed Nov 3, 2024

cthoyt and others added 8 commits November 4, 2024 14:57

Merge branch 'main' into pr/1248

387582e

Update paper_ranking.py

aa4614b

Add type annotations

d955acd

Fix typos

745fb95

Hardcode curated papers path

ad49617

Formatting

fd9bf1e

Add test for load_curated_papers

db761c6

Add smoke test for paper_ranking.py

74d467c

Mufaddal Naguthanawala and others added 5 commits November 22, 2024 09:47

Update pyproject.toml

e944613

Mock curated_papers_df return value

e5841c9

Check to_csv call count instead of actual predictions file

2ba2c95

Use proper paths to data files

ed94a83

Revert mocking for load_curated_papers

e94d7e9

bgyori merged commit b5cffaf into biopragmatics:main Nov 27, 2024
15 checks passed
