Add new provider for hgnc.symbol: cbgda #1225

Merged 5 commits on Nov 2, 2024

Conversation

Collaborator

@nagutm nagutm commented Oct 24, 2024

This resource does create unique identifiers in the format CB1, CB104, etc., as seen in the image below. However, entities cannot be resolved through these identifiers. Instead, they are resolved by gene name for cbgda.gene and by disease name for cbgda.disease.

image

"orcid": "0009-0009-5240-7463"
},
"description": "This collection represents diseases linked to host genes identified via genome-wide CRISPR screens. It includes detailed disease classifications, gene-disease associations, and integrated data on genetic factors contributing to disease development.",
"example": "Glioblastoma",
Member

We should discuss whether this qualifies as a semantic space. Combined with the fact that it's not a notable (or, judging from the UI design, high-quality) resource, maybe we should come up with some criteria for skipping this kind of resource. This is similar to how we don't import all of the BioPortal ontologies, because a lot of them are junk.

Member

@cthoyt cthoyt left a comment

  1. Have discussion about updating policy on criteria for notability + correctness
  2. Update curation guidelines for semi-automated literature curation accordingly
  3. Check if the gene namespace is just a provider for hgnc.symbol

Contributor

bgyori commented Oct 24, 2024

I agree with the general idea of using discretion to determine notability. We should probably come up with a curation tag in the paper curation tsv which expresses something like "relevant but not notable" (this would be a positive training sample for machine learning purposes but something that ended up not being added to the Bioregistry).

In this case, this is a resource published in Database (Oxford) with a working website, so I wouldn't dismiss it a priori as not notable, and the content seems to be pretty useful. Overall, curating it simply as a provider for hgnc.symbol would be appropriate.
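To make the "provider for hgnc.symbol" idea concrete, here is a minimal, self-contained sketch. It assumes a provider is little more than a short code plus a URL template in which `$1` stands for the identifier. The `Provider` class, its field names, and the `example.org` URL are placeholders for illustration, not the Bioregistry's actual schema or the resource's real endpoint.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Provider:
    """An alternate resolver for identifiers that belong to an existing prefix."""

    code: str        # short key for the provider
    name: str        # human-readable label
    uri_format: str  # URL template; "$1" marks where the identifier goes

    def resolve(self, identifier: str) -> str:
        """Fill the identifier into the URL template, escaping unsafe characters."""
        return self.uri_format.replace("$1", quote(identifier, safe=""))


# Hypothetical entry: the URL is a placeholder, not the resource's real endpoint.
cbgda = Provider(
    code="cbgda",
    name="cbgda (placeholder label)",
    uri_format="https://example.org/cbgda/gene/$1",
)

# Providers are grouped under the prefix whose identifiers they reuse,
# so no new prefix is needed for gene symbols.
providers_by_prefix = {"hgnc.symbol": [cbgda]}

for provider in providers_by_prefix["hgnc.symbol"]:
    print(provider.code, provider.resolve("TP53"))
```

The point of the design is that gene symbols stay in the `hgnc.symbol` namespace, and the new resource only contributes another way to look them up.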

@nagutm nagutm changed the title from "Add prefix: cbgda.gene and cbgda.disease" to "Add new provider for hgnc.symbol: cbgda" Oct 24, 2024

codecov bot commented Oct 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.49%. Comparing base (8950e70) to head (e35d8f7).
Report is 130 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1225      +/-   ##
==========================================
+ Coverage   42.51%   43.49%   +0.97%     
==========================================
  Files         117      118       +1     
  Lines        8327     8190     -137     
  Branches     1963     1346     -617     
==========================================
+ Hits         3540     3562      +22     
+ Misses       4582     4464     -118     
+ Partials      205      164      -41     


Collaborator Author

nagutm commented Oct 24, 2024

I think that not_notable should work as a tag to describe these types of resources for the future. If you agree, I can move forward with updating the relevancy vocabulary in the necessary files.
@bgyori

Contributor

bgyori commented Oct 28, 2024

> I think that not_notable should work as a tag to describe these types of resources for the future. If you agree, I can move forward with updating the relevancy vocabulary in the necessary files. @bgyori

Yes, we can define a tag like that on a separate PR and use it in the future whenever appropriate.

bgyori pushed a commit that referenced this pull request Oct 30, 2024
…curation workflow (#1236)

This pull request adds the `not_notable` tag to the CurationRelevance
vocabulary as a way to mark papers that are relevant for machine
learning training but do not meet the threshold for inclusion in the
Bioregistry.

While curating papers, there have been a few instances of entries that
provide new identifier information but aren't notable enough or
well-maintained enough for inclusion in the Bioregistry (#1225).
Rather than curating these as subpar prefixes, tagging them as
`not_notable` allows us to retain them as positive training samples
without cluttering the Bioregistry with less impactful entries.

Co-authored-by: Mufaddal Naguthanawala <[email protected]>
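
The effect of the `not_notable` tag described in the commit message can be sketched as follows. This is only an illustration under stated assumptions: apart from `not_notable`, the tag names, the column names, and the paper ids are made up, and the real `CurationRelevance` vocabulary and curation TSV may look different.

```python
import csv
import io
from enum import Enum


class CurationRelevance(str, Enum):
    """Illustrative relevance tags; only `not_notable` is taken from this thread."""

    new_prefix = "new_prefix"      # hypothetical tag
    new_provider = "new_provider"  # hypothetical tag
    irrelevant = "irrelevant"      # hypothetical tag
    not_notable = "not_notable"    # tag proposed in this PR discussion


# Tags that count as positive examples when training a classifier.
TRAINING_POSITIVE = {
    CurationRelevance.new_prefix,
    CurationRelevance.new_provider,
    CurationRelevance.not_notable,
}

# Tags that lead to an actual registry entry; `not_notable` is deliberately absent.
REGISTERED = {CurationRelevance.new_prefix, CurationRelevance.new_provider}

# Made-up curation sheet; the column names and paper ids are placeholders.
SHEET = (
    "paper\trelevancy_type\n"
    "paper-a\tnew_provider\n"
    "paper-b\tnot_notable\n"
    "paper-c\tirrelevant\n"
)


def summarize(tsv_text: str) -> dict[str, tuple[bool, bool]]:
    """Map each paper to (is_training_positive, is_registered)."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    summary = {}
    for row in rows:
        tag = CurationRelevance(row["relevancy_type"])
        summary[row["paper"]] = (tag in TRAINING_POSITIVE, tag in REGISTERED)
    return summary


for paper, flags in summarize(SHEET).items():
    print(paper, flags)
```

Keeping the two sets separate means a paper can inform the training data without forcing a registry entry, which is the distinction the tag is meant to capture.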
@cthoyt cthoyt merged commit 84581d5 into biopragmatics:main Nov 2, 2024
15 checks passed
3 participants