Add guide for semi-automated curation workflow (#1195)

Co-authored-by: Mufaddal Naguthanawala <[email protected]> Co-authored-by: Charles Tapley Hoyt <[email protected]> Co-authored-by: Benjamin M. Gyori <[email protected]>
biopragmatics · Nov 17, 2024 · 54a91e1 · 54a91e1
1 parent 1a997da
commit 54a91e1
Show file tree

Hide file tree

Showing 2 changed files with 267 additions and 0 deletions.
diff --git a/docs/guides/README.md b/docs/guides/README.md
@@ -4,6 +4,7 @@ This folder contains various task-specific curation guides.
 
 - [Curating new providers](curation/providers)
 - [Curating new publications and references](curation/publications)
+- [Semi-automated curation workflow for new prefixes, providers, and publications](curation/literature)
 
 ## How to add new guides
 
@@ -13,6 +14,9 @@ This folder contains various task-specific curation guides.
    for an example)
 3. Add it to the list above. Don't include a forward slash `/` in the beginning
    of the link!
+4. Make sure you run
+   `npx prettier --prose-wrap always --check "**/*.md" --write` to properly
+   format your markdown
 
 ## What makes a good guide
 

diff --git a/docs/guides/literature_curation.md b/docs/guides/literature_curation.md
@@ -0,0 +1,263 @@
+---
+layout: page
+title: Semi-automated Curation of New Prefixes, Providers, and Publications
+permalink: /curation/literature
+---
+
+The Bioregistry aims to establish a comprehensive resource for the curation of
+biological identifiers. By efficiently identifying relevant resources, curators
+help expand the Bioregistry’s utility for the wider scientific community. This
+guide offers a structured approach for curators to assess and classify new
+information, ensuring that updates to the Bioregistry are both precise and
+thorough.
+
+The Bioregistry uses a machine learning model to automatically identify PubMed
+papers that are potential candidates for curation. Each month, the model
+produces a ranked list of papers based on their relevance to the Bioregistry.
+These papers are relevant for expanding the Bioregistry in at least three ways:
+
+1. As a **new prefix** for a resource providing primary identifiers,
+2. As a **new provider** for resolving existing identifiers,
+3. As a **new publication** related to an existing prefix in the Bioregistry.
+
+This guide provides a working table of `relevancy_type` tags, which are used to
+classify the relevance of each paper. Curators can use these tags to categorize
+papers during the review process. The tags are part of the following
+[TSV file](https://github.com/biopragmatics/bioregistry/blob/main/src/bioregistry/data/curated_papers.tsv).
+These updates help retrain the model, improving its accuracy over time.
+
+The ranked list of suggested papers can be found
+[here](https://github.com/biopragmatics/bioregistry/issues/1165). When reviewing
+a paper, curators should update the TSV file with the following information:
+
+- `pmid`: The PubMed ID of the paper being reviewed.
+- `relevant`: 1 for relevant, 0 for not relevant.
+- `orcid`: The ORCID of the curator reviewing the paper.
+- `date_curated`: The date the paper was reviewed.
+- `relevancy_type`: The type of relevance as defined in the table below.
+- `pr_added`: The pull request number associated with the curation.
+- `notes`: Any additional notes or comments regarding the paper's relevance or
+  findings.
+
+## Relevancy Type Table
+
+This table of `relevancy_type` tags is continuously evolving as new papers are
+evaluated and is subject to change in the future.
+
+| Key                      | Definition                                                                                 |
+| ------------------------ | ------------------------------------------------------------------------------------------ |
+| new_prefix               | A resource for new primary identifiers                                                     |
+| new_provider             | A resolver for existing identifiers                                                        |
+| new_publication          | A new publication for an existing prefix                                                   |
+| not_identifiers_resource | A database, but not for identifier information                                             |
+| no_website               | Paper suggestive of a new database, but no link to website provided                        |
+| existing                 | An existing entry in the bioregistry                                                       |
+| unclear                  | Not clear how to curate in the bioregistry, follow up discussion required                  |
+| irrelevant_other         | Completely unrelated information                                                           |
+| not_notable              | Relevant for training purposes, but not curated in Bioregistry due to poor/unknown quality |
+
+## Common Mistakes
+
+New curators may encounter some common challenges when reviewing papers and
+curating data. Below are a few mistakes to be aware of, along with tips on how
+to avoid them:
+
+**1. Confusing Databases with Semantic Spaces**
+
+One common mistake is focusing on describing the database rather than the
+semantic space it organizes. A database provides structured data, such as
+identifiers, while a semantic space organizes entities and their relationships
+within a conceptual framework.
+
+When curating a resource, the Bioregistry record should describe the semantic
+space, that is, the entities and relationships the resource represents rather
+than the database itself. Explore the resource to identify multiple potential
+semantic spaces and curate separate prefixes for each entity type if necessary.
+The goal is to capture how the resource organizes and relates concepts, not just
+the data it holds.
+
+**2. Mislabeling Existing Resources as New**
+
+Another common mistake is labeling an `existing` resource as a `new_prefix` or
+`new_provider`. Before curating a `new_prefix` or `new_provider`, first check if
+the resource is already listed in the Bioregistry. If the resource exists,
+consider whether the paper might be introducing a `new_publication` associated
+with that resource, rather than a completely new entry. This prevents duplicate
+entries for existing resources.
+
+**3. Misunderstanding the Scope of Irrelevant Information**
+
+Not every paper mentioning biological resources is relevant to the Bioregistry.
+Papers that discuss databases not focused on identifier information, for
+example, should be marked as `not_identifiers_resource`. Similarly, entirely
+unrelated papers should be tagged as `irrelevant_other`. Being clear on the
+scope of the Bioregistry’s focus can help avoid curating irrelevant materials.
+
+## Curation and Data Synchronization
+
+When curators add rows to the curation TSV file, these entries should correspond
+to specific changes made in the Bioregistry data files. Each pull request should
+encompass both the updates to the TSV file and the relevant modifications to the
+data files in the Bioregistry repository.
+
+## Step-by-Step Example to Curating a New Prefix
+
+The following step-by-step example is for the resource
+[SCancerRNA](http://www.scancerrna.com/) based on the publication
+[SCancerRNA: Expression at the Single-cell Level and Interaction Resource of Non-coding RNA Biomarkers for Cancers](https://pubmed.ncbi.nlm.nih.gov/39341795/).
+
+**1. Assess the Database for Identifier Creation**
+
+Begin by exploring the database to determine if it generates new identifiers for
+life sciences entities. This is an investigative process, and there isn’t a
+one-size-fits-all approach; however, most databases typically have a Browse or
+Search section, which serves as a good starting point. Take your time to
+navigate various categories to confirm that the resource creates relevant
+identifiers. Once verified, proceed to fill out the TSV file with the
+preliminary information you gathered.
+
+| pmid     | relevant | orcid               | date_curated | relevancy_type | pr_added | notes                                                |
+| -------- | -------- | ------------------- | ------------ | -------------- | -------- | ---------------------------------------------------- |
+| 39341795 | 1        | 0009-0009-5240-7463 | 2024-10-19   | new_prefix     | 1215     | identifiers of non-coding RNA biomarkers for cancers |
+
+**2. Collect Essential Information**
+
+Gather easily accessible information for the resource, such as:
+
+- Name and Email for a point of contact (github and ORCID if possible as well)
+- Example identifier
+- Homepage URL
+- Name of the resource
+- Publication information (such as PubMed ID, DOI, title, year)
+- URI format to resolve identifiers
+
+This data will be necessary for filling out the Bioregistry record.
+
+**3. Write a Brief Description**
+
+Create a concise description that explains what kind of entities the resource
+makes identifiers for and its general purpose.
+
+**4. Write a Regex Pattern**
+
+Examine the format of the identifiers used by the resource and write a regex
+pattern to validate this format. It’s better to create a pattern that is
+somewhat flexible to accommodate potential future identifier additions.
+
+**5. Update `bioregistry.json`**
+
+```json
+"scancerna": {
+    "contact": {
+      "email": "[email protected]",
+      "name": "Tianyi Zhao",
+      "orcid": "0000-0001-7352-2959"
+    },
+    "contributor": {
+      "email": "[email protected]",
+      "github": "nagutm",
+      "name": "Mufaddal Naguthanawala",
+      "orcid": "0009-0009-5240-7463"
+    },
+    "description": "SCancerRNA provides identifiers for non-coding RNA biomarkers, including long ncRNA, microRNA, PIWI-interacting RNA, small nucleolar RNA, and circular RNA, with data on their differential expression at the cellular level in cancer.",
+    "example": "9530",
+    "github_request_issue": 1215,
+    "homepage": "http://www.scancerrna.com/",
+    "name": "SCancerRNA",
+    "pattern": "^\\d+$",
+    "publications": [
+      {
+        "doi": "10.1093/gpbjnl/qzae023",
+        "pubmed": "39341795",
+        "title": "SCancerRNA: Expression at the Single-cell Level and Interaction Resource of Non-coding RNA Biomarkers for Cancers",
+        "year": 2024
+      }
+    ],
+    "uri_format": "http://www.scancerrna.com/toDetail?id=$1"
+  },
+```
+
+**6. Submit a Pull Request**
+
+Submit a pull request with the changes you made to both the TSV file and the
+`bioregistry.json` file. Make sure the PR includes all necessary updates.
+
+## Example Prefix Curation with Multiple Semantic Spaces
+
+In this example, two prefixes have been curated from the Asteraceae Genome
+Database (AGD), based on the publication
+[Asteraceae Genome Database: A Comprehensive Platform for Asteraceae Genomics](https://pmc.ncbi.nlm.nih.gov/articles/PMC11366637/).
+
+The dot notation is used to indicate that both `asteraceaegd.genome` and
+`asteraceaegd.plant` are part of the same overarching resource (AGD), but each
+prefix represents a distinct semantic space:
+
+- `asteraceaegd.genome` focuses on the genomic information for Asteraceae
+  species.
+- `asteraceaegd.plant` focuses on the broader phenotypic and genetic data about
+  Asteraceae plants.
+
+By curating separate prefixes for each semantic space, the Bioregistry ensures
+clear and precise representation of the different types of data provided by the
+AGD. This approach allows users to distinguish between the different kinds of
+identifiers and the types of biological information they refer to within the
+same database.
+
+```json
+"asteraceaegd.genome": {
+    "contact": {
+      "email": "[email protected]",
+      "name": "Wei Chen"
+    },
+    "contributor": {
+      "email": "[email protected]",
+      "github": "nagutm",
+      "name": "Mufaddal Naguthanawala",
+      "orcid": "0009-0009-5240-7463"
+    },
+    "description": "The AGD is an integrated database resource dedicated to collecting the genomic-related data of the Asteraceae family. This collection refers to the genomic data of Asteraceae species.",
+    "example": "0002",
+    "github_request_issue": 1214,
+    "homepage": "https://cbcb.cdutcm.edu.cn/AGD/",
+    "name": "Asteraceae Genome Database",
+    "pattern": "^\\d{4}$",
+    "publications": [
+      {
+        "doi": "10.3389/fpls.2024.1445365",
+        "pmc": "PMC11366637",
+        "pubmed": "39224843",
+        "title": "Asteraceae genome database: a comprehensive platform for Asteraceae genomics",
+        "year": 2024
+      }
+    ],
+    "uri_format": "https://cbcb.cdutcm.edu.cn/AGD/genome/details/?id=$1"
+  },
+"asteraceaegd.plant": {
+    "contact": {
+      "email": "[email protected]",
+      "name": "Wei Chen"
+    },
+    "contributor": {
+      "email": "[email protected]",
+      "github": "nagutm",
+      "name": "Mufaddal Naguthanawala",
+      "orcid": "0009-0009-5240-7463"
+    },
+    "description": "The AGD is an integrated database resource dedicated to collecting the genomic-related data of the Asteraceae family. This collections refers to the broader phenotypic and genetic resources of Asteraceae plants.",
+    "example": "0016",
+    "github_request_issue": 1214,
+    "homepage": "https://cbcb.cdutcm.edu.cn/AGD/",
+    "name": "Asteraceae Genome Database",
+    "pattern": "^\\d{4}$",
+    "publications": [
+      {
+        "doi": "10.3389/fpls.2024.1445365",
+        "pmc": "PMC11366637",
+        "pubmed": "39224843",
+        "title": "Asteraceae genome database: a comprehensive platform for Asteraceae genomics",
+        "year": 2024
+      }
+    ],
+    "uri_format": "https://cbcb.cdutcm.edu.cn/AGD/plant/details/?id=$1"
+  },
+```