Training datasets for ML/AI - publication centric #1181

ValWood · 2024-06-18T15:17:17Z

Create a "publication centric" file containing all entities / annotations (all datatypes) for each publication.

JSON?

kimrutherford · 2024-06-19T10:24:26Z

JSON makes sense. How urgent is this?

ValWood · 2024-06-19T10:43:44Z

It would be good to have it in a few weeks, I think, to keep the ball rolling. I'm meeting the ePMC ML person on Monday if you want to join (I've forwarded the invite).
v

kimrutherford · 2024-06-21T06:08:12Z

I'll start this on Monday. It might take a couple of days because the existing code needs improving first. A lot was written in a hurry for PomBase v2. Now I've had time (7 years?) to think about it, there are better ways to do things.

Proposed JSON structure (work in progress):

PMID:

MF annotations
- gene
- term
- evidence
- extension (as JSON)
- comment
BP annotations
- (same details)
CC annotations
- (same details)
FYPO annotations
- genotype
  - loci (one or more)
    - alleles (one or more):
      - type
      - name
      - description
- term
- evidence
- conditions (as JSON)
- extension (as JSON)
- comment
Modification annotations
- ...
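A minimal sketch of how the outline above could be produced, assuming the annotations start as flat records that each carry a PMID and an annotation type. Every key name, identifier, and value below (`pmid`, `type`, `gene-A`, `GO:0000001`, the `group_by_publication` helper, and so on) is a hypothetical placeholder for illustration, not the existing PomBase code or real data.

```python
import json
from collections import defaultdict

# Hypothetical flat annotation records. All keys and values are placeholders.
flat_annotations = [
    {
        "pmid": "PMID:0000001",
        "type": "MF",
        "gene": "gene-A",
        "term": "GO:0000001",
        "evidence": "IDA",
        "extension": [],
        "comment": None,
    },
    {
        "pmid": "PMID:0000001",
        "type": "FYPO",
        "genotype": {
            "loci": [
                {
                    "alleles": [
                        {
                            "type": "deletion",
                            "name": "gene-A-delta",
                            "description": "deletion",
                        }
                    ]
                }
            ]
        },
        "term": "FYPO:0000001",
        "evidence": "Cell growth assay",
        "conditions": [],
        "extension": [],
        "comment": "example comment",
    },
]


def group_by_publication(annotations):
    """Group flat records into {pmid: {"<type> annotations": [details, ...]}}."""
    by_pub = defaultdict(lambda: defaultdict(list))
    for record in annotations:
        # Everything except the grouping keys becomes the annotation details.
        details = {k: v for k, v in record.items() if k not in ("pmid", "type")}
        by_pub[record["pmid"]][f'{record["type"]} annotations'].append(details)
    # Convert the nested defaultdicts to plain dicts before serialising.
    return {pmid: dict(sections) for pmid, sections in by_pub.items()}


publication_centric = group_by_publication(flat_annotations)
print(json.dumps(publication_centric, indent=2))
```

Keeping extensions and conditions as nested JSON values rather than pre-formatted strings would leave them easy to reshape if a particular group asks for a different layout.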

kimrutherford · 2024-06-25T23:58:17Z

From Zoom: make sure to include annotation comments in the output.

ValWood · 2024-06-26T10:58:28Z

Related: #1185. We'll discuss this on the next call.

kimrutherford · 2024-08-21T23:54:28Z

After the chat with ePMC a while ago, I'm wondering whether it's useful to create a file like this in advance. It sounded like each group uses its own particular file formats. So perhaps we should create files when asked? Unless it's an especially wacky format, I think I could create files on request with a 24-hour turnaround.

ValWood · 2024-08-22T05:45:33Z

OK keep this on the back burner.

kimrutherford self-assigned this Jun 19, 2024

kimrutherford added high priority next labels Jun 19, 2024

kimrutherford changed the title ~~Trainig datasets for ML/AI publication centric~~ Training datasets for ML/AI - publication centric Jun 25, 2024

ValWood added the AI-datasets label Jul 30, 2024

ValWood removed high priority next labels Aug 22, 2024
