Training datasets for ML/AI - publication centric #1181

Open
ValWood opened this issue Jun 18, 2024 · 7 comments

@ValWood
Member

ValWood commented Jun 18, 2024

Create a "publication centric" file containing all entities / annotations (all datatypes) for each publication.

JSON?

@kimrutherford kimrutherford self-assigned this Jun 19, 2024
@kimrutherford
Member

JSON makes sense. How urgent is this?

@ValWood
Member Author

ValWood commented Jun 19, 2024

It would be good to have it in a few weeks, I think, to keep the ball rolling. I'm meeting the ePMC ML person on Monday if you want to join (I've forwarded the invite).
v

@kimrutherford
Member

kimrutherford commented Jun 21, 2024

I'll start this on Monday. It might take a couple of days because the existing code needs improving first. A lot was written in a hurry for PomBase v2. Now I've had time (7 years?) to think about it, there are better ways to do things.

Proposed JSON structure (work in progress):

PMID:

  • MF annotations
    • gene
    • term
    • evidence
    • extension (as JSON)
    • comment
  • BP annotations
    • (same details)
  • CC annotations
    • (same details)
  • FYPO annotations
    • genotype
      • loci (one or more)
        • alleles (one or more):
          • type
          • name
          • description
    • term
    • evidence
    • conditions (as JSON)
    • extension (as JSON)
    • comment
  • Modification annotations
    • ...
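
To make the outline above a bit more concrete, here is a minimal sketch of how flat annotation records could be grouped into one entry per PMID. Everything in it is an assumption for illustration: the key names (`mf_annotations`, `fypo_annotations`, and so on), the `group_by_publication` helper, and the placeholder IDs are hypothetical and not the actual PomBase export code or schema.

```python
import json
from collections import defaultdict

# Hypothetical section names, one per annotation type in the outline above.
SECTION_KEYS = {
    "MF": "mf_annotations",
    "BP": "bp_annotations",
    "CC": "cc_annotations",
    "FYPO": "fypo_annotations",
}

# Illustrative input records; all IDs and values are placeholders, not real data.
annotations = [
    {
        "pmid": "PMID:0000000",
        "type": "MF",
        "gene": "example-gene-1",
        "term": "GO:0000000",
        "evidence": "IDA",
        "extension": [],
        "comment": None,
    },
    {
        "pmid": "PMID:0000000",
        "type": "FYPO",
        "genotype": {
            "loci": [
                {
                    "alleles": [
                        {
                            "type": "deletion",
                            "name": "example-allele",
                            "description": "deletion",
                        }
                    ]
                }
            ]
        },
        "term": "FYPO:0000000",
        "evidence": "example evidence",
        "conditions": [],
        "extension": [],
        "comment": "example annotation comment",
    },
]


def group_by_publication(records):
    """Group flat annotation records into {pmid: {section: [details, ...]}}."""
    by_pub = defaultdict(lambda: defaultdict(list))
    for rec in records:
        # Drop the grouping keys; everything else is the annotation detail.
        details = {k: v for k, v in rec.items() if k not in ("pmid", "type")}
        by_pub[rec["pmid"]][SECTION_KEYS[rec["type"]]].append(details)
    # Convert nested defaultdicts to plain dicts before serialising.
    return {pmid: dict(sections) for pmid, sections in by_pub.items()}


result = group_by_publication(annotations)
print(json.dumps(result, indent=2))
```

Keeping the extension and conditions as nested JSON values rather than pre-formatted strings would leave it to each consumer to flatten them however their pipeline needs.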

@kimrutherford kimrutherford changed the title from "Trainig datasets for ML/AI publication centric" to "Training datasets for ML/AI - publication centric" Jun 25, 2024
@kimrutherford
Member

From Zoom: make sure to include annotation comments in the output.

@ValWood
Member Author

ValWood commented Jun 26, 2024

Related: #1185. We'll discuss this on the next call.

@kimrutherford
Member

After the chat with ePMC a while ago, I'm wondering whether it's useful to create a file like this in advance. It sounded like each group uses its own particular file format. So perhaps we should create files when asked? Unless it's an especially wacky format, I think I could create files on request with a 24-hour turnaround.

@ValWood
Member Author

ValWood commented Aug 22, 2024

OK keep this on the back burner.
