Comments on plan for AGR/MGI in the README #21

Open
cmungall opened this issue Aug 1, 2018 · 2 comments

cmungall commented Aug 1, 2018

I see the README has details on how Alliance data is to be encoded in DATS, thanks for adding this.

Are there any example JSON files?

The README says:

AGR/MGI encoding
The preliminary encoding for the MGI mouse reference genome annotation is quite simple

This is a bit confusing. The AGR (preferred name: Alliance) is more than MGI. Is the plan to get data directly from MGI? Or to get mouse data from the Alliance (which may temporarily be less complete than what is obtained from MGI)? Or to get all species data from the Alliance?

I think it should be all species data; I'm not sure why MGI is highlighted specifically.

The HomoloGene ids and HomoloGene-derived human gene ids in relatedIdentifiers...

Human homologs should be obtained from the Alliance; this will be more accurate than HomoloGene.

Overall comments:

The KC7 products Google doc says that expression data will be captured from the Alliance (or at least from MGI), but the example in the README is just the basic gene information. The Alliance is also producing gene-to-phenotype data that is of broad interest. How should this be resolved?

It looks like the data model used is a generic one in which arbitrary Dimensions and CategoryValuePairs can be attached to arbitrary molecular entities. I think there are some advantages to such a generic model, but I question whether this is the best way of representing what is in knowledge bases like the Alliance. It feels like an impedance mismatch. In the diagram:

image

This just seems like a slightly awkward way of expressing what can be expressed more accurately in a line of GFF3 or in the Alliance's own native JSON format. It's not clear how well the dimension model will adapt to richer data from the Alliance, e.g. expression or phenotype.
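
To make the comparison concrete, here is a rough Python sketch of the same gene expressed as one GFF3 line and as a generic entity with category/value pairs attached. Everything in it is an assumption for illustration: the GFF3 record (`EXAMPLE:0001`, `ExampleGene`, the coordinates) is made up, and the `category_value_pairs` layout is a hypothetical stand-in, not the actual DATS schema or the encoding described in the README.

```python
# Illustrative only: the GFF3 record and the dict layout below are invented for
# comparison. They are not taken from the README or from the DATS schema.
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff3_line(line):
    """Split one tab-delimited GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 is a ';'-separated list of tag=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if pair
    )
    return record


def to_generic_entity(record):
    """Re-express the record as an entity with a flat list of category/value pairs.

    The key names here are hypothetical, chosen only to mimic a generic model.
    """
    pairs = [
        {"category": key, "value": record[key]}
        for key in ("seqid", "start", "end", "strand")
    ]
    return {
        "name": record["attributes"].get("Name"),
        "identifier": record["attributes"].get("ID"),
        "category_value_pairs": pairs,
    }


# A made-up feature line; not real annotation data.
example = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=EXAMPLE:0001;Name=ExampleGene"
entity = to_generic_entity(parse_gff3_line(example))
```

In the generic form the location is spread across four separate pairs whose meaning depends on the category strings, whereas the GFF3 columns carry that meaning by position. That is the kind of mismatch I mean, though the real DATS encoding may differ in detail.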

I propose that we simultaneously evaluate the biolink model for knowledge resources such as the Alliance. This would incur additional cost on the full stacks if they want to support both but it would be interesting to compare.

@proccaserra
Contributor

@cmungall this initial "MGI" DATS file is more a range-finding exercise than anything else. We have already discussed several times that, for such molecular/genome annotation information, there may be little value in creating yet another representation.
If you and the Alliance can produce a JSON instance and/or the RDF/XML for AGR information, it could be used to complement DATS coverage of datasets.
This again raises the key question: what are the query cases? How do people want to cast their net?
We are all reading the use-case documents.

@proccaserra
Contributor

@cmungall also see issue #20, which discusses similar issues to those you raised. We discussed this with @jonathancrabtree and @agbeltran.
