Incorporate GUIDs for public GTEx v7 datafiles into DATS JSON. #6

Open
jonathancrabtree opened this issue Jun 12, 2018 · 3 comments
Labels: enhancement (New feature or request)

@jonathancrabtree

Incorporate GTEx GUIDs from KC2 into the DATS JSON. These GUIDs were described in a recent e-mail from Martin Fenner:

I have registered GUIDs for the first small set of GTEx datasets. These are the 23 datasets at the top of https://www.gtexportal.org/home/datasets, and I used DOIs and the core metadata we described in the KC2 document submitted last week.

You can access the DOIs via the API or the web interface:
https://api.datacite.org/works?data-center-id=datacite.gtex
https://search.datacite.org/data-centers/datacite.gtex

The schema.org metadata are embedded in the individual page for each DOI and available via content negotiation:
https://search.datacite.org/works/10.25491/t9mt-na55
https://data.datacite.org/application/vnd.schemaorg.ld+json/10.25491/t9mt-na55 or via https://data.datacite.org/10.25491/t9mt-na55 with application/vnd.schemaorg.ld+json in the Accept header.

Special thanks to Jared and the GTEx team for working with Team Sodium on this.

Next steps are individual landing pages for each dataset, providing a direct link to the content in the metadata, and integrating this with bdbags. In parallel we will move forward with registering more GUIDs so that they can be used for planning the demos that need those GUIDs.
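The content-negotiation access described in the e-mail could be scripted roughly as follows. This is only a sketch: the function names are illustrative, and it assumes the media type is sent in the HTTP Accept request header, which is the usual mechanism for content negotiation. Building the request is kept separate from sending it so that the request can be inspected without network access.

```python
import json
import urllib.request

# Media type named in the e-mail for schema.org JSON-LD metadata.
SCHEMA_ORG_JSONLD = "application/vnd.schemaorg.ld+json"


def build_metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for a DOI.

    Assumption: the media type goes in the Accept request header.
    """
    url = f"https://data.datacite.org/{doi}"
    return urllib.request.Request(url, headers={"Accept": SCHEMA_ORG_JSONLD})


def fetch_metadata(doi: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body (requires network access)."""
    request = build_metadata_request(doi)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_metadata("10.25491/t9mt-na55")` would be expected to return the record for the Transcript TPMs dataset, provided the service is still reachable at that address.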

jonathancrabtree self-assigned this on Jun 13, 2018
@jonathancrabtree

My understanding is that each of these GUIDs (e.g., a URI like "https://doi.org/10.25491/t9mt-na55") should be used as the JSON-LD id of the corresponding DATS Dataset. A couple of observations from looking at the JSON for the GUID (curl -X GET 'https://data.datacite.org/application/vnd.schemaorg.ld+json/10.25491/t9mt-na55'):

  • The author is of type "Person", whereas the publisher is of type "Organization". The schema.org page seems to say that either type is acceptable for author, and in both cases (author and publisher) the value given is an organization (the GTEx Consortium) rather than a person. See the excerpt below (with the leading "@" characters removed from the keys):

  "author": {
    "type": "Person",
    "name": "The GTEx Consortium",
    "givenName": "The GTEx",
    "familyName": "Consortium"
  },
  "version": "v7",
  "keywords": "gtex, annotation, phenotype, gene regulation, transcriptomics",
  "datePublished": "2017",
  "isPartOf": {
    "type": "DataCatalog",
    "name": "GTEx"
  },
  "schemaVersion": "http://datacite.org/schema/kernel-4",
  "publisher": {
    "type": "Organization",
    "name": "GTEx"
  },

{
  "context": "http://schema.org",
  "type": "Dataset",
  "id": "https://doi.org/10.25491/t9mt-na55",
  "identifier": "https://doi.org/10.25491/t9mt-na55",
  "url": "https://www.gtexportal.org/home/datasets",
  "additionalType": "Transcript TPMs",
  "name": "Transcript TPMs",

I'm not sure what the standard practices are for generating GUIDs for data files, so perhaps this is how this would normally be done. My main concern is that linking the GUID to the actual data file in an automated fashion appears to boil down to parsing an HTML table and matching up a value in one column in order to obtain the filename in another.
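To make the concern concrete, the lookup would amount to something like the sketch below. Everything about the table is an assumption: the column order, the cell contents, and the filename in SAMPLE_HTML are made up for illustration and are not taken from the GTEx portal page. The point is only that the link from a GUID's name (e.g. "Transcript TPMs") to a file depends on scraping and string matching.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def filename_for(html, description, name_col=0, desc_col=1):
    """Return the filename whose description cell equals `description`.

    The column positions are hypothetical defaults, not the real layout.
    """
    parser = TableRows()
    parser.feed(html)
    parser.close()
    width = max(name_col, desc_col) + 1
    matches = [row[name_col] for row in parser.rows
               if len(row) >= width and row[desc_col] == description]
    if len(matches) != 1:
        raise LookupError(f"{len(matches)} rows match {description!r}")
    return matches[0]


# Made-up table for illustration only; not the real portal markup.
SAMPLE_HTML = """
<table>
  <tr><th>File</th><th>Description</th></tr>
  <tr><td>example_transcript_tpm.txt.gz</td><td>Transcript TPMs</td></tr>
  <tr><td>example_gene_reads.gct.gz</td><td>Gene read counts</td></tr>
</table>
"""

print(filename_for(SAMPLE_HTML, "Transcript TPMs"))
```

Any change to the page layout or to the wording of a description would silently break a lookup of this kind, which is why a direct link to the content in the metadata would be preferable.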

@jonathancrabtree

GUIDs/DOIs have been added to the 7 RNA-Seq Datasets currently in the public GTEx v7 DATS. A couple of issues still have to be resolved, however. One is where exactly the DOIs should go. I've placed them in the JSON-LD id field of each Dataset, but not in the identifier, where they presumably also belong. @proccaserra, @agbeltran, does this sound right? In other words, right now we have this (leading "@" characters removed from the keys to prevent GitHub from linking to user accounts):

  "type": "Dataset",
  "context": "https://w3id.org/dats/context/sdo/dataset_context.jsonld",
  "id": "https://doi.org/10.25491/zzv1-xb48",
  "identifier": {
    "type": "Identifier",
    "id": "",
    "identifier": "GTEx_Analysis_2016-01-15_v7_RNA-SEQ_GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads.gct.gz"
  },
  "version": "v7",

And my suggestion/understanding is that the DOI URI also belongs in the identifier field of the identifier object, where the filename currently resides (and an identifierSource should also be added).
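A minimal sketch of that suggestion, assuming the Dataset is held as a Python dict with the same "@"-less keys shown above. The function name is hypothetical, and the identifierSource value "DOI" is a placeholder rather than an agreed convention. Because it is still undecided where the filename should go, the sketch simply hands the displaced value back to the caller.

```python
import copy


def with_doi_identifier(dataset, identifier_source="DOI"):
    """Return (updated_dataset, previous_identifier_value).

    Copies the Dataset's JSON-LD id (the DOI URI) into
    identifier.identifier and adds an identifierSource.
    The input dict is left unmodified.
    """
    updated = copy.deepcopy(dataset)
    identifier = dict(updated.get("identifier") or {})
    previous = identifier.get("identifier")  # currently the filename
    identifier.update({
        "type": "Identifier",
        "identifier": updated["id"],
        "identifierSource": identifier_source,  # placeholder value
    })
    updated["identifier"] = identifier
    return updated, previous


dataset = {
    "type": "Dataset",
    "id": "https://doi.org/10.25491/zzv1-xb48",
    "identifier": {
        "type": "Identifier",
        "id": "",
        "identifier": "GTEx_Analysis_2016-01-15_v7_RNA-SEQ_GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads.gct.gz",
    },
    "version": "v7",
}

updated, filename = with_doi_identifier(dataset)
print(updated["identifier"]["identifier"])
print(filename)
```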

The other unresolved issue is that the DOIs for the subject and sample metadata files (which were used to generate the GTEx DATS) should also be added somewhere:

id= https://doi.org/10.25491/sx2w-0730 title=A de-identified, open access version of the subject phenotypes available in dbGaP
id= https://doi.org/10.25491/9pwt-7167 title=A de-identified, open access version of the sample annotations available in dbGaP
id= https://doi.org/10.25491/0w9a-h514 title=A data dictionary that describes each variable in the GTEx_v7_Annotations_SampleAttributesDS.txt
id= https://doi.org/10.25491/5e92-ht74 title=A data dictionary that describes each variable in the GTEx_v7_Annotations_SubjectPhenotypesDS.txt

@jonathancrabtree

DOIs are now in the v0.3 release, although the question of whether to repeat the DOI in the Identifier (and move the filename elsewhere) has been punted on for now.
