Data Generation and Curation
Trevor Keller (@tkphd) and Daniel Wheeler (@wd15)
- Save as much of the data from your published work as possible, along with its metadata
- Save the inputs used to produce the results from all your published work
We discussed the FAIR Principles at CHiMaD Phase-Field XIII:
Findable
- F1. (Meta)data are assigned a globally unique and persistent identifier
- F2. Data are described with rich metadata (defined by R1 below)
- F3. Metadata clearly and explicitly include the identifier of the data they describe
- F4. (Meta)data are registered or indexed in a searchable resource
Accessible
- A1. (Meta)data are retrievable by their identifier using a standardized communications protocol
- A1.1. The protocol is open, free, and universally implementable
- A1.2. The protocol allows for an authentication and authorisation procedure, where necessary
- A2. Metadata are accessible, even when the data are no longer available
Interoperable
- I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
- I2. (Meta)data use vocabularies that follow FAIR principles
- I3. (Meta)data include qualified references to other (meta)data
Reusable
- R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
- R1.1. (Meta)data are released with a clear and accessible data usage license
- R1.2. (Meta)data are associated with detailed provenance
- R1.3. (Meta)data meet domain-relevant community standards
Historically, PFHub has accepted datasets linked from any host on the Web. At this time, we recommend using Zenodo to host your benchmark data. Why? It's not "just" a shared folder.
- Guided prompts to describe what you're uploading
- DOI is automatically assigned to your dataset
- Basic metadata exported in multiple formats
- Browser-based viewers for CSV, Markdown, PDF, images, videos
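Because every Zenodo record gets a DOI, a machine can ask the DOI resolver for metadata instead of scraping the landing page. The sketch below only builds such a request and does not send it. The helper name `metadata_request` is hypothetical, and the media type is an assumption about what the DOI registrar supports through content negotiation, so check the registrar's documentation before relying on it.

```python
from urllib.request import Request

# Assumption: the DOI registrar honors this media type for Schema.org JSON-LD.
SCHEMA_ORG_JSONLD = "application/vnd.schemaorg.ld+json"


def metadata_request(doi, media_type=SCHEMA_ORG_JSONLD):
    """Build (but do not send) an HTTP request for a DOI's metadata."""
    doi = doi.strip()
    # Accept either a bare DOI or a full https://doi.org/ URL.
    prefix = "https://doi.org/"
    url = doi if doi.startswith(prefix) else prefix + doi
    # The Accept header asks the resolver for metadata rather than a web page.
    return Request(url, headers={"Accept": media_type})


request = metadata_request("10.5281/zenodo.1124941")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the lookup over plain HTTPS, which is the kind of open, standardized protocol the FAIR principles ask for.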
Zenodo gives you the option to import a repository directly from GitHub. The original FAIR Phase-field talk was "uploaded" this way, producing the following record. While basic authorship information was captured, this tells an interested person or machine nothing meaningful about the dataset.
{
"@context": "https://schema.org/",
"@id": "https://doi.org/10.5281/zenodo.6540105",
"@type": "SoftwareSourceCode",
"name": "tkphd/fair-phase-field-data: CHiMaD Phase-field XIII",
"description": "FAIR Principles for Phase-Field Practitioners",
"version": "v0.1.0",
"license": "",
"identifier": "https://doi.org/10.5281/zenodo.6540105",
"url": "https://zenodo.org/record/6540105",
"datePublished": "2022-05-11",
"creator": [{
"@type": "Person",
"givenName": "Trevor",
"familyName": "Keller",
"affiliation": "NIST"}],
"codeRepository": "https://github.com/tkphd/fair-phase-field-data/tree/v0.1.0"
}
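To make "nothing meaningful" concrete, here is a minimal sketch that flags what the record above lacks. The property names come from Schema.org, but the checklist itself (`RICHNESS_CHECKS`) and the helper `missing_metadata` are assumptions made for illustration, not a PFHub or Zenodo rule.

```python
# Hypothetical checklist: which properties count as "rich" is an assumption.
RICHNESS_CHECKS = ("license", "keywords", "distribution")


def missing_metadata(record):
    """Return the checklist items that are absent or empty in a JSON-LD record."""
    missing = [key for key in RICHNESS_CHECKS if not record.get(key)]
    # A creator without an "@id" (such as an ORCID URL) is hard to disambiguate.
    creators = record.get("creator", [])
    if not creators or any("@id" not in person for person in creators):
        missing.append("creator @id")
    return missing


# Selected fields from the GitHub-import record shown above.
sparse = {
    "@context": "https://schema.org/",
    "@id": "https://doi.org/10.5281/zenodo.6540105",
    "@type": "SoftwareSourceCode",
    "name": "tkphd/fair-phase-field-data: CHiMaD Phase-field XIII",
    "license": "",
    "creator": [{
        "@type": "Person",
        "givenName": "Trevor",
        "familyName": "Keller",
        "affiliation": "NIST",
    }],
}

print(missing_metadata(sparse))
```

The empty license string, the absent keywords and file list, and the creator without a persistent identifier are all reported.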
The strongly preferred method is to upload files directly. The following record represents an upload for Benchmark 1b using HiPerC. I would consider this metadata rich!
{
"@context": "https://schema.org/",
"@id": "https://doi.org/10.5281/zenodo.1124941",
"@type": "Dataset",
"name": "hiperc-gpu-cuda-spinodal",
"description": "Solution to the CHiMaD Phase Field benchmark problem on spinodal decomposition using CUDA, with a 9-point discrete Laplacian stencil",
"identifier": "https://doi.org/10.5281/zenodo.1124941",
"license": "https://creativecommons.org/licenses/by/4.0/legalcode",
"url": "https://zenodo.org/record/1124941",
"datePublished": "2017-12-21",
"creator": [{
"@type": "Person",
"@id": "https://orcid.org/0000-0002-2920-8302",
"givenName": "Trevor",
"familyName": "Keller",
"affiliation": "NIST"}],
"keywords": ["phase-field", "pfhub", "chimad"],
"sameAs": ["https://doi.org/10.6084/m9.figshare.5715103.v2"],
"distribution": [
{
"@type": "DataDownload",
"contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/free-energy-9pt.csv",
"encodingFormat": "csv"
}, {
"@type": "DataDownload",
"contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal.0000000.png",
"encodingFormat": "png"
}, {
"@type": "DataDownload",
"contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal.0100000.png",
"encodingFormat": "png"
}, {
"@type": "DataDownload",
"contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal.0200000.png",
"encodingFormat": "png"
}]
}
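One payoff of a rich record is that a script can find the files without a human reading the page. The sketch below groups the downloads from a trimmed copy of the record above by format. The helper `files_by_format` and the `BASE` constant are hypothetical names introduced for this example.

```python
from collections import defaultdict
from posixpath import basename
from urllib.parse import urlparse

# Shared URL prefix of the files in the record above.
BASE = "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/"

# Trimmed copy of the Zenodo record: only the parts this sketch reads.
record = {
    "@type": "Dataset",
    "distribution": [
        {"@type": "DataDownload",
         "contentUrl": BASE + "free-energy-9pt.csv",
         "encodingFormat": "csv"},
        {"@type": "DataDownload",
         "contentUrl": BASE + "spinodal.0000000.png",
         "encodingFormat": "png"},
        {"@type": "DataDownload",
         "contentUrl": BASE + "spinodal.0100000.png",
         "encodingFormat": "png"},
    ],
}


def files_by_format(record):
    """Group downloadable file names by their declared encodingFormat."""
    grouped = defaultdict(list)
    for item in record.get("distribution", []):
        if item.get("@type") != "DataDownload":
            continue
        # Keep only the file name from the end of the URL path.
        name = basename(urlparse(item["contentUrl"]).path)
        grouped[item.get("encodingFormat", "unknown")].append(name)
    return dict(grouped)


for fmt, names in files_by_format(record).items():
    print(fmt, names)
```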
After uploading the HiPerC simulation data, I also registered it with PFHub using meta.yaml. This file tells the website-generating machinery what to do with the dataset, and provides additional information about the resources required to perform the simulation.
---
benchmark:
  id: 1b
  version: '1'
data:
- name: run_time
  values:
  - sim_time: '200000'
    wall_time: '7464'
- name: memory_usage
  values:
  - unit: KB
    value: '308224'
- description: free energy data
  url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/free-energy-9pt.csv
- description: microstructure at t=0
  type: image
  url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal-000000.png
- description: microstructure at t=100,000
  type: image
  url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal-100000.png
- description: microstructure at t=200,000
  type: image
  url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal-200000.png
metadata:
  author:
    email: [email protected]
    first: Trevor
    github_id: tkphd
    last: Keller
  hardware:
    acc_architecture: gpu
    clock_rate: '1.48'
    cores: '1792'
    cpu_architecture: x86_64
    nodes: 1
    parallel_model: threaded
  implementation:
    repo:
      url: https://github.com/usnistgov/hiperc
      version: b25b14acda7c5aef565cdbcfc88f2df3412dcc46
  simulation_name: hiperc_cuda
  summary: HiPerC spinodal decomposition result using CUDA on a Tesla P100
  timestamp: 18 December, 2017
This file is not part of my dataset: it resides in the PFHub repository on GitHub. Furthermore, since the structure of this file specifically suits PFHub, it is of no use at all to other software, websites, or researchers.
In the Zenodo metadata above, note the @context fields: Schema.org is a structured data schema and controlled vocabulary for describing things on the Internet. How is this useful?
Consider the CodeMeta project. It creates metadata files for software projects using Schema.org building blocks. There's even a handy CodeMeta Generator! If you maintain a phase-field software framework, you can (and should!) use it to document your code in a standards-compliant, machine-readable format. This improves interoperability and reusability!
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"license": "https://spdx.org/licenses/CC-PDDC",
"codeRepository": "git+https://github.com/usnistgov/hiperc",
"dateCreated": "2017-08-07",
"dateModified": "2019-03-04",
"downloadUrl": "https://github.com/usnistgov/hiperc/releases/tag/v1.0",
"issueTracker": "https://github.com/usnistgov/hiperc/issues",
"name": "HiPerC",
"version": "1.0.0",
"description": "High-Performance Computing in C and CUDA",
"applicationCategory": "phase-field",
"developmentStatus": "inactive",
"programmingLanguage": ["C", "CUDA", "OpenCL", "OpenMP", "TBB"],
"author": [
{
"@type": "Person",
"@id": "https://orcid.org/0000-0002-2920-8302",
"givenName": "Trevor",
"familyName": "Keller",
"email": "[email protected]",
"affiliation": {
"@type": "Organization",
"name": "NIST"
}
}
]
}
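If you would rather script this than fill in the web form, a record like the one above is just a dictionary serialized to JSON. The following is a minimal sketch, not the CodeMeta Generator's own logic: the helper `codemeta_record` is hypothetical, it covers only a handful of the properties shown above, and it assumes a GitHub-style repository URL when deriving the issue tracker.

```python
import json


def codemeta_record(name, version, description, repository, languages):
    """Assemble a minimal CodeMeta 2.0 record as a plain dictionary."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "description": description,
        "codeRepository": repository,
        # Assumption: the issue tracker lives at <repository>/issues.
        "issueTracker": repository.rstrip("/") + "/issues",
        "programmingLanguage": list(languages),
    }


record = codemeta_record(
    name="HiPerC",
    version="1.0.0",
    description="High-Performance Computing in C and CUDA",
    repository="https://github.com/usnistgov/hiperc",
    languages=["C", "CUDA", "OpenCL", "OpenMP", "TBB"],
)

# By convention, this text would be saved as codemeta.json in the repository root.
text = json.dumps(record, indent=2)
print(text)
```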
That's nice! But what about our datasets? Shouldn't the PFHub metadata "describing" a dataset live alongside that data?
We are working to build a phase-field schema (or schemas) using Schema.org and the schemaorg Python library. The work-alike port of meta.yaml looks like the following.
N.B.: We're going to deploy a generator similar to CodeMeta's so you won't have to write this!
{
"@context": "https://www.schema.org",
"@type": "DataCatalog",
"author": [
{
"@type": "Person",
"affiliation": {
"@type": "GovernmentOrganization",
"name": "Materials Science and Engineering Division",
"parentOrganization": {
"@type": "GovernmentOrganization",
"name": "Material Measurement Laboratory",
"parentOrganization": {
"@type": "GovernmentOrganization",
"address": {
"@type": "PostalAddress",
"addressCountry": "US",
"addressLocality": "Gaithersburg",
"addressRegion": "Maryland",
"postalCode": "20899",
"streetAddress": "100 Bureau Drive"
},
"identifier": "NIST",
"name": "National Institute of Standards and Technology",
"parentOrganization": "U.S. Department of Commerce",
"url": "https://www.nist.gov"
}
}
},
"email": "[email protected]",
"familyName": "Keller",
"givenName": "Trevor",
"identifier": "tkphd",
"sameAs": "https://orcid.org/0000-0002-2920-8302"
}, {
"@type": "Person",
"affiliation": {
"@type": "GovernmentOrganization",
"name": "Materials Science and Engineering Division",
"parentOrganization": {
"@type": "GovernmentOrganization",
"name": "Material Measurement Laboratory",
"parentOrganization": {
"@type": "GovernmentOrganization",
"address": {
"@type": "PostalAddress",
"addressCountry": "US",
"addressLocality": "Gaithersburg",
"addressRegion": "Maryland",
"postalCode": "20899",
"streetAddress": "100 Bureau Drive"
},
"identifier": "NIST",
"name": "National Institute of Standards and Technology",
"parentOrganization": "U.S. Department of Commerce",
"url": "https://www.nist.gov"
}
}
},
"email": "[email protected]",
"familyName": "Wheeler",
"givenName": "Daniel",
"identifier": "wd15",
"sameAs": "https://orcid.org/0000-0002-2653-7418"
}
],
"dataset": [
{
"@type": "Dataset",
"distribution": [
{
"@type": "PropertyValue",
"name": "parallel_nodes",
"value": 1
}, {
"@type": "PropertyValue",
"name": "cpu_architecture",
"value": "amd64"
}, {
"@type": "PropertyValue",
"name": "parallel_cores",
"value": 12
}, {
"@type": "PropertyValue",
"name": "parallel_gpus",
"value": 1
}, {
"@type": "PropertyValue",
"name": "gpu_architecture",
"value": "nvidia"
}, {
"@type": "PropertyValue",
"name": "gpu_cores",
"value": 6144
}, {
"@type": "PropertyValue",
"name": "wall_time",
"unitCode": "SEC",
"unitText": "s",
"value": 384
}, {
"@type": "PropertyValue",
"name": "memory_usage",
"unitCode": "E63",
"unitText": "mebibyte",
"value": 1835
}
],
"name": "irl"
}, {
"@type": "Dataset",
"distribution": [
{
"@type": "DataDownload",
"contentUrl": "8a/free_energy_1.csv",
"name": "free energy"
}, {
"@type": "DataDownload",
"contentUrl": "8a/solid_fraction_1.csv",
"name": "solid fraction"
}, {
"@type": "DataDownload",
"contentUrl": "8a/free_energy_2.csv",
"name": "free energy"
}, {
"@type": "DataDownload",
"contentUrl": "8a/solid_fraction_2.csv",
"name": "solid fraction"
}, {
"@type": "DataDownload",
"contentUrl": "8a/free_energy_3.csv",
"name": "free energy"
}, {
"@type": "DataDownload",
"contentUrl": "8a/solid_fraction_3.csv",
"name": "solid fraction"
}
],
"name": "output"
}
],
"dateCreated": "2022-10-25T19:25:02+00:00",
"description": "A fake dataset for Benchmark 8a unprepared using FiPy by @tkphd & @wd15",
"isBasedOn": {
"@type": "SoftwareSourceCode",
"codeRepository": "https://github.com/tkphd/fake-pfhub-bm8a",
"description": "Fake benchmark 8a upload with FiPy",
"runtimePlatform": "fipy",
"targetProduct": "amd64",
"version": "9df6603e"
},
"isPartOf": {
"@type": "Series",
"identifier": "8a",
"name": "Homogeneous Nucleation",
"url": "https://pages.nist.gov/pfhub/benchmarks/benchmark8.ipynb"
},
"keywords": [
"phase-field",
"benchmarks",
"pfhub",
"fipy",
"homogeneous-nucleation"
],
"license": "https://www.nist.gov/open/license#software"
}
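Until that generator exists, the repetitive parts of a record like this are easy to script. The sketch below uses plain dictionaries rather than the schemaorg library, and the helper names (`property_value`, `data_download`, `dataset`) are hypothetical. It rebuilds a few of the resource entries and the full list of output files from the record above.

```python
import json


def property_value(name, value, unit_code=None, unit_text=None):
    """Build a Schema.org PropertyValue, adding unit fields only when given."""
    entry = {"@type": "PropertyValue", "name": name, "value": value}
    if unit_code is not None:
        entry["unitCode"] = unit_code
    if unit_text is not None:
        entry["unitText"] = unit_text
    return entry


def data_download(name, content_url):
    """Build a Schema.org DataDownload entry for one file."""
    return {"@type": "DataDownload", "contentUrl": content_url, "name": name}


def dataset(name, distribution):
    """Wrap a list of entries in a named Schema.org Dataset."""
    return {"@type": "Dataset", "distribution": distribution, "name": name}


# A few of the resource entries from the record above.
resources = dataset("irl", [
    property_value("parallel_nodes", 1),
    property_value("wall_time", 384, unit_code="SEC", unit_text="s"),
    property_value("memory_usage", 1835, unit_code="E63", unit_text="mebibyte"),
])

# Three runs, each with a free-energy file and a solid-fraction file.
output = dataset("output", [
    data_download(label, f"8a/{stem}_{run}.csv")
    for run in (1, 2, 3)
    for label, stem in (("free energy", "free_energy"),
                        ("solid fraction", "solid_fraction"))
])

catalog = {
    "@context": "https://www.schema.org",
    "@type": "DataCatalog",
    "dataset": [resources, output],
}
print(json.dumps(catalog, indent=2))
```

Author, license, and provenance blocks would be added to `catalog` in the same way.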