Data Generation and Curation
This page has moved to
https://pages.nist.gov/pf-recommended-practices/bp-guide-gh/ch3-data-generation-and-curation.html
- Trevor Keller, NIST, @tkphd
- Daniel Wheeler, NIST, @wd15
- Damien Pinto, McGill, @DamienPinto
- Look at lit on data and see how this is implemented
- Generation and dissemination
- Data formats
- FAIR
- Metadata (hierarchical data standards (look for current versions))
- What standards exist
- One or two examples of phase field data stored currently
- Use an existing
- Create our own example
- Practical choices for storing data (figshare, zenodo, dryad, MDF)
- Deciding what data to keep
- What data to store when publishing
- What is supplementary material versus store versus leave on hard drive
- Lit review, good citations
- minting DOIs for the data
- might include simulation execution and logging
- how frequently to store data
- how to store
- Intro (Daniel)
- What is data?
- What is metadata?
- Why do we need to curate data?
- What is the motivation for this document?
- What should the reader get out of this document
- Why is this useful
- Create a distinction between software, data, and post-processed results
- Data Generation (Trevor)
- HPC
- file systems
- data formats
- formats not to use (e.g., don't use serialization that depends on the version of the code that reads and writes because code changes)
- don't use pickles
- restarts
- data frequency
- post-processing -> refer to other document for this
- precision
- importance of folder structure
- Data Curation (Trevor)
- Why is curating data important?
- Why do we need to store our data
- What formats can we use
- Is my data too large, data sizes
- What is useful for third party / subsequent users
- How might the data be used for AI or something else
- Could a reviewer make use of your curated data
- Storing post-processed data and raw data and which to store or keep
- Minting DOIs for your software when publishing a paper
- FAIR
- Metadata standards (Daniel)
- Why do we need to keep some metadata beyond the data
- Zoo of things like data dictionaries, ontologies
- however, these are not well developed for our use case
- For example, you curate on Zenodo
- what extra data should you include
- how to describe the data files
- how to maintain some minimalist info about the simulation that generated the run
- When, why and how I ran this simulation
- What software
- Give example of a yaml file with 10 flat fields
- The future should be better in this regard. People actively working to improve this issue.
- Examples
- Practical examples (Trevor)
- Using Zenodo for a PFHub record to store data and metadata
- Relatively rich metadata scheme
- Simulation from scratch (Damien)
- data generation
- folder structure
- HPC issues with data
- capture process / descriptive parameters for the data that are useful for subsequent ML practitioners that use the data
- ML / store data
- Narrative of what gets stored to disk
- Decisions of what to keep and how frequently to save data
- Auxiliary metadata decisions
- data generation
- Summary (Daniel)
- Biblio (Daniel)
- Save as much of the data from your published work as possible, along with its metadata
- Save the inputs used to produce the results from all your published work
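The outline above calls for an example of a YAML file with about ten flat fields that records when, why, and how a simulation was run. A minimal sketch in Python follows, using only the standard library. All field names and values are hypothetical placeholders, not a PFHub or community standard. Each value is written with `json.dumps`, on the assumption that the reader parses YAML 1.2, where JSON scalars are also valid YAML.

```python
import json
from datetime import datetime, timezone


def flat_metadata_yaml(fields):
    """Render a flat dict as 'key: value' lines (a simple YAML mapping).

    Values must be scalars; nested containers are rejected to keep the file flat.
    """
    lines = []
    for key, value in fields.items():
        if isinstance(value, (dict, list, tuple, set)):
            raise ValueError(f"field {key!r} is not a flat scalar")
        # json.dumps quotes strings and leaves numbers bare.
        lines.append(f"{key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


# Hypothetical example values; replace them with the details of your own run.
record = {
    "title": "Spinodal decomposition, benchmark 1b",
    "author": "A. Researcher",
    "date_run": datetime(2022, 5, 11, tzinfo=timezone.utc).isoformat(),
    "purpose": "Reproduce published free-energy curve",
    "software_name": "my-phase-field-code",
    "software_version": "v0.1.0",
    "input_file": "input.yaml",
    "command": "./run.sh input.yaml",
    "hardware": "1 node, 1 GPU",
    "output_files": "free_energy.csv",
}

text = flat_metadata_yaml(record)
print(text)
```

Writing `text` to a file stored next to the simulation output keeps the description and the data together when the folder is archived or uploaded.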
We discussed the FAIR (Findable, Accessible, Interoperable, Reusable) Principles at CHiMaD Phase-Field XIII:
- F1. (Meta)data are assigned a globally unique and persistent identifier
- F2. Data are described with rich metadata (defined by R1 below)
- F3. Metadata clearly and explicitly include the identifier of the data they describe
- F4. (Meta)data are registered or indexed in a searchable resource
- A1. (Meta)data are retrievable by their identifier using a standardized communications protocol
- A1.1. The protocol is open, free, and universally implementable
- A1.2. The protocol allows for an authentication and authorisation procedure, where necessary
- A2. Metadata are accessible, even when the data are no longer available
- I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
- I2. (Meta)data use vocabularies that follow FAIR principles
- I3. (Meta)data include qualified references to other (meta)data
- R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
- R1.1. (Meta)data are released with a clear and accessible data usage license
- R1.2. (Meta)data are associated with detailed provenance
- R1.3. (Meta)data meet domain-relevant community standards
Historically, PFHub has accepted datasets linked from any host on the Web. At this time, we recommend using Zenodo to host your benchmark data. Why? It's not "just" a shared folder.
- Guided prompts to describe what you're uploading
- DOI is automatically assigned to your dataset
- Basic metadata exported in multiple formats
- Browser-based viewers for CSV, Markdown, PDF, images, videos
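Because each Zenodo dataset gets a DOI, its metadata can in principle be fetched by identifier over plain HTTPS, which is what the FAIR accessibility principles ask for. The sketch below builds such a request with Python's standard library. It assumes that the DOI resolver honours HTTP content negotiation and that `application/vnd.schemaorg.ld+json` is an accepted media type; check your repository's documentation before relying on either. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed media type for Schema.org JSON-LD via DOI content negotiation.
SCHEMA_ORG_JSONLD = "application/vnd.schemaorg.ld+json"
DOI_RESOLVER = "https://doi.org/"


def doi_metadata_request(doi, media_type=SCHEMA_ORG_JSONLD):
    """Build (but do not send) an HTTP request for a DOI's metadata."""
    doi = doi.strip()
    # Accept either a bare DOI or a full https://doi.org/ URL.
    if doi.startswith(DOI_RESOLVER):
        doi = doi[len(DOI_RESOLVER):]
    return urllib.request.Request(DOI_RESOLVER + doi,
                                  headers={"Accept": media_type})


def fetch_doi_metadata(doi, timeout=30):
    """Send the request and parse the JSON response (needs network access)."""
    request = doi_metadata_request(doi)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = doi_metadata_request("10.5281/zenodo.1124941")
print(request.full_url, request.get_header("Accept"))
```

Calling `fetch_doi_metadata("10.5281/zenodo.1124941")` on a networked machine should return a dictionary if the assumptions above hold; the sketch does not handle HTTP errors or retries.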
Zenodo gives you the option to import a repository directly from GitHub. The original FAIR Phase-field talk was "uploaded" this way, producing the following record. While basic authorship information was captured, this tells an interested person or machine nothing meaningful about the dataset.
{
"@context": "https://schema.org/",
"@id": "https://doi.org/10.5281/zenodo.6540105",
"@type": "SoftwareSourceCode",
"name": "tkphd/fair-phase-field-data: CHiMaD Phase-field XIII",
"description": "FAIR Principles for Phase-Field Practitioners",
"version": "v0.1.0",
"license": "",
"identifier": "https://doi.org/10.5281/zenodo.6540105",
"url": "https://zenodo.org/record/6540105",
"datePublished": "2022-05-11",
"creator": [{
"@type": "Person",
"givenName": "Trevor",
"familyName": "Keller",
"affiliation": "NIST"}],
"codeRepository": "https://github.com/tkphd/fair-phase-field-data/tree/v0.1.0"
}
The strongly preferred method is to upload files directly. The following record represents an upload for Benchmark 1b using HiPerC. I would consider this metadata rich!
{
"@context": "https://schema.org/",
"@id": "https://doi.org/10.5281/zenodo.1124941",
"@type": "Dataset",
"name": "hiperc-gpu-cuda-spinodal",
"description": "Solution to the CHiMaD Phase Field benchmark problem on spinodal decomposition using CUDA, with a 9-point discrete Laplacian stencil",
"identifier": "https://doi.org/10.5281/zenodo.1124941",
"license": "https://creativecommons.org/licenses/by/4.0/legalcode",
"url": "https://zenodo.org/record/1124941",
"datePublished": "2017-12-21",
"creator": [{
"@type": "Person",
"@id": "https://orcid.org/0000-0002-2920-8302",
"givenName": "Trevor",
"familyName": "Keller",
"affiliation": "NIST"}],
"keywords": ["phase-field", "pfhub", "chimad"],
"sameAs": ["https://doi.org/10.6084/m9.figshare.5715103.v2"],
"distribution": [
{
"@type": "DataDownload",
"contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/free-energy-9pt.csv",
"encodingFormat": "csv"
}, {
"@type": "DataDownload",
"contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal.0000000.png",
"encodingFormat": "png"
}, {
"@type": "DataDownload",
"contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal.0100000.png",
"encodingFormat": "png"
}, {
"@type": "DataDownload",
"contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal.0200000.png",
"encodingFormat": "png"
}]
}
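The difference between the two records above can be checked mechanically. The following Python sketch flags a few gaps in a Schema.org-style record: a missing identifier, an empty license, missing keywords, no listed file downloads, and creators without a persistent `@id`. The choice of checks is an illustrative assumption, not an official FAIR or Zenodo validation, and the two dictionaries are abbreviated copies of the records shown above.

```python
def metadata_gaps(record):
    """Return a list of human-readable gaps found in a metadata record."""
    gaps = []
    if not record.get("identifier"):
        gaps.append("no persistent identifier")
    if not record.get("license"):
        gaps.append("no license")
    if not record.get("keywords"):
        gaps.append("no keywords")
    if not record.get("distribution"):
        gaps.append("no file downloads listed")
    for person in record.get("creator", []):
        if not person.get("@id"):
            gaps.append(f"creator {person.get('familyName', '?')} has no @id")
    return gaps


# Abbreviated copy of the record produced by the GitHub import.
github_import = {
    "@type": "SoftwareSourceCode",
    "identifier": "https://doi.org/10.5281/zenodo.6540105",
    "license": "",
    "creator": [{"@type": "Person", "familyName": "Keller"}],
}

# Abbreviated copy of the record produced by the direct upload.
direct_upload = {
    "@type": "Dataset",
    "identifier": "https://doi.org/10.5281/zenodo.1124941",
    "license": "https://creativecommons.org/licenses/by/4.0/legalcode",
    "creator": [{"@type": "Person",
                 "@id": "https://orcid.org/0000-0002-2920-8302",
                 "familyName": "Keller"}],
    "keywords": ["phase-field", "pfhub", "chimad"],
    "distribution": [{"@type": "DataDownload", "encodingFormat": "csv"}],
}

for label, rec in [("GitHub import", github_import),
                   ("direct upload", direct_upload)]:
    print(label, "->", metadata_gaps(rec) or "no gaps found")
```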
After uploading the HiPerC simulation data, I also registered it with PFHub using meta.yaml. This file tells the website-generating machinery what to do with the dataset, and provides additional information about the resources required to perform the simulation.
---
benchmark:
  id: 1b
  version: '1'
data:
  - name: run_time
    values:
      - sim_time: '200000'
        wall_time: '7464'
  - name: memory_usage
    values:
      - unit: KB
        value: '308224'
  - description: free energy data
    url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/free-energy-9pt.csv
  - description: microstructure at t=0
    type: image
    url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal-000000.png
  - description: microstructure at t=100,000
    type: image
    url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal-100000.png
  - description: microstructure at t=200,000
    type: image
    url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal-200000.png
metadata:
  author:
    email: [email protected]
    first: Trevor
    github_id: tkphd
    last: Keller
  hardware:
    acc_architecture: gpu
    clock_rate: '1.48'
    cores: '1792'
    cpu_architecture: x86_64
    nodes: 1
    parallel_model: threaded
  implementation:
    repo:
      url: https://github.com/usnistgov/hiperc
      version: b25b14acda7c5aef565cdbcfc88f2df3412dcc46
  simulation_name: hiperc_cuda
  summary: HiPerC spinodal decomposition result using CUDA on a Tesla P100
  timestamp: 18 December, 2017
This file is not part of my dataset: it resides in the PFHub repository on GitHub. Furthermore, since the structure of this file specifically suits PFHub, it is of no use at all to other software, websites, or researchers.
In the Zenodo metadata above, note the @context fields: Schema.org is a structured data schema and controlled vocabulary for describing things on the Internet. How is this useful?
Consider the CodeMeta project. It creates metadata files for software projects using Schema.org building blocks. There's even a handy CodeMeta Generator! If you maintain a phase-field software framework, you can (and should!) use it to document your code in a standards-compliant, machine-readable format. This improves interoperability and reusability!
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"license": "https://spdx.org/licenses/CC-PDDC",
"codeRepository": "git+https://github.com/usnistgov/hiperc",
"dateCreated": "2017-08-07",
"dateModified": "2019-03-04",
"downloadUrl": "https://github.com/usnistgov/hiperc/releases/tag/v1.0",
"issueTracker": "https://github.com/usnistgov/hiperc/issues",
"name": "HiPerC",
"version": "1.0.0",
"description": "High-Performance Computing in C and CUDA",
"applicationCategory": "phase-field",
"developmentStatus": "inactive",
"programmingLanguage": ["C", "CUDA", "OpenCL", "OpenMP", "TBB"],
"author": [
{
"@type": "Person",
"@id": "https://orcid.org/0000-0002-2920-8302",
"givenName": "Trevor",
"familyName": "Keller",
"email": "[email protected]",
"affiliation": {
"@type": "Organization",
"name": "NIST"
}
}
]
}
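Person identifiers in records like this are most useful to machines when they are well-formed. As a small illustration, the Python sketch below validates an ORCID iD using the ISO 7064 MOD 11-2 check character that ORCID documents for its identifiers, and returns the canonical `https://orcid.org/` URL. The function names are hypothetical, and the check only confirms that an iD is well-formed, not that it belongs to the named person.

```python
import re

# An ORCID iD is four hyphen-separated groups; the last character may be X.
ORCID_PATTERN = re.compile(r"(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def canonical_orcid(value):
    """Return 'https://orcid.org/<iD>' or raise ValueError if malformed."""
    match = ORCID_PATTERN.search(value.strip())
    if match is None:
        raise ValueError(f"no ORCID iD found in {value!r}")
    orcid = match.group(1)
    digits = orcid.replace("-", "")
    if orcid_check_character(digits[:15]) != digits[15]:
        raise ValueError(f"bad check character in {orcid}")
    return "https://orcid.org/" + orcid


print(canonical_orcid("0000-0002-2920-8302"))
```

Running a check like this before publishing a metadata file catches transcription errors in author identifiers early.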
That's nice! But what about our datasets? Shouldn't the PFHub metadata "describing" a dataset live alongside that data?
We are working to build a phase-field schema (or schemas) using Schema.org and the schemaorg Python library. The work-alike port of meta.yaml looks like the following.
N.B.: We're going to deploy a generator similar to CodeMeta's so you won't have to write this!
{
"@context": "https://www.schema.org",
"@type": "DataCatalog",
"author": [
{
"@type": "Person",
"affiliation": {
"@type": "GovernmentOrganization",
"name": "Materials Science and Engineering Division",
"parentOrganization": {
"@type": "GovernmentOrganization",
"name": "Material Measurement Laboratory",
"parentOrganization": {
"@type": "GovernmentOrganization",
"address": {
"@type": "PostalAddress",
"addressCountry": "US",
"addressLocality": "Gaithersburg",
"addressRegion": "Maryland",
"postalCode": "20899",
"streetAddress": "100 Bureau Drive"
},
"identifier": "NIST",
"name": "National Institute of Standards and Technology",
"parentOrganization": "U.S. Department of Commerce",
"url": "https://www.nist.gov"
}
}
},
"email": "[email protected]",
"familyName": "Keller",
"givenName": "Trevor",
"identifier": "tkphd",
"sameAs": "https://orcid.org/0000-0002-2920-8302"
}, {
"@type": "Person",
"affiliation": {
"@type": "GovernmentOrganization",
"name": "Materials Science and Engineering Division",
"parentOrganization": {
"@type": "GovernmentOrganization",
"name": "Material Measurement Laboratory",
"parentOrganization": {
"@type": "GovernmentOrganization",
"address": {
"@type": "PostalAddress",
"addressCountry": "US",
"addressLocality": "Gaithersburg",
"addressRegion": "Maryland",
"postalCode": "20899",
"streetAddress": "100 Bureau Drive"
},
"identifier": "NIST",
"name": "National Institute of Standards and Technology",
"parentOrganization": "U.S. Department of Commerce",
"url": "https://www.nist.gov"
}
}
},
"email": "[email protected]",
"familyName": "Wheeler",
"givenName": "Daniel",
"identifier": "wd15",
"sameAs": "https://orcid.org/0000-0002-2653-7418"
}
],
"dataset": [
{
"@type": "Dataset",
"distribution": [
{
"@type": "PropertyValue",
"name": "parallel_nodes",
"value": 1
}, {
"@type": "PropertyValue",
"name": "cpu_architecture",
"value": "amd64"
}, {
"@type": "PropertyValue",
"name": "parallel_cores",
"value": 12
}, {
"@type": "PropertyValue",
"name": "parallel_gpus",
"value": 1
}, {
"@type": "PropertyValue",
"name": "gpu_architecture",
"value": "nvidia"
}, {
"@type": "PropertyValue",
"name": "gpu_cores",
"value": 6144
}, {
"@type": "PropertyValue",
"name": "wall_time",
"unitCode": "SEC",
"unitText": "s",
"value": 384
}, {
"@type": "PropertyValue",
"name": "memory_usage",
"unitCode": "E63",
"unitText": "mebibyte",
"value": 1835
}
],
"name": "irl"
}, {
"@type": "Dataset",
"distribution": [
{
"@type": "DataDownload",
"contentUrl": "8a/free_energy_1.csv",
"name": "free energy"
}, {
"@type": "DataDownload",
"contentUrl": "8a/solid_fraction_1.csv",
"name": "solid fraction"
}, {
"@type": "DataDownload",
"contentUrl": "8a/free_energy_2.csv",
"name": "free energy"
}, {
"@type": "DataDownload",
"contentUrl": "8a/solid_fraction_2.csv",
"name": "solid fraction"
}, {
"@type": "DataDownload",
"contentUrl": "8a/free_energy_3.csv",
"name": "free energy"
}, {
"@type": "DataDownload",
"contentUrl": "8a/solid_fraction_3.csv",
"name": "solid fraction"
}
],
"name": "output"
}
],
"dateCreated": "2022-10-25T19:25:02+00:00",
"description": "A fake dataset for Benchmark 8a unprepared using FiPy by @tkphd & @wd15",
"isBasedOn": {
"@type": "SoftwareSourceCode",
"codeRepository": "https://github.com/tkphd/fake-pfhub-bm8a",
"description": "Fake benchmark 8a upload with FiPy",
"runtimePlatform": "fipy",
"targetProduct": "amd64",
"version": "9df6603e"
},
"isPartOf": {
"@type": "Series",
"identifier": "8a",
"name": "Homogeneous Nucleation",
"url": "https://pages.nist.gov/pfhub/benchmarks/benchmark8.ipynb"
},
"keywords": [
"phase-field",
"benchmarks",
"pfhub",
"fipy",
"homogeneous-nucleation"
],
"license": "https://www.nist.gov/open/license#software"
}
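To illustrate the kind of translation such a generator might perform, here is a minimal Python sketch that turns a flat mapping of hardware and resource fields into Schema.org `PropertyValue` entries like those in the record above. It uses plain dictionaries and the standard `json` module rather than the schemaorg library, and the function name, input layout, and unit table are assumptions made for illustration; it is not the PFHub generator.

```python
import json

# Assumed unit annotations, copied from the example record above.
UNITS = {
    "wall_time": {"unitCode": "SEC", "unitText": "s"},
    "memory_usage": {"unitCode": "E63", "unitText": "mebibyte"},
}


def to_property_values(fields):
    """Convert {name: value} pairs into Schema.org PropertyValue dicts."""
    entries = []
    for name, value in fields.items():
        entry = {"@type": "PropertyValue", "name": name}
        # Attach unit information only for fields that have a known unit.
        entry.update(UNITS.get(name, {}))
        entry["value"] = value
        entries.append(entry)
    return entries


# Values taken from the example record; the flat layout is hypothetical.
resources = {
    "parallel_nodes": 1,
    "cpu_architecture": "amd64",
    "parallel_cores": 12,
    "wall_time": 384,
    "memory_usage": 1835,
}

property_values = to_property_values(resources)
print(json.dumps(property_values, indent=2))
```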