
Data Generation and Curation

Daniel Wheeler edited this page Mar 21, 2024 · 14 revisions

Trevor Keller (@tkphd) and Daniel Wheeler (@wd15)

  • Save as much of the data from your published work as possible, along with its metadata
  • Save the inputs used to produce the results from all your published work

FAIR Data

We discussed the FAIR Principles at CHiMaD Phase-Field XIII:

Findable

  • (Meta)data are assigned a globally unique and persistent identifier
  • Data are described with rich metadata (defined by R1, the first principle under Reusable below)
  • Metadata clearly and explicitly include the identifier of the data they describe
  • (Meta)data are registered or indexed in a searchable resource

Accessible

  • (Meta)data are retrievable by their identifier using a standardized communications protocol
    • The protocol is open, free, and universally implementable
    • The protocol allows for an authentication and authorisation procedure, where necessary
  • Metadata are accessible, even when the data are no longer available

Interoperable

  • (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
  • (Meta)data use vocabularies that follow FAIR principles
  • (Meta)data include qualified references to other (meta)data

Reusable

  • (Meta)data are richly described with a plurality of accurate and relevant attributes
    • (Meta)data are released with a clear and accessible data usage license
    • (Meta)data are associated with detailed provenance
    • (Meta)data meet domain-relevant community standards
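
To make the Findable and Accessible principles concrete, here is a minimal sketch of retrieving metadata by identifier over HTTPS. It assumes the DOI resolver supports content negotiation and honors the `application/vnd.schemaorg.ld+json` media type for this record; the function names are hypothetical, and `fetch_metadata` needs network access, so only the request construction is exercised here.

```python
import json
import urllib.request

# Assumption: the DOI resolver honors this media type for the record.
SCHEMA_ORG_JSONLD = "application/vnd.schemaorg.ld+json"


def build_metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) an HTTPS request for a DOI's metadata."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": SCHEMA_ORG_JSONLD},
    )


def fetch_metadata(doi: str, timeout: float = 10.0) -> dict:
    """Resolve the DOI and parse the JSON-LD response (needs network access)."""
    request = build_metadata_request(doi)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# The DOI below is the HiPerC record discussed later on this page.
request = build_metadata_request("10.5281/zenodo.1124941")
print(request.full_url)
print(request.get_header("Accept"))
```

The identifier is the only thing a reader or a machine needs to know in advance: the protocol (HTTPS) is open and free, and the `Accept` header asks for a machine-readable representation instead of the landing page.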

Zenodo

Historically, PFHub has accepted datasets linked from any host on the Web. At this time, we recommend using Zenodo to host your benchmark data. Why? It's not "just" a shared folder.

  • Guided prompts to describe what you're uploading
  • DOI is automatically assigned to your dataset
  • Basic metadata exported in multiple formats
  • Browser-based viewers for CSV, Markdown, PDF, images, videos

Metadata Examples

Zenodo gives you the option to import a repository directly from GitHub. The original FAIR Phase-field talk was "uploaded" this way, producing the following record. While basic authorship information was captured, this tells an interested person or machine nothing meaningful about the dataset.

{
  "@context": "https://schema.org/",
  "@id": "https://doi.org/10.5281/zenodo.6540105", 
  "@type": "SoftwareSourceCode", 
  "name": "tkphd/fair-phase-field-data: CHiMaD Phase-field XIII",
  "description": "FAIR Principles for Phase-Field Practitioners", 
  "version": "v0.1.0",
  "license": "", 
  "identifier": "https://doi.org/10.5281/zenodo.6540105", 
  "url": "https://zenodo.org/record/6540105", 
  "datePublished": "2022-05-11", 
  "creator": [{
      "@type": "Person", 
      "givenName": "Trevor",
      "familyName":  "Keller",
      "affiliation": "NIST"}], 
  "codeRepository": "https://github.com/tkphd/fair-phase-field-data/tree/v0.1.0"
}

The strongly preferred method is to upload files directly. The following record represents an upload for Benchmark 1b using HiPerC. I would consider this metadata rich!

{
  "@context": "https://schema.org/", 
  "@id": "https://doi.org/10.5281/zenodo.1124941", 
  "@type": "Dataset", 
  "name": "hiperc-gpu-cuda-spinodal",
  "description": "Solution to the CHiMaD Phase Field benchmark problem on spinodal decomposition using CUDA, with a 9-point discrete Laplacian stencil", 
  "identifier": "https://doi.org/10.5281/zenodo.1124941", 
  "license": "https://creativecommons.org/licenses/by/4.0/legalcode", 
  "url": "https://zenodo.org/record/1124941", 
  "datePublished": "2017-12-21", 
  "creator": [{
      "@type": "Person", 
      "@id": "https://orcid.org/0000-0002-2920-8302", 
      "givenName": "Trevor",
      "familyName":  "Keller",
      "affiliation": "NIST"}], 
  "keywords": ["phase-field", "pfhub", "chimad"], 
  "sameAs": ["https://doi.org/10.6084/m9.figshare.5715103.v2"], 
  "distribution": [
    {
      "@type": "DataDownload",
      "contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/free-energy-9pt.csv", 
      "encodingFormat": "csv"
    }, {
      "@type": "DataDownload",
      "contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal.0000000.png", 
      "encodingFormat": "png"
    }, {
      "@type": "DataDownload",
      "contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal.0100000.png", 
      "encodingFormat": "png" 
    }, {
      "@type": "DataDownload",
      "contentUrl": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal.0200000.png", 
      "encodingFormat": "png" 
    }]
}
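
The difference between the two records can be checked mechanically. The sketch below is only an illustration: the checklist in `describe_gaps` is my own assumption about what "rich" might mean, not a rule enforced by Zenodo or PFHub, and the two dictionaries are abbreviated copies of the records above.

```python
def describe_gaps(record: dict) -> list[str]:
    """List what a Schema.org record is missing, per a hypothetical checklist."""
    gaps = []
    if record.get("@type") != "Dataset":
        gaps.append("not typed as a Dataset")
    if not record.get("license"):
        gaps.append("no license")
    if not record.get("keywords"):
        gaps.append("no keywords")
    if not record.get("distribution"):
        gaps.append("no downloadable files listed")
    creators = record.get("creator", [])
    if not creators or not all("@id" in person for person in creators):
        gaps.append("creator without a persistent identifier")
    return gaps


# Abbreviated from the GitHub-import record above.
github_import = {
    "@type": "SoftwareSourceCode",
    "license": "",
    "creator": [{"@type": "Person", "givenName": "Trevor", "familyName": "Keller"}],
}

# Abbreviated from the direct-upload record above.
direct_upload = {
    "@type": "Dataset",
    "license": "https://creativecommons.org/licenses/by/4.0/legalcode",
    "keywords": ["phase-field", "pfhub", "chimad"],
    "creator": [{"@type": "Person", "@id": "https://orcid.org/0000-0002-2920-8302"}],
    "distribution": [{"@type": "DataDownload", "encodingFormat": "csv"}],
}

for label, record in [("GitHub import", github_import), ("direct upload", direct_upload)]:
    print(label, "->", describe_gaps(record) or "no gaps found")
```

With these inputs, the GitHub import trips every check while the direct upload passes all of them, which matches the impression you get from reading the two records side by side.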

Metadata Files

After uploading the HiPerC simulation data, I also registered it with PFHub using meta.yaml. This file tells the website-generating machinery what to do with the dataset, and provides additional information about the resources required to perform the simulation.

---
benchmark:
  id: 1b
  version: '1'
data:
- name: run_time
  values:
  - sim_time: '200000'
    wall_time: '7464'
- name: memory_usage
  values:
  - unit: KB
    value: '308224'
- description: free energy data
  url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/free-energy-9pt.csv
- description: microstructure at t=0
  type: image
  url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal-000000.png
- description: microstructure at t=100,000
  type: image
  url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal-100000.png
- description: microstructure at t=200,000
  type: image
  url: https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/spinodal-200000.png
metadata:
  author:
    email: [email protected]
    first: Trevor
    github_id: tkphd
    last: Keller
  hardware:
    acc_architecture: gpu
    clock_rate: '1.48'
    cores: '1792'
    cpu_architecture: x86_64
    nodes: 1
    parallel_model: threaded
  implementation:
    repo:
      url: https://github.com/usnistgov/hiperc
      version: b25b14acda7c5aef565cdbcfc88f2df3412dcc46
  simulation_name: hiperc_cuda
  summary: HiPerC spinodal decomposition result using CUDA on a Tesla P100
  timestamp: 18 December, 2017

This file is not part of my dataset: it resides in the PFHub repository on GitHub. Furthermore, since the structure of this file specifically suits PFHub, it is of no use at all to other software, websites, or researchers.

Structured Data Schemas

In the Zenodo metadata above, note the @context fields: Schema.org is a structured data schema and controlled vocabulary for describing things on the Internet. How is this useful?

Consider the CodeMeta project. It creates metadata files for software projects using Schema.org building blocks. There's even a handy CodeMeta Generator! If you maintain a phase-field software framework, you can (and should!) use it to document your code in a standards-compliant, machine-readable format. This improves interoperability and reusability!

{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "license": "https://spdx.org/licenses/CC-PDDC",
    "codeRepository": "git+https://github.com/usnistgov/hiperc",
    "dateCreated": "2017-08-07",
    "dateModified": "2019-03-04",
    "downloadUrl": "https://github.com/usnistgov/hiperc/releases/tag/v1.0",
    "issueTracker": "https://github.com/usnistgov/hiperc/issues",
    "name": "HiPerC",
    "version": "1.0.0",
    "description": "High-Performance Computing in C and CUDA",
    "applicationCategory": "phase-field",
    "developmentStatus": "inactive",
    "programmingLanguage": ["C", "CUDA", "OpenCL", "OpenMP", "TBB"],
    "author": [
        {
            "@type": "Person",
            "@id": "https://orcid.org/0000-0002-2920-8302",
            "givenName": "Trevor",
            "familyName": "Keller",
            "email": "[email protected]",
            "affiliation": {
                "@type": "Organization",
                "name": "NIST"
            }
        }
    ]
}

That's nice! But what about our datasets? Shouldn't the PFHub metadata "describing" a dataset live alongside that data?

Towards a Phase-Field Schema

We are working to build a phase-field schema (or schemas) using Schema.org and the schemaorg Python library. The work-alike port of meta.yaml looks like the following.

N.B.: We're going to deploy a generator similar to CodeMeta's so you won't have to write this!

{
    "@context": "https://www.schema.org",
    "@type": "DataCatalog",
    "author": [
        {
            "@type": "Person",
            "affiliation": {
                "@type": "GovernmentOrganization",
                "name": "Materials Science and Engineering Division",
                "parentOrganization": {
                    "@type": "GovernmentOrganization",
                    "name": "Material Measurement Laboratory",
                    "parentOrganization": {
                        "@type": "GovernmentOrganization",
                        "address": {
                            "@type": "PostalAddress",
                            "addressCountry": "US",
                            "addressLocality": "Gaithersburg",
                            "addressRegion": "Maryland",
                            "postalCode": "20899",
                            "streetAddress": "100 Bureau Drive"
                        },
                        "identifier": "NIST",
                        "name": "National Institute of Standards and Technology",
                        "parentOrganization": "U.S. Department of Commerce",
                        "url": "https://www.nist.gov"
                    }
                }
            },
            "email": "[email protected]",
            "familyName": "Keller",
            "givenName": "Trevor",
            "identifier": "tkphd",
            "sameAs": "https://orcid.org/0000-0002-2920-8302"
        }, {
            "@type": "Person",
            "affiliation": {
                "@type": "GovernmentOrganization",
                "name": "Materials Science and Engineering Division",
                "parentOrganization": {
                    "@type": "GovernmentOrganization",
                    "name": "Material Measurement Laboratory",
                    "parentOrganization": {
                        "@type": "GovernmentOrganization",
                        "address": {
                            "@type": "PostalAddress",
                            "addressCountry": "US",
                            "addressLocality": "Gaithersburg",
                            "addressRegion": "Maryland",
                            "postalCode": "20899",
                            "streetAddress": "100 Bureau Drive"
                        },
                        "identifier": "NIST",
                        "name": "National Institute of Standards and Technology",
                        "parentOrganization": "U.S. Department of Commerce",
                        "url": "https://www.nist.gov"
                    }
                }
            },
            "email": "[email protected]",
            "familyName": "Wheeler",
            "givenName": "Daniel",
            "identifier": "wd15",
            "sameAs": "https://orcid.org/0000-0002-2653-7418"
        }
    ],
    "dataset": [
        {
            "@type": "Dataset",
            "distribution": [
                {
                    "@type": "PropertyValue",
                    "name": "parallel_nodes",
                    "value": 1
                }, {
                    "@type": "PropertyValue",
                    "name": "cpu_architecture",
                    "value": "amd64"
                }, {
                    "@type": "PropertyValue",
                    "name": "parallel_cores",
                    "value": 12
                }, {
                    "@type": "PropertyValue",
                    "name": "parallel_gpus",
                    "value": 1
                }, {
                    "@type": "PropertyValue",
                    "name": "gpu_architecture",
                    "value": "nvidia"
                }, {
                    "@type": "PropertyValue",
                    "name": "gpu_cores",
                    "value": 6144
                }, {
                    "@type": "PropertyValue",
                    "name": "wall_time",
                    "unitCode": "SEC",
                    "unitText": "s",
                    "value": 384
                }, {
                    "@type": "PropertyValue",
                    "name": "memory_usage",
                    "unitCode": "E63",
                    "unitText": "mebibyte",
                    "value": 1835
                }
            ],
            "name": "irl"
        }, {
            "@type": "Dataset",
            "distribution": [
                {
                    "@type": "DataDownload",
                    "contentUrl": "8a/free_energy_1.csv",
                    "name": "free energy"
                }, {
                    "@type": "DataDownload",
                    "contentUrl": "8a/solid_fraction_1.csv",
                    "name": "solid fraction"
                }, {
                    "@type": "DataDownload",
                    "contentUrl": "8a/free_energy_2.csv",
                    "name": "free energy"
                }, {
                    "@type": "DataDownload",
                    "contentUrl": "8a/solid_fraction_2.csv",
                    "name": "solid fraction"
                }, {
                    "@type": "DataDownload",
                    "contentUrl": "8a/free_energy_3.csv",
                    "name": "free energy"
                }, {
                    "@type": "DataDownload",
                    "contentUrl": "8a/solid_fraction_3.csv",
                    "name": "solid fraction"
                }
            ],
            "name": "output"
        }
    ],
    "dateCreated": "2022-10-25T19:25:02+00:00",
    "description": "A fake dataset for Benchmark 8a unprepared using FiPy by @tkphd & @wd15",
    "isBasedOn": {
        "@type": "SoftwareSourceCode",
        "codeRepository": "https://github.com/tkphd/fake-pfhub-bm8a",
        "description": "Fake benchmark 8a upload with FiPy",
        "runtimePlatform": "fipy",
        "targetProduct": "amd64",
        "version": "9df6603e"
    },
    "isPartOf": {
        "@type": "Series",
        "identifier": "8a",
        "name": "Homogeneous Nucleation",
        "url": "https://pages.nist.gov/pfhub/benchmarks/benchmark8.ipynb"
    },
    "keywords": [
        "phase-field",
        "benchmarks",
        "pfhub",
        "fipy",
        "homogeneous-nucleation"
    ],
    "license": "https://www.nist.gov/open/license#software"
}
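
To give a feel for what such a generator might do, here is a rough sketch that maps a trimmed-down `meta.yaml` structure onto the layout of the record above, using only the standard library. It is not the planned PFHub generator and does not use the schemaorg library; `property_values` and `to_data_catalog` are hypothetical names, the input is a hand-written dictionary rather than a parsed file, and converting every value with `int` is an assumption that happens to hold for this example.

```python
import json


def property_values(entry: dict) -> list[dict]:
    """Turn one meta.yaml 'values' entry into Schema.org PropertyValue items."""
    items = []
    for value in entry["values"]:
        if "value" in value:
            # e.g. memory_usage: a single number, optionally with a unit
            item = {
                "@type": "PropertyValue",
                "name": entry["name"],
                "value": int(value["value"]),
            }
            if "unit" in value:
                item["unitText"] = value["unit"]
            items.append(item)
        else:
            # e.g. run_time: several named numbers in one mapping
            for name, number in value.items():
                items.append({"@type": "PropertyValue", "name": name, "value": int(number)})
    return items


def to_data_catalog(meta: dict) -> dict:
    """Map a meta.yaml-style dict onto the DataCatalog layout shown above."""
    resources, downloads = [], []
    for entry in meta["data"]:
        if "url" in entry:
            downloads.append({
                "@type": "DataDownload",
                "contentUrl": entry["url"],
                "name": entry["description"],
            })
        else:
            resources.extend(property_values(entry))
    author = meta["metadata"]["author"]
    return {
        "@context": "https://schema.org/",
        "@type": "DataCatalog",
        "author": [{
            "@type": "Person",
            "givenName": author["first"],
            "familyName": author["last"],
            "identifier": author["github_id"],
        }],
        "dataset": [
            {"@type": "Dataset", "name": "irl", "distribution": resources},
            {"@type": "Dataset", "name": "output", "distribution": downloads},
        ],
        "description": meta["metadata"]["summary"],
    }


# Trimmed-down copy of the meta.yaml content shown earlier on this page.
meta = {
    "data": [
        {"name": "run_time", "values": [{"sim_time": "200000", "wall_time": "7464"}]},
        {"name": "memory_usage", "values": [{"unit": "KB", "value": "308224"}]},
        {
            "description": "free energy data",
            "url": "https://zenodo.org/api/files/ce1ca4a3-b6bc-4e2c-9b70-8fe45fc243fd/free-energy-9pt.csv",
        },
    ],
    "metadata": {
        "author": {"first": "Trevor", "last": "Keller", "github_id": "tkphd"},
        "summary": "HiPerC spinodal decomposition result using CUDA on a Tesla P100",
    },
}

catalog = to_data_catalog(meta)
print(json.dumps(catalog, indent=2))
```

Reading a real `meta.yaml` would need a YAML parser, which is not in the standard library. The point of the sketch is the mapping itself: resource usage becomes `PropertyValue` items, and each linked file becomes a `DataDownload`.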