ML Model Extension Specification

This document explains the ML Model Extension to the SpatioTemporal Asset Catalog (STAC) specification.

Scope & Vision

The goal of the STAC ML Model Extension is to provide a way of cataloging machine learning (ML) models that operate on Earth observation (EO) data described as a STAC catalog. The metadata related to machine learning models and their related artifacts (e.g. training data, performance metrics, etc.) can be extremely broad. This extension limits its scope to ML model metadata that aids in the discoverability and usability/reusability of these models for the following types of use-cases:

  • Adoption of Models in Analytic Pipelines

    Individuals and organizations hoping to incorporate ML model predictions into their own analytic pipelines need a way of discovering models that will work for a given geographic area, application domain, and type of input data and that produce a specific kind of output (e.g. object detection vs. classification). Consider the example of a global non-profit organization that wants to use ML to track deforestation. A data engineer from this organization might be interested in discovering segmentation models that accurately produce land cover classes over parts of South America using Sentinel 2 imagery. The STAC ML Model Extension aims to support this use-case by describing metadata related to the recommended area over which the model may be used, a description of the model architecture and the type of input data it requires, and links to containerized model images or model files that can be used to run the model to generate inferences.

  • Re-training of Existing Models in New Contexts

    The process of training ML models on Earth observation data can be extremely time-consuming and costly due to the volume of data required. Providing tools that ease the discovery of existing models and training data will make ML models more accessible by reducing this training burden. Suppose the non-profit from the previous example found a model that generated the kind of predictions they were interested in, but was not applicable to their region of interest. Rather than creating a new model from scratch, the organization might be interested in using transfer learning to re-train the existing model on data from their area of interest. In this case, they would need enough information about the training environment and model architecture to reproduce the model weights and continue training the model using new data. The STAC ML Model Extension aims to support this use-case by providing links to serialized versions of the model (e.g. a PyTorch checkpoint file) as well as enough detail about the training environment that a data scientist could reasonably implement transfer learning using new data.

  • Reproducibility of ML Experiments

    The ability to reproduce published ML experiments is crucial for verifying and building upon previous ML research. Increasingly, individuals and institutions are making an effort to publish code and examples along with academic publications to enable this kind of reproducibility. However, the quality and usability of this code and related documentation can vary widely, and there are currently no standards that ensure that a new researcher could reproduce a given set of published results from the documentation. The STAC ML Model Extension aims to address this issue by providing a detailed description of the training data and environment used in an ML model experiment.

Item Properties

| Field Name | Type | Description |
| --- | --- | --- |
| ml-model:type | string | REQUIRED. This MUST always be the constant "ml-model". The purpose of this field is to provide a convenient way to filter ML Model Items in a STAC API. |
| ml-model:learning_approach | string | REQUIRED. The learning approach used to train the model. It is STRONGLY RECOMMENDED that you use one of the values described below, but other values are allowed. |
| ml-model:prediction_type | string | REQUIRED. The type of prediction that the model makes. It is STRONGLY RECOMMENDED that you use one of the values described below, but other values are allowed. |
| ml-model:architecture | string | REQUIRED. Identifies the architecture employed by the model (e.g. RCNN, U-Net, etc.). This may be any string identifier, but publishers are encouraged to use well-known identifiers whenever possible. |
| ml-model:training-processor-type | string | The type of processor used during training. Must be one of "cpu" or "gpu". |
| ml-model:training-os | string | Identifies the operating system on which the model was trained. See the description below for recommended values. |

Additional Field Information

ml-model:learning_approach

Describes the learning approach used to train the model. It is STRONGLY RECOMMENDED that you use one of the following values, but other values are allowed.

  • "supervised"
  • "unsupervised"
  • "semi-supervised"
  • "reinforcement-learning"

ml-model:prediction_type

Describes the type of predictions made by the model. It is STRONGLY RECOMMENDED that you use one of the following values, but other values are allowed. Note that not all Prediction Type values are valid for a given Learning Approach.

  • "object-detection"
  • "classification"
  • "segmentation"
  • "regression"

ml-model:training-os

It is STRONGLY RECOMMENDED that one of the following operating system identifiers (taken from the Python sys.platform values) be used whenever possible:

  • aix
  • linux
  • win32
  • cygwin
  • darwin
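
Taken together, the properties of an ML Model Item using these fields might look like the following sketch (all values are hypothetical and shown only for illustration):

{
  "properties": {
    "ml-model:type": "ml-model",
    "ml-model:learning_approach": "supervised",
    "ml-model:prediction_type": "segmentation",
    "ml-model:architecture": "unet",
    "ml-model:training-processor-type": "gpu",
    "ml-model:training-os": "linux"
  }
}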

Asset Objects

Roles

| Role Name | Description |
| --- | --- |
| ml-model:inference-runtime | Represents a file containing instructions for running a containerized version of the model to generate inferences. See the Inference/Training Runtimes section below for details on related fields. |
| ml-model:training-runtime | Represents a file containing instructions for running a container to train the model. See the Inference/Training Runtimes section below for details on related fields. |
| ml-model:checkpoint | Represents a PyTorch checkpoint file that can be used to load the model (see the official PyTorch documentation for details). |

Inference/Training Runtimes

Assets with the ml-model:inference-runtime or ml-model:training-runtime role represent files containing instructions for running a containerized version of the model to generate inferences or to train the model, respectively. Currently, only Compose files are supported, but support is planned for other formats, including Common Workflow Language (CWL) and Workflow Description Language (WDL).

The "type" field should be used to indicate the format of this asset. Assets in the Compose format should have a "type" value of "text/x-yaml; application=compose".

While the Compose file defines nearly all of the parameters required to run the containerized model image, we still need a way to define which host directory containing input data should be mounted to the container and to which host directory the output predictions should be written. The Compose file MUST define volume mounts for input and output data using the Compose Interpolation syntax. The input data volume MUST be defined by an INPUT_DATA variable and the output data volume MUST be defined by an OUTPUT_DATA variable.

For example, the following Compose file snippet would mount the host input data directory to /var/data/input in the container and the host output data directory to /var/data/output in the container. In this contrived example, the script that runs the model takes 2 arguments: the path to the input data directory and the path to the output data directory.

services:
  ...
  model_runtime:
    ...
    volumes:
      - "${INPUT_DATA}:/var/data/input"
      - "${OUTPUT_DATA}:/var/data/output"
    command: "run-model.sh /var/data/input /var/data/output"

A user would then set the INPUT_DATA and OUTPUT_DATA environment variables when running the model. An example using docker-compose might look like the following:

$ INPUT_DATA=/local/path/to/model/inputs \
  OUTPUT_DATA=/local/path/to/model/outputs \
  docker-compose -f path/to/inference-runtime.yml up

It is RECOMMENDED that model publishers use the Asset description field to describe any other requirements or constraints for running the model container.

Relation types

The following types should be used as applicable rel types in the Link Object.

| Type | Description |
| --- | --- |
| ml-model:inferencing-image | Links with this relation type refer to Docker images that may be used to generate inferences using the model. The href value for links of this type should contain a fully-qualified URI for the image as would be required for a command like docker pull. These URIs should be of the form <registry_domain>/<user_or_organization_name>/<image_name>:<tag>. Links with this relation type should have a "type" value of "docker-image" to indicate a Docker image. |
| ml-model:training-image | Links with this relation type refer to Docker images that may be used to train the model. The href value for links of this type should contain a fully-qualified URI for the image as would be required for a command like docker pull. These URIs should be of the form <registry_domain>/<user_or_organization_name>/<image_name>:<tag>. Links with this relation type should have a "type" value of "docker-image" to indicate a Docker image. |
| ml-model:train-data | Links with this relation type refer to datasets used to train the model. It is STRONGLY RECOMMENDED that these links refer to a STAC Collection implementing the Label Extension. |
| ml-model:test-data | Links with this relation type refer to datasets used to test the model during training. It is STRONGLY RECOMMENDED that these links refer to a STAC Collection implementing the Label Extension. |
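
A hypothetical set of links using these relation types might look like the following sketch (the href values are placeholders):

{
  "links": [
    {
      "rel": "ml-model:inferencing-image",
      "href": "registry.example.com/example-org/example-model:v1",
      "type": "docker-image"
    },
    {
      "rel": "ml-model:train-data",
      "href": "https://example.com/catalog/training-data/collection.json",
      "type": "application/json"
    }
  ]
}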

Interpretation of STAC Fields

The semantics of ML model metadata can sometimes differ significantly from the use-cases for which STAC was originally intended (Earth observation data). We feel that the benefits of structuring this metadata as a STAC Extension outweigh the possible downsides, but it does require us to be specific about how certain STAC fields should be interpreted. The following definitions clarify the meaning of core fields from the STAC spec; for any fields not specifically defined here, please refer to the core STAC spec.

Spatiotemporal Fields

| Field Name | Type | Description |
| --- | --- | --- |
| geometry | GeoJSON Geometry Object | The geographic area over which the model may be used. Note that this may be the same as the area over which the model was trained, but could also represent additional areas where model performance has been tested or where the model publisher believes it will perform well based on similarities to the training environment. |
| start_datetime | string | The first or start date and time for images that should be used to generate inferences using the model. |
| end_datetime | string | The last or end date and time for images that should be used to generate inferences using the model. To represent an open interval (e.g. imagery from 2021-01-01T00:00:00Z or later), use the maximum value ("9999-12-31T23:59:59Z") for end_datetime. |
| datetime | string | This should always be null, since a date range (using start_datetime and end_datetime) will almost always be more appropriate. |

All other fields defined in the STAC Common Metadata documentation should be interpreted as referring to imagery that may be used for running the model to generate inferences.
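
As a hypothetical sketch, the spatiotemporal fields of an ML Model Item recommended for imagery acquired from 2021-01-01 onward might look like the following (the geometry coordinates are illustrative only, and other required Item fields are omitted for brevity):

{
  "type": "Feature",
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [[-63.0, -10.0], [-55.0, -10.0], [-55.0, -3.0], [-63.0, -3.0], [-63.0, -10.0]]
    ]
  },
  "properties": {
    "datetime": null,
    "start_datetime": "2021-01-01T00:00:00Z",
    "end_datetime": "9999-12-31T23:59:59Z"
  }
}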

Usage with Other STAC Extensions

It is RECOMMENDED that the following STAC Extensions be used in conjunction with the ML Model STAC Extension to fully describe geospatial ML models:

  • Scientific Citation Extension: This extension should be used to describe how the model should be cited in publications, as well as to reference any existing publications associated with the model.

Contributing

All contributions are subject to the STAC Specification Code of Conduct. For contributions, please follow the STAC specification contributing guide. Instructions for running tests are copied below for convenience.

Contributing Examples & Tutorials

All community members are encouraged to contribute their own examples of cataloged ML models. If you have a model that you have cataloged using this extension, please open a PR to include it in the examples directory. Here are some guidelines for contributing example catalogs:

  • New examples should go in their own sub-directory under the examples directory (e.g. examples/my-new-model).
  • All links and assets referenced in the catalog must be publicly available.
  • Include any supplementary files (model checkpoint files, etc.) that are not served publicly in the same directory as the model catalog and reference these using relative links.
  • If possible, please include a Collection (even if it only contains a single Item) with any relevant summaries.

Running tests

The same checks that run on PRs are part of the repository and can be run locally to verify that changes are valid. To run tests locally, you'll need npm, which is a standard part of any node.js installation.

First you'll need to install everything with npm once. Just navigate to the root of this repository and on your command line run:

npm install

Then to check markdown formatting and test the examples against the JSON schema, you can run:

npm test

This will produce the same output that you see in the online checks, and you can then go and fix your markdown or examples.

If the tests reveal formatting problems with the examples, you can fix them with:

npm run format-examples