This project shows how we can generate SLSA provenance for ML models on GitHub Actions and Google Cloud Platform.
SLSA was originally developed for traditional software to protect against tampering with builds, such as in the SolarWinds attack. This project is a proof of concept that the same supply chain protections can be applied to ML.
When users download a given version of a model, they can also check its provenance. This check can be integrated into model hubs and/or model serving platforms: for example, a model serving pipeline could validate provenance for all new models before serving them. Verification can also be done manually, on demand.
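As a rough illustration of the check a serving pipeline might run, the sketch below compares a downloaded model's SHA-256 digest against the `subject` entries of an in-toto statement, which is the envelope SLSA provenance uses. The function names are hypothetical and not part of this project. The sketch covers only the digest comparison; it does not verify the signature on the provenance or the builder identity, which a full verification must also do.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_matches_provenance(model_path: Path, statement: dict) -> bool:
    """Check that the model's digest is listed among the statement's subjects.

    `statement` is an already-parsed in-toto statement (hypothetical input:
    signature verification is assumed to have happened elsewhere).
    """
    actual = sha256_of_file(model_path)
    return any(
        subject.get("digest", {}).get("sha256") == actual
        for subject in statement.get("subject", [])
    )
```

A pipeline could call `model_matches_provenance(Path("pytorch_model.pth"), statement)` and refuse to serve the model when it returns `False`.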
As an additional benefit, having provenance for a model allows users to react to vulnerabilities in a training framework: they can quickly determine if a model needs to be retrained because it was created using a vulnerable version.
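The retraining decision can be sketched as a simple version comparison. This example assumes the framework versions have already been extracted from each model's provenance into a plain mapping; how they are recorded depends on the builder, so that extraction step is not shown. The function names, package versions, and the "fixed in" version are all made up for illustration.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Parse a purely numeric dotted version such as "2.11.1".

    Assumption: no pre-release or local suffixes. For real version strings,
    a dedicated parser such as the `packaging` library is a safer choice.
    """
    return tuple(int(part) for part in version.split("."))


def models_to_retrain(
    recorded: dict[str, dict[str, str]], package: str, fixed_in: str
) -> list[str]:
    """List models whose recorded `package` version is older than `fixed_in`."""
    threshold = parse_version(fixed_in)
    return sorted(
        model
        for model, deps in recorded.items()
        if package in deps and parse_version(deps[package]) < threshold
    )


if __name__ == "__main__":
    # Hypothetical versions, as if read from each model's provenance.
    recorded = {
        "tensorflow_model.keras": {"tensorflow": "2.9.1"},
        "tensorflow_hdf5_model.h5": {"tensorflow": "2.12.0"},
        "pytorch_model.pth": {"torch": "2.1.0"},
    }
    print(models_to_retrain(recorded, "tensorflow", "2.11.1"))
```

With the made-up data above, only `tensorflow_model.keras` is flagged, since it is the only model recorded with a `tensorflow` version older than the hypothetical fix.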
See the guides for GitHub Actions and Google Cloud Platform for details.
We support both TensorFlow and PyTorch models. The example repo trains a model on the CIFAR10 dataset, saves it in one of the supported formats, and generates provenance for the output. The supported formats are:
| Workflow Argument | Training Framework | Model format |
|---|---|---|
| tensorflow_model.keras | TensorFlow | Keras format (default) |
| tensorflow_hdf5_model.h5 | TensorFlow | Legacy HDF5 format |
| tensorflow_hdf5.weights.h5 | TensorFlow | Legacy HDF5 weights only format |
| pytorch_model.pth | PyTorch | PyTorch default format |
| pytorch_full_model.pth | PyTorch | PyTorch complete model format |
| pytorch_jitted_model.pt | PyTorch | PyTorch TorchScript format |
Most ML models are currently too expensive to train in this setup. Future work will cover training ML models that require access to accelerators (e.g., GPUs, TPUs) or that take multiple hours to train.
Our examples target GitHub Actions and Tekton on GCP; we aim to add support for other platforms (e.g., GCB and GitLab) and model training environments.
TensorFlow also supports saving models in the SavedModel format. This is a
directory-based serialization format, which we don't fully support yet. We can
generate SLSA provenance for all the files in the directory, but there are
caveats regarding verification. Furthermore, the hashes recorded in the
provenance differ from the hash generated during model signing. We have
therefore decided to add support for directory-based model formats at a future
time, after standardizing a way to generate and verify provenance for them in
SLSA (in general, not just for ML).
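
To make the per-file approach concrete, the sketch below walks a model directory and builds one in-toto style subject entry (`name` plus `digest`) per file. The function name is hypothetical, and this only illustrates the general idea; it does not reproduce how this project or any model signing tool computes its hashes.

```python
import hashlib
from pathlib import Path


def directory_subjects(model_dir: Path) -> list[dict]:
    """Build one subject entry per file under `model_dir`.

    Entries are sorted by relative path so the output is deterministic.
    Files are read whole for brevity; large files would be better hashed
    in chunks.
    """
    files = sorted(
        (path.relative_to(model_dir).as_posix(), path)
        for path in model_dir.rglob("*")
        if path.is_file()
    )
    return [
        {"name": name, "digest": {"sha256": hashlib.sha256(path.read_bytes()).hexdigest()}}
        for name, path in files
    ]
```

Because this produces many per-file digests rather than a single digest for the whole directory, a verifier has to decide what a match means when files are added, removed, or renamed, which is one way verification caveats can arise.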