CEP for the OCI storage of conda packages & repodata #70

Open
wants to merge 13 commits into
base: main
from
91 changes: 73 additions & 18 deletions cep-oci.md
@@ -14,30 +14,33 @@ An OCI manifest is referenced by a name and a tag.

For further details, please refer to the official [OCI Distribution spec](https://github.com/opencontainers/distribution-spec/blob/v1.0/spec.md#definitions).

### Conda package artifacts on an OCI registry

The manifest for a conda package on an OCI registry should look as follows.
### Layers

It should have a name and a tag. The name is `<channel>/<subdir>/<package-name>`.
The tag is the version and build string of the package, using a `-` as a separator.
Each layer must be a [descriptor](https://github.com/opencontainers/image-spec/blob/main/descriptor.md#properties) containing at least the 3 required fields:

For example, a package like `xtensor-0.10.4-h431234.conda` would map to `conda-forge/linux-64/xtensor:0.10.4-h431234` on an OCI registry.
- The `mediaType` of the referenced content.
- The `digest` of the targeted content.
- The `size` of the raw content (in bytes).
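
The three required fields can be derived directly from the blob content. The following is a minimal sketch, assuming SHA-256 digests written as `sha256:<hex>`; the helper name `make_descriptor` is hypothetical and not part of this CEP.

```python
import hashlib
import json


def make_descriptor(content: bytes, media_type: str) -> dict:
    """Build a minimal OCI descriptor for a blob held in memory."""
    return {
        # mediaType of the referenced content.
        "mediaType": media_type,
        # Digest written as "<algorithm>:<hex>"; SHA-256 is assumed here.
        "digest": "sha256:" + hashlib.sha256(content).hexdigest(),
        # Size of the raw content, in bytes.
        "size": len(content),
    }


if __name__ == "__main__":
    blob = b'{"name": "xtensor"}'
    descriptor = make_descriptor(blob, "application/vnd.conda.info.index.v1+json")
    print(json.dumps(descriptor, indent=2))
```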

### Layers
### MediaTypes

A conda package, in an OCI registry, should ship up to 3 layers:
MediaTypes already defined by the OCI image spec are described [here](https://github.com/opencontainers/image-spec/blob/main/media-types.md#oci-image-media-types).

- The package data itself, as a tarball. (mandatory)
  - This can be either a `.tar.bz2` (v1) or a `.conda` (v2) file, or both as separate layers.
- The package `info` folder as a gzipped "tar.gz" file.
- The package `info/index.json` file as a plain JSON file.
Custom mediaTypes defined for the conda channels use case are as follows:

The mediaType for the different layers is as follows:
| Blob type | Content type | mediaType |
|------------------|---------------------------|---------------------------------------------|
| conda package | .tar.bz2 package | application/vnd.conda.package.v1 |
| conda package | .conda package | application/vnd.conda.package.v2 |
| package info | `info` folder as gzip | application/vnd.conda.info.v1.tar+gzip |
| package info | `index.json` file | application/vnd.conda.info.index.v1+json |
| repodata | `repodata.json` file | application/vnd.conda.repodata.v1+json |
| repodata | `repodata.json.zst` file | application/vnd.conda.repodata.v1+json+zst |
| repodata | `repodata.json.gz` file | application/vnd.conda.repodata.v1+json+gzip |
| repodata | `repodata.json.bz2` file | application/vnd.conda.repodata.v1+json+bz2 |
| repodata | `repodata.json.jlap` file | application/vnd.conda.jlap.v1 |

- for a .tar.bz2 package, the mediaType is `application/vnd.conda.package.v1`
- for a .conda package, the mediaType is `application/vnd.conda.package.v2`
- for the `info` folder as gzip the mediaType is `application/vnd.conda.info.v1.tar+gzip`
- for the `index.json` file the mediaType is `application/vnd.conda.info.index.v1+json`
If needed, more mediaTypes could be specified (e.g. `application/vnd.conda.info.v1.tar+zst`).

Using the `mediaType` field in the manifest, we can find the layer + SHA256 hash to pull the corresponding blob.
Each `mediaType` should only be present in one layer.
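
As an illustration of the lookup described above, a client could index the manifest layers by `mediaType` and reject manifests where a `mediaType` appears twice. This is a sketch only; the function name `index_layers` and the placeholder digests are hypothetical.

```python
def index_layers(manifest: dict) -> dict:
    """Map each layer's mediaType to its digest, rejecting duplicates."""
    by_type = {}
    for layer in manifest.get("layers", []):
        media_type = layer["mediaType"]
        if media_type in by_type:
            # The text above asks for each mediaType to appear in one layer only.
            raise ValueError(f"mediaType appears in more than one layer: {media_type}")
        by_type[media_type] = layer["digest"]
    return by_type


if __name__ == "__main__":
    # Placeholder digests, for illustration only.
    manifest = {
        "layers": [
            {"mediaType": "application/vnd.conda.package.v2", "digest": "sha256:aaaa", "size": 10},
            {"mediaType": "application/vnd.conda.info.index.v1+json", "digest": "sha256:bbbb", "size": 5},
        ]
    }
    print(index_layers(manifest)["application/vnd.conda.package.v2"])
```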
@@ -51,7 +54,7 @@ On an OCI registry it should be stored under `<channel>/<subdir>/repodata.json`.
The repodata file should have one entry that has the `latest` tag. This entry should point to the latest version of the repodata.
All versions of the repodata should also be tagged with a UTC timestamp of the following format: `YYYY.MM.DD.HH.MM.SS`, e.g. `2024.04.12.07.06.32`.
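
For illustration, the timestamp tag can be produced from a timezone-aware datetime as sketched below. The helper name `repodata_tag` is hypothetical; the format string follows the `YYYY.MM.DD.HH.MM.SS` pattern given above.

```python
from datetime import datetime, timedelta, timezone


def repodata_tag(moment: datetime) -> str:
    """Format a timezone-aware datetime as a UTC repodata tag."""
    if moment.tzinfo is None:
        # A naive datetime would be interpreted as local time, so reject it.
        raise ValueError("moment must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y.%m.%d.%H.%M.%S")


if __name__ == "__main__":
    # 09:06:32 at UTC+02:00 is 07:06:32 UTC.
    local = datetime(2024, 4, 12, 9, 6, 32, tzinfo=timezone(timedelta(hours=2)))
    print(repodata_tag(local))
```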

The mediaType for the raw `repodata.json` file is `application/vnd.conda.repodata.v1+json`. However, for large repositories it's advised to store the `zstd` encoded repodata file with the mediaType `application/vnd.conda.repodata.v1+json+zstd` as an additional layer in `<channel>/<subdir>/repodata.json`. ([ref](https://github.com/opencontainers/image-spec/blob/main/layer.md#gzip-media-types))
The mediaType for the raw `repodata.json` file is `application/vnd.conda.repodata.v1+json`. However, for large repositories it's advised to store the `zstd` encoded repodata file with the mediaType `application/vnd.conda.repodata.v1+json+zst` as an additional layer in `<channel>/<subdir>/repodata.json`. ([ref](https://github.com/opencontainers/image-spec/blob/main/layer.md#gzip-media-types))

Other encodings are also accepted:

@@ -64,6 +67,22 @@ For `jlap`, the following mediaType is used:

The `jlap` file should also be stored under the `<channel>/<subdir>/repodata.json` path as an additional layer.

### Conda package artifacts on an OCI registry

The manifest for a conda package on an OCI registry should look as follows.

It should have a name and a tag. The name is `<channel>/<subdir>/<package-name>`.
The tag is the version and build string of the package, using a `-` as a separator.

For example, a package like `xtensor-0.10.4-h431234.conda` would map to `conda-forge/linux-64/xtensor:0.10.4-h431234` on an OCI registry.

A conda package, in an OCI registry, should ship up to 3 layers:

- The package data itself, as a tarball. (mandatory)
  - This can be either a `.tar.bz2` (v1) or a `.conda` (v2) file, or both as separate layers.
- The package `info` folder as a gzipped "tar.gz" file.
- The package `info/index.json` file as a plain JSON file.

### Mapping a conda-package to the OCI registry

A given conda-package is identified by a URL like `<subdir>/<package-name>-<version>-<build>.<ext>` where `<subdir>` is the platform and architecture, `<package-name>` is the name of the package, `<version>` is the version of the package, `<build>` is the build string of the package, and `<ext>` is the extension of the package file.
@@ -96,3 +115,39 @@ Some characters that are used in the conda-forge repository as part of the build
- `+` is replaced by `__p__`
- `!` is replaced by `__e__`
- `=` is replaced by `__eq__`
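
The mapping could be sketched as follows. This is a hedged illustration rather than the reference implementation: the function names are hypothetical, only the three substitutions listed above are implemented (the diff hides part of this section, so further rules may exist), and applying them to both the version and the build string is an assumption.

```python
# Only the three substitutions visible in this section.
REPLACEMENTS = (("+", "__p__"), ("!", "__e__"), ("=", "__eq__"))
EXTENSIONS = (".tar.bz2", ".conda")


def escape(text: str) -> str:
    """Replace characters that are not accepted in an OCI tag."""
    for char, replacement in REPLACEMENTS:
        text = text.replace(char, replacement)
    return text


def oci_reference(channel: str, subdir: str, filename: str) -> str:
    """Map '<package-name>-<version>-<build>.<ext>' to '<name>:<tag>'."""
    for ext in EXTENSIONS:
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        raise ValueError(f"unsupported package extension: {filename}")
    # Package names may contain '-', so split from the right.
    name, version, build = stem.rsplit("-", 2)
    return f"{channel}/{subdir}/{name}:{escape(version)}-{escape(build)}"


if __name__ == "__main__":
    print(oci_reference("conda-forge", "linux-64", "xtensor-0.10.4-h431234.conda"))
```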

#### Authentication

Pulling a public image from a Container registry can be done anonymously ([ref](https://docs.github.com/en/packages/learn-github-packages/about-permissions-for-github-packages#visibility-and-access-permissions-for-packages)).

A token can be requested with `pull` scope, using the following URL:
`https://ghcr.io/token?scope=repository:<org>/<channel-name>/<subdir>/<package-name-or-repodata.json>:pull`

Note that in the case of pulling repodata, the name `repodata.json` is always used in the URL regardless of the encoding.
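
A minimal sketch of building the token URL and requesting an anonymous token is given below. The function names are hypothetical, and reading the token from a `token` field of the JSON response is an assumption about the registry's response format, not something this CEP specifies.

```python
import json
import urllib.request


def token_url(org: str, channel: str, subdir: str, name: str, registry: str = "ghcr.io") -> str:
    """Build the URL used to request a token with `pull` scope."""
    repository = f"{org}/{channel}/{subdir}/{name}"
    return f"https://{registry}/token?scope=repository:{repository}:pull"


def request_token(url: str) -> str:
    """Fetch an anonymous pull token (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        # Assumption: the response body is JSON with a "token" field.
        return json.load(response)["token"]


if __name__ == "__main__":
    # For repodata, the name is always `repodata.json`, whatever the encoding.
    print(token_url("channel-mirrors", "conda-forge", "linux-64", "repodata.json"))
```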

#### Implementation (conda / mamba / rattler)

##### mamba

In order to fetch packages from an OCI registry, we need to set one or more mirrors for the channel to be used (e.g. `conda-forge`).
This can be done in the rc file as follows:

```
mirrored_channels:
  conda-forge: ["oci://ghcr.io/channel-mirrors/conda-forge"]
```

When a user requests installing a package (with the configuration set above, and using the `conda-forge` channel), a set of requests to fetch `repodata.json` is first performed as follows:

Why are we marking these as mirrored instead of making them first-class things via using the normal channels config setting with an oci:// prefix to mark OCI channels?

Contributor Author

I believe this is orthogonal - you should be able to use the oci:// URL as your only channel as well.

Then why do we need the mirrored_channels option at all?

Contributor Author

Maybe not in this CEP? Specifying how we expect channel mirroring to work could have merits. Specifically, we should define where the repodata.json file comes from when we have multiple mirrors.

Ideally we would also define how we combine download counts from multiple mirrors (but all that would definitely belong in a different CEP).

Yeah we should handle mirroring in another CEP I would think.

Yes it's orthogonal. I changed it to not mention mirrors anymore.


- A token is requested to anonymously pull `repodata.json` using the following URL:\
`https://ghcr.io/token?scope=repository:channel-mirrors/conda-forge/<subdir>/repodata.json:pull`
- The manifest is then pulled using `https://ghcr.io/v2/channel-mirrors/conda-forge/<subdir>/repodata.json/manifests/<reference>`.\
`<reference>` is always set to `latest` in `mamba`.\
This is also where the repodata file encoding is handled (checking `mediaType` field in the layers).\
In `mamba`, `zstd` encoding has priority if present, otherwise, raw `repodata.json` is picked, and the corresponding SHA256 hash is set for the next step.
- Repodata blob is then downloaded using:\
`https://ghcr.io/v2/channel-mirrors/conda-forge/<subdir>/repodata.json/blobs/sha256:<HASH>`

Then, to fetch the package itself, and using the same token, the corresponding blob is downloaded using:
`https://ghcr.io/v2/channel-mirrors/conda-forge/<subdir>/<package-name>/blobs/sha256:<HASH>`
where `<HASH>` is the SHA256 hash of the requested package, retrieved from `repodata.json`.
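
The request sequence above can be sketched as follows. This is an illustration under stated assumptions, not mamba's actual code: all function names are hypothetical, the HTTP transport is injected as an `http_get(url, token)` callable so the flow can be exercised without network access, and the token is assumed to be returned in a `token` JSON field.

```python
import json
from typing import Callable, Optional, Tuple

REGISTRY = "https://ghcr.io"
ZST = "application/vnd.conda.repodata.v1+json+zst"
RAW = "application/vnd.conda.repodata.v1+json"

# http_get(url, token) returns the response body; token is None for the token request.
HttpGet = Callable[[str, Optional[str]], bytes]


def manifest_url(repo: str, reference: str = "latest") -> str:
    return f"{REGISTRY}/v2/{repo}/manifests/{reference}"


def blob_url(repo: str, digest: str) -> str:
    # `digest` already carries the "sha256:" prefix, as in the manifest.
    return f"{REGISTRY}/v2/{repo}/blobs/{digest}"


def pick_repodata_layer(manifest: dict) -> dict:
    """Prefer the zstd-encoded layer, falling back to raw repodata.json."""
    layers = {layer["mediaType"]: layer for layer in manifest["layers"]}
    for media_type in (ZST, RAW):
        if media_type in layers:
            return layers[media_type]
    raise LookupError("no supported repodata layer in manifest")


def fetch_repodata(subdir: str, http_get: HttpGet,
                   mirror: str = "channel-mirrors/conda-forge") -> Tuple[str, bytes]:
    repo = f"{mirror}/{subdir}/repodata.json"
    # 1. Request an anonymous pull token.
    token_body = http_get(f"{REGISTRY}/token?scope=repository:{repo}:pull", None)
    token = json.loads(token_body)["token"]  # assumed response field
    # 2. Pull the manifest for the `latest` tag and choose a layer.
    layer = pick_repodata_layer(json.loads(http_get(manifest_url(repo), token)))
    # 3. Download the blob referenced by the chosen layer.
    return layer["mediaType"], http_get(blob_url(repo, layer["digest"]), token)
```

A real `http_get` would typically send the token as an `Authorization: Bearer <token>` header and request the manifest with an OCI manifest `Accept` header. A package blob can be fetched the same way by passing the `sha256` taken from `repodata.json` to `blob_url`.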