CEP for the OCI storage of conda packages & repodata #70

Open
wants to merge 13 commits into
base: main
62 changes: 62 additions & 0 deletions cep-oci.md
@@ -0,0 +1,62 @@
# OCI registries as conda channels

We want to use OCI registries as storage for conda packages. This CEP specifies how we lay out conda packages on an OCI registry.

## Specification
Contributor

The distribution-spec contains useful definitions for these terms:

https://github.com/opencontainers/distribution-spec/blob/v1.0/spec.md#definitions

Contributor Author

I linked it :)


An OCI artifact consists of a manifest and a set of blobs. The manifest is a JSON document that describes the contents of the artifact. The blobs are the actual data that the manifest refers to. The registry stores both the manifest and the blobs, each addressed by its content digest.

The manifest consists of some metadata and a number of "layers". Each layer is a reference to a blob.

Layers can have arbitrary names and mediaTypes.

An OCI manifest is referenced by a name and a tag.

### Conda package artifacts on an OCI registry

The manifest for a conda package on an OCI registry should look as follows.

It should have a name and a tag. The name is `<channel>/<subdir>/<package-name>`.
The tag is the version and build string of the package, using a `-` as a separator.

For example, a package like `xtensor-0.10.4-h431234.conda` would map to the OCI reference `conda-forge/linux-64/xtensor:0.10.4-h431234`.
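
The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name `package_to_oci_reference` is hypothetical, and the sketch does no escaping of characters that are not valid in an OCI tag or repository name, which this section does not define.

```python
def package_to_oci_reference(channel: str, subdir: str, filename: str) -> str:
    """Map a conda package filename to `<channel>/<subdir>/<name>:<version>-<build>`."""
    for ext in (".conda", ".tar.bz2"):
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        raise ValueError(f"not a conda package filename: {filename!r}")

    # Package names may themselves contain '-', so split from the right:
    # the last two fields are always the version and the build string.
    name, version, build = stem.rsplit("-", 2)
    return f"{channel}/{subdir}/{name}:{version}-{build}"


print(package_to_oci_reference("conda-forge", "linux-64", "xtensor-0.10.4-h431234.conda"))
# prints: conda-forge/linux-64/xtensor:0.10.4-h431234
```

Splitting from the right keeps names such as `scikit-learn` intact, since neither the version nor the build string of a conda package contains a `-`.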

### Layers
Contributor

A table with the different layer mediaTypes and the expected contents would make this section super easy to read.

Contributor Author

Feel free to format and push to this PR! :) This is a collaborative document


A conda package, in an OCI registry, should ship up to 3 layers:

- The package itself, as a tarball. (mandatory)
Contributor

You mean the contents of the package, or the compressed artifact (tarball or not)? In .conda the outer layer is actually a ZIP file, and the inner ones are zstd tarballs. Would be helpful to clarify.

Contributor Author

Right, should leave the tarball out of the sentence. It's the package data itself.

Contributor

Also, some artifacts in defaults are available both as .tar.bz2 and .conda. Should we restrict one package layer for each label? I don't think we need to.

Contributor Author

Everything would work fine if there are two layers; they just should not have the same mediaType.

- The package `info` folder as a gzipped "tar.gz" file.
- The package `info/index.json` file as a plain JSON file.

The mediaType for the different layers is as follows:

- for a .tar.bz2 package, the mediaType is `application/vnd.conda.package.v1`
- for a .conda package, the mediaType is `application/vnd.conda.package.v2`
- for the `info` folder as gzip the mediaType is `application/vnd.conda.info.v1.tar+gzip`
Contributor

The info folder can get heavy with licenses, test files and what not. Will we allow .tar+zstd, for example?

Contributor Author

Since we have already uploaded the whole of conda-forge with the gzip one, I would suggest keeping it like that for now. We could move to a zstd-encoded one in the future.

Or we could just allow for any encoding (+gzip or +zstd)?

Contributor

It would be nice to be able to use the unmodified info-pip-23.3.1-py312haa95532_0.tar.zst out of the .conda

Contributor Author

Indeed. I used the tar.gz approach because of how easy it is to open and explore it with pure Python. But the unmodified one makes sense as well. We could specify that application/vnd.conda.info.v1.tar+zst will also be accepted?

Contributor

Makes sense.
.tar.gz just seemed odd: although gzip is very ordinary, it hasn't been used in conda packaging.

wolfv marked this conversation as resolved.
- for the `index.json` file the mediaType is `application/vnd.conda.info.index.v1+json`

Using the `mediaType` field in the manifest, we can find the layer + SHA256 hash to pull the corresponding blob.
Each `mediaType` should only be present in one layer.
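
As a rough sketch of that lookup, the snippet below picks a layer out of a manifest by its `mediaType`. The manifest is a hand-written stand-in with placeholder digests and sizes, and `find_layer` is a hypothetical helper name, not part of any existing client.

```python
PACKAGE_V2 = "application/vnd.conda.package.v2"
INFO_TAR_GZ = "application/vnd.conda.info.v1.tar+gzip"
INFO_INDEX = "application/vnd.conda.info.index.v1+json"

# Hand-written example manifest; digests and sizes are placeholders.
manifest = {
    "schemaVersion": 2,
    "layers": [
        {"mediaType": PACKAGE_V2, "digest": "sha256:" + "a" * 64, "size": 123456},
        {"mediaType": INFO_TAR_GZ, "digest": "sha256:" + "b" * 64, "size": 2345},
        {"mediaType": INFO_INDEX, "digest": "sha256:" + "c" * 64, "size": 345},
    ],
}


def find_layer(manifest: dict, media_type: str):
    """Return the single layer with the given mediaType, or None if absent."""
    matches = [
        layer for layer in manifest.get("layers", [])
        if layer.get("mediaType") == media_type
    ]
    if len(matches) > 1:
        # The text above asks for each mediaType to appear in one layer only.
        raise ValueError(f"mediaType {media_type!r} appears in {len(matches)} layers")
    return matches[0] if matches else None


layer = find_layer(manifest, INFO_INDEX)
print(layer["digest"])  # the blob to pull for info/index.json
```

A client that only needs `index.json` can fetch the small blob named by that digest and skip the package layer entirely.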

## Repodata on OCI registries

The `repodata.json` file is a JSON file that contains metadata about the packages in a channel.
It is used by conda to find packages in a channel.

On an OCI registry it should be stored under `<channel>/<subdir>/repodata.json`.
The repodata file should have one entry that has the `latest` tag. This entry should point to the latest version of the repodata.
All versions of the repodata should also be tagged with a timestamp of the following format: `YYYY.MM.DD.HH.MM`, e.g. `2024.04.12.07.06`.
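
A minimal sketch of producing and parsing such timestamp tags with Python's standard library follows. The helper names are hypothetical, and the time zone is left to the caller because the text above does not specify one.

```python
from datetime import datetime

# strftime/strptime pattern for the `YYYY.MM.DD.HH.MM` tag format.
TAG_FORMAT = "%Y.%m.%d.%H.%M"


def repodata_tag(when: datetime) -> str:
    """Format a datetime as a repodata tag (zero-padded, minute resolution)."""
    return when.strftime(TAG_FORMAT)


def parse_repodata_tag(tag: str) -> datetime:
    """Parse a repodata tag back into a naive datetime."""
    return datetime.strptime(tag, TAG_FORMAT)


print(repodata_tag(datetime(2024, 4, 12, 7, 6)))
# prints: 2024.04.12.07.06
```

Because every field is zero-padded and ordered from year to minute, sorting the tags as plain strings also sorts them chronologically.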

The mediaType for the raw `repodata.json` file is `application/vnd.conda.repodata.v1+json`. However, for large repositories it's advised to store the `zstd` encoded repodata file with the mediaType `application/vnd.conda.repodata.v1+json+zst`.

Other encodings are also accepted:

- `application/vnd.conda.repodata.v1+json+gzip`
- `application/vnd.conda.repodata.v1+json+bz2`

For `jlap`, the following mediaType is used:

- `application/vnd.conda.jlap.v1`

The `jlap` file should also be stored under the `<channel>/<subdir>/repodata.json` path as an additional layer.
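
One way a client might handle the repodata encodings listed above is to dispatch on the layer's `mediaType`. The sketch below uses only the Python standard library, so it decodes the plain, gzip, and bz2 variants and leaves zstd to whichever zstd library the client chooses. The function name `decode_repodata` is hypothetical.

```python
import bz2
import gzip
import json

BASE = "application/vnd.conda.repodata.v1+json"

# mediaType -> function turning the blob bytes into raw JSON bytes.
DECODERS = {
    BASE: lambda blob: blob,
    BASE + "+gzip": gzip.decompress,
    BASE + "+bz2": bz2.decompress,
}


def decode_repodata(media_type: str, blob: bytes) -> dict:
    """Decode a repodata blob according to its layer mediaType."""
    if media_type == BASE + "+zst":
        # Assumption: zstd support is provided by a library chosen by the
        # client; it is deliberately left out of this stdlib-only sketch.
        raise NotImplementedError("zstd decoding is not part of this sketch")
    try:
        decoder = DECODERS[media_type]
    except KeyError:
        raise ValueError(f"unsupported repodata mediaType: {media_type!r}") from None
    return json.loads(decoder(blob))


raw = json.dumps({"packages": {}, "packages.conda": {}}).encode()
print(decode_repodata(BASE + "+gzip", gzip.compress(raw)))
```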