Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CEP for repodata hotfixing #88

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
246 changes: 246 additions & 0 deletions cep-999.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,246 @@
# Repodata hotfixing

<table>
<tr><td> Title </td><td> Repodata hotfixing - documenting today's state</td>
<tr><td> Status </td><td> Proposed </td></tr>
<tr><td> Author(s) </td><td> Michael Sarahan &lt;[email protected]&gt;</td></tr>
<tr><td> Created </td><td> August 6, 2024</td></tr>
<tr><td> Updated </td><td> TBD </td></tr>
<tr><td> Discussion </td><td>https://github.com/conda-incubator/ceps/pull/54</td></tr>
<tr><td> Implementation </td><td>https://github.com/prefix-dev/rattler-build</td></tr>
</table>

## Abstract

It is impossible to have completely correct repository metadata for all time.
New dependencies come along, unexpected API breaks happen, schemes to exclude
packages end up bringing in old packages, and things generally break for confusing
reasons. In the conda ecosystem, repodata comes from packages, but resides in an
aggregate location, repodata.json. This gives us the opportunity to patch
repodata in between the time that the repodata is loaded from the package, and
when the repodata.json is served to the user. This hotfixing capability avoids
the need to remove content that otherwise breaks things, sometimes across wide
swaths of a package ecosystem.

## Mechanics

This has been in practice since 2018, with the work landing in conda-build in
https://github.com/conda/conda-build/pull/3091. The general principle is that
there are "instructions" that take effect for specific package names. This was historically
generated with python code that acts on the original repodata.json. For example,
conda-forge has [many instructions](https://github.com/conda-forge/conda-forge-repodata-patches-feedstock/blob/main/recipe/gen_patch_json.py).

The patches-as-code worked fine for Anaconda, where the notion of hotfixing
originated. However, Anaconda could not offer to run code from arbitrary users
and organizations to do hotfixing for them. For this reason, instructions as
JSON files were created. Conda-forge could hotfix their repodata by providing a
JSON file of instructions, which Anaconda would consume when generating the
index for conda-forge's packages. The actual hotfixing has historically taken
place when mirroring an organization's packages to a CDN. This is a special
service that Anaconda offers to some especially large users, because it reduces
Anaconda's bandwidth costs, but raises the upkeep maintenance for them. Any of
these hotfix instructions assume that conda-index is being used for generating
the index. The web service behind anaconda.org does not run conda-index, and
doesn't currently have a means of hotfixing repodata.

Conda-forge created a package that generated patch instructions files when the
recipe was built. Conda-forge then informed Anaconda of the name of that
package, and Anaconda configured the mirroring process to download the
patch-containing package, and provide it to conda-index.

Over time, [the python file that conda-forge used to generate patch
instructions](https://github.com/conda-forge/conda-forge-repodata-patches-feedstock/blob/main/recipe/gen_patch_json.py)
became unwieldy. Matthew Becker added [a more manageable system to specify patches using YAML](https://github.com/conda-forge/conda-forge-repodata-patches-feedstock/pull/498). That system is proposed as the standard way to specify patches going forward.



## Specifications

## Patch instructions

Patch instructions JSON files look like:

```json
instructions = {
"patch_instructions_version": 1,
"packages": {
"mypackage-1.0.tar.bz2": {
"depends": ["c", "d", "e"]
}
},
"revoke": [],
"remove": [],
}
```

The schema for valid values in the patch instructions matches the schema for
repodata.json itself. Patching means simply replacing the content in
repodata.json with content in the patch. There is no notion of appending or
modifying data; only replacing it.


## Repodata patch YAML specification

This specification is copied wholesale from conda-forge, where it was developed

The files in this directory are used to construct repodata patches for conda-forge.
Typically, a single feedstock will have a single YAML file specifying the patches
for the packages produced by that feedstock. Use comments liberally to describe
why the patch exists.

Patches are specified by two main blocks.

- The `if` block specifies a set of conditions under which the changes in the `then` block are applied.
- The different conditions in the `if` block are combined with a logical `AND`.
- Any condition may be prefixed by `not_` and will be negated.
- The `if` conditions can use shell glob syntax as implemented in the python `fnmatch` module in the
standard library. The optional "?( *)" pattern from extended glob syntax is allowed to match zero or
one sequences of spaces plus any other characters.
- The `then` section uses the Python `string.Template` system to allow the `version`, `build_number`, `name`, or
`subdir` values to be inserted into strings via templates (e.g., `"blah <=${version}"`) at runtime.
- Multiple patches can be in the same file using separate YAML documents (i.e., separate the data by `---`
on a new line).

```yaml
if:
# possible conditions
# list of subdirs or a single subdir (e.g., "linux-64")
subdir_in: linux-64
# any subdir but linux-64
not_subdir_in: linux-64

# list of artifact names or a single name (e.g., "ngmix-2.3.0-py38h50d1736_1.conda")
artifact_in: ngmix-2.3.0-py38h50d1736_1.conda

# any key in the repodata entry (e.g., "version" or "build_number") with an operation
<repodata key>_<ge, gt, le, lt>: <value>
# this means version > 1.0.0
version_gt: 1.0.0
# keeps any record with timestamp < value
# you can generate the current time via
# python -c "import time; print(f'{time.time():.0f}000')"
timestamp_lt: 1633470721000

# any key in the repodata entry (e.g., "version" or "build_number") and a list of values or single value
<repodata key>_in: <list or single item>

# this means the build number is in the set {0, 1, 2}
build_number_in: [0, 1, 2]

# has specific dependencies as either a list or a single string
has_depends: numpy* # matches 'numpy', 'nump-blah', or 'numpy 5.6'
has_depends: numpy?( *) # matches 'numpy' or 'numpy 5.6' but not 'numpy-blah'
has_depends: numpy # matches "numpy" exactly (i.e., no pins)

# has specific constraints as either a list or a single key
has_constrains: numpy* # matches 'numpy', 'nump-blah', or 'numpy 5.6'
has_constrains: numpy?( *) # matches 'numpy' or 'numpy 5.6' but not 'numpy-blah'
has_constrains: numpy # matches "numpy" exactly (i.e., no pins)

# single value for a key that should match
<repodata key>: <value>
version: 1.0.0
then:
# list of instructions to change things

# add to the depends or constrains section of the repodata
# this function will not add items already present in the record
- add_<depends or constrains>: <list of str or single str>
# you can use data from the record being patched like this
# only name, version, build_number and subdir are supported
- add_depends: mypackage <=${version}

# remove from the depends or constrains sections of the repodata
- remove_<depends or constrains>: <list of str or single str>

# remove entries from track_features
- remove_track_features: <list of str or str>

# add entries to track_features
- add_track_features: <list of str or str>

# reset the depends or constrains section of the repodata
# this function resets the depends or constrains to the specified value(s)
- reset_<depends or constrains>: <list of str or single str>
# you can use data from the record being patched like this
# only name, version, build_number and subdir are supported
- reset_depends: mypackage <=${version}

# replace entries via an exact match in either the depends or constrains sections
- replace_<depends or constrains>:
# str of thing to be replaced
old: matplotlib ==1.3.0
# thing to replace `old` with
new: matplotlib-base ==1.4.0
# globs are allowed in the "old" field so * needs to be escaped via [*]
- replace_<depends or constrains>:
# str of thing to be replaced
old: matplotlib 1.3.[*] # matches matplotlib 1.3.* exactly
# thing to replace `old` with
new: matplotlib-base ==1.4.0
- replace_<depends or constrains>:
# str of thing to be replaced
old: matplotlib 1.3.* # matches matplotlib 1.3.0, matplotlib 1.3, etc.
# thing to replace `old` with
new: matplotlib-base ==1.4.0
- replace_<depends or constrains>:
# str of thing to be replaced
old: matplotlib ==1.3.0
# thing to replace `old` with
new: ${old},<1.4.0 # you can refer to the "old" value as well

# rename a dependency - this preserves the version information and simply renames the package
- rename_<depends or constrains>:
# str of thing to be renamed
old: matplotlib
# new name for thing
new: matplotlib-base

# relax an exact pin (e.g., blah ==1.0.0) to something like blah >=1.0.0 and possibly with
# `,<2.0a0` added if max_pin='x'
- relax_exact_depends:
# the package name whose constraint should be relaxed
name: matplotlib
# optional string of 'x', 'x.x' etc. format specify an upper bound
# if not given, no upper bound is applied
# max_pin: 'x.x'

# make a dependency version constraint stricter
- tighten_depends:
# package to pin stricter
name: matplotlib # this field can use the fnmatch glob syntax
# you must give one of max_pin or upper_bound
# optional way to specify the new maximum pin as 'x', 'x.x', etc.
max_pin: 'x.x'
# optional way to specify upper bound explicitly
# do not use with `max_pin`
upper_bound: 2.0.1

# make a dependency version constraint looser
- loosen_depends:
# package to pin looser
name: matplotlib # this field can use the fnmatch glob syntax
# you must give one of max_pin or upper_bound
# optional pinning expression 'x', 'x.x', etc. to set how much looser to make the pin
max_pin: 'x.x'
# optional way to specify upper bound explicitly
# do not use with `max_pin`
upper_bound: 2.0.1
---
# more than one patch can be in the file by putting the next one here as a new YAML doc
if:
...
then:
...
---
if:
...
then:
...
```

> [!WARNING]
> The condition `timestamp_lt` is required to prevent your patch from modifying
> any packages built in the future. Don't forget to calculate it with `python -c
> "import time; print(f'{time.time():.0f}000')"` and include it in the `if:`
> section of your patch
Loading