[WIP] OpenShift Mirror Registry Upgrade Proposal #11

Open · wants to merge 7 commits into base: main
183 changes: 183 additions & 0 deletions enhancements/mirror-registry-upgrade.md
@@ -0,0 +1,183 @@
---
title: Mirror Registry Upgrade
authors:
- "@HammerMeetNail"
reviewers:
- "@joking"
approvers:
- "@joking"
creation-date: 2022-01-20
last-updated: 2022-01-20
status: provisional
---

# mirror-registry-upgrade
The mirror registry for Red Hat OpenShift is meant to be a long-lived registry that runs on a single RHEL host. It supports mirroring content for the last three versions of OCP and must be able to handle upgrades gracefully. This proposal details how the mirror registry will be able to upgrade to new versions, e.g., 1.0 -> 1.1.0.

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA

## Open Questions [optional]

> 1. When should an upgrade occur?
> 2. How should we handle certificate renewal?
> 3. Should we add a new command for upgrades, or should the install command trigger an upgrade?
> 4. What would a self-upgrading mirror registry look like in a disconnected environment?

## Summary
The mirror registry will need to be upgraded whenever a new y-stream release of `Quay` or `OCP` occurs. It should also be upgraded whenever a z-stream release occurs for `Quay`, `Postgres`, or `Redis`. Certificates will be renewed at the time of upgrade, which should occur at least once per year. Existing installations will be detected, and no additional information should be required for an upgrade. Upgrades will be idempotent.
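
As a rough illustration of this behaviour, the check below sketches when an upgrade should run; the version and certificate metadata shown are hypothetical placeholders, not files the installer actually writes today.

```python
from datetime import datetime, timedelta

# Hypothetical records of component versions; the real installer may track
# these differently (or not at all yet).
installed = {"quay": "3.6.1", "postgres": "10.12", "redis": "6.2.6"}
bundled = {"quay": "3.6.2", "postgres": "10.12", "redis": "6.2.6"}

CERT_MAX_AGE = timedelta(days=365)  # renew autogenerated certs at least yearly


def needs_upgrade(installed, bundled, cert_created):
    """Return True when any bundled component is newer (y- or z-stream)
    or the autogenerated certificate is due for renewal."""
    for name, current in installed.items():
        if tuple(map(int, bundled[name].split("."))) > tuple(map(int, current.split("."))):
            return True
    return datetime.utcnow() - cert_created >= CERT_MAX_AGE


# Re-running the check after a successful upgrade returns False (unless the
# certificate has aged out), which is what makes the upgrade idempotent.
print(needs_upgrade(installed, bundled, cert_created=datetime(2021, 6, 1)))
```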

### Goals

* Users should be able to upgrade to newer versions of the mirror registry with ease
* Existing mirror registry installations should be detected and their settings inherited (a detection sketch follows this list)
* Autogenerated certificates should be renewed on upgrade
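
A minimal sketch of how an existing installation could be detected and its settings inherited; the install root and file/field names below are assumptions for illustration, not the tool's actual layout.

```python
import json
from pathlib import Path

# Assumed location of a previous installation; the real path may differ.
INSTALL_ROOT = Path("/etc/quay-install")


def load_existing_settings(root: Path = INSTALL_ROOT) -> dict:
    """Return settings inherited from a prior installation, or an empty dict
    when none is found (in which case a fresh install is performed)."""
    marker = root / "install.json"  # hypothetical settings file
    if not marker.exists():
        return {}
    with marker.open() as f:
        settings = json.load(f)
    # Only carry forward values an upgrade must not regenerate or re-prompt for,
    # e.g. credentials and the storage location.
    keys = ("quayRoot", "initUser", "initPassword")
    return {k: settings[k] for k in keys if k in settings}


if __name__ == "__main__":
    print(load_existing_settings() or "no existing installation detected")
```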

### Non-Goals
* Upgrades to host-level dependencies such as `podman`

> **Reviewer comment:** I'm not sure if you want to consider it in-scope for this particular proposal or not, but another maintenance task folks will want to perform with a long-running registry is pruning old images to avoid unbounded storage growth. In the OpenShift-mirror context, this looks like:
>
> 1. User decides to install OpenShift release oA.
> 2. User installs a new mirror registry, which happens to be version mA.
> 3. New OpenShift releases oB, oC, ... come out and get mirrored into the mirror registry.
> 4. New mirror-registry releases mB, mC, ... come out, and the admin updates their mirror to use those versions.
> 5. User realizes that they really don't need oA and oB in the mirror registry anymore, so they unpin them.
> 6. Registry reaper removes all the layers and manifests and whatnot that were only needed for those unpinned OpenShift releases.
>
> (4) is definitely in-scope for this enhancement. I'm wondering whether (5) is in-scope. I assume (6) would work out of the box, but if not, I'm curious about whether that's in-scope here too.
>
> (5) isn't strictly unique to the mirror-registry workflow. Folks might theoretically want that with highly-available registry implementations as well. But while a few GiB of stale content on a highly-available registry isn't that big a deal, that same amount of cruft on the single-node mirror-registry deployment seems like it might be something that folks would care about. And that's especially likely once we're throwing around TiBs.

## Proposal

This is where we get down to the nitty gritty of what the proposal actually is.

### User Stories [optional]

Detail the things that people will be able to do if this is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.

#### Story 1

#### Story 2

### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation? What are some important details that
didn't come across above? Go into as much detail as necessary here. This might
be a good place to talk about core concepts and how they relate.

### Risks and Mitigations

What are the risks of this proposal and how do we mitigate them? Think broadly. For
example, consider both security and how this will impact the larger Operator Framework
ecosystem.

How will security be reviewed and by whom? How will UX be reviewed and by whom?

Consider including folks that also work outside your immediate sub-project.

## Design Details

### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:
- Maturity levels - `Dev Preview`, `Tech Preview`, `GA`
- Deprecation

Clearly define what graduation means.

#### Examples

These are generalized examples to consider, in addition to the aforementioned
[maturity levels][maturity-levels].

##### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers

##### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

##### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

If applicable, how will the component be upgraded and downgraded? Make sure this
is in the test plan.

Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade in order to make use of the enhancement?

### Version Skew Strategy

How will the component handle version skew with other components?
What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components; how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI
or CNI may require updating that component before the kubelet.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

The idea is to find the best form of an argument for why this enhancement should _not_ be implemented.

## Alternatives

Similar to the `Drawbacks` section the `Alternatives` section is used to
highlight and record other possible approaches to delivering the value proposed
by an enhancement.

## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new
subproject, repos requested, github details, and/or testing infrastructure.

Listing these here allows the community to get the process for these resources
started right away.

128 changes: 128 additions & 0 deletions enhancements/proxy-cache.md
@@ -0,0 +1,128 @@
> **Reviewer comment:** Are these and the other non-update enhancement files attached to this pull request accidentally? I don't see how they are relevant to the mirror-registry-upgrade.md proposal.

---
title: Quay as a Proxy Cache
authors:
- "@fmissi"
- "@sleesinc"
- "@sdadi"
reviewers:
- "@fmissi"
- "@sleesinc"
- "@sdadi"
approvers:
- "@fmissi"
- "@sleesinc"
- "@sdadi"
creation-date: 2021-12-10
last-updated: 2021-12-10
status: implementable
---

# Quay as a cache proxy for upstream registries

Container development has become widely popular. Customers today rely on container images from upstream registries (like Docker Hub or GCP) to get desired services up and running.
Registries now impose rate limits and throttling on the number of times users can pull from them.
This proposal is to enable Quay to act as a pull-through cache where images, once pulled, are only pulled again when the upstream images have been updated.

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA

## Open Questions

> 1. Will the check against the upstream registry count against an upstream rate limit?
> 2. Currently, how can we track when an image was last used (to implement eviction based on LRU images)?
> 3. How will time machine work on images that are past their staleness period?

## Summary

Dependencies on container images have increased tremendously with the adoption of container-driven development. With the introduction of rate limits
on popular container registries, Quay will act as a proxy cache to circumvent pull rate limitations from upstream registries. Adding a cache will also
accelerate pull performance, as images are pulled from the cache rather than from upstream dependencies. Cached images are updated only when the upstream
image digest differs from the cached image.

### Goals

* A Quay user can define and configure (credentials, staleness period), via the config file/app, a repository in Quay that acts as a cache for a specific upstream registry (an illustrative configuration is sketched after this list).
* A Quay superuser can leverage the storage quota of an organization to limit cache size. This means that when the cache size reaches its quota limit,
images are evicted from the cache based on LRU.
* A proxy cache organization will transparently cache and stream images to the client. The images in the proxy cache organization should
support the same default behaviour (security scanner, time machine, etc.) as other images on Quay.
* Given the sensitive nature of accessing a potentially untrusted upstream registry, all cache pulls need to be logged (audit log).
* A Quay user can flush the cache to eliminate excess storage consumption.
* Robot accounts can be created in the cache organization as usual, and their RBAC can be managed for all existing cached repositories at a given point in time.
* Provide metrics to track cache activity and efficiency (hit rate, size, evictions).
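
To make the first goal concrete, a proxy cache organization would carry roughly the configuration sketched below; the field names and defaults are illustrative assumptions, not a final schema.

```python
# Illustrative shape only; the real configuration keys are decided during
# implementation, not by this sketch.
proxy_cache_org = {
    "organization": "docker-cache",            # Quay org acting as the cache
    "upstream_registry": "docker.io/library",  # registry/namespace being proxied
    "upstream_username": "cache-bot",          # optional; anonymous pulls work for public repos
    "upstream_password": "<secret>",
    "staleness_period_s": 86400,               # how long a cached image may be served if upstream is unreachable
    "quota_bytes": 50 * 1024**3,               # LRU eviction starts once this is exceeded
}
```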

### Non-Goals

* In the first phase, configuring a cache proxy organization, caching upstream images, and quota management on cached repositories are the targets.
Other goals will be implemented subsequently based on the work of this proposal.
* Cached images are read-only, which means that images cannot be pushed into a proxy cache organization.

## Design Details

The expected pull flow is depicted below:
![](https://user-images.githubusercontent.com/11522230/145866763-58f44c94-839b-4edb-a95b-b9c3648cf187.png)
Design credits: @fmissi

A user initiates a pull of an image (say, a `postgres:14` image) from an upstream repository on Quay. The repository is checked to see if the image is present. The decision logic across the three cases below is sketched after this list.
1. If the image does not exist, a fresh pull is initiated.
* The user is authenticated against the upstream registry. Authentication is not a strict requirement: for public repositories, Docker Hub supports anonymous pulls,
but for private repositories the user must be authenticated.
* The layers of `postgres:14` are pulled.
* The pulled layers are saved to cache and served to the user in parallel.
* This is depicted as below:
![](https://user-images.githubusercontent.com/11522230/145871778-da01585a-7b1b-4c98-903f-809c214578da.png)
Design credits: @fmissi

2. If the image exists in cache:
* A user can rely on Quay's cache to stay coherent with the upstream source, so that they transparently get newer images from the cache
when tags have been overwritten in the upstream registry, either immediately or after a certain period of time.
* If the upstream image and the cached version of the image are the same:
* No layers are pulled from the upstream repository and the cached image is served to the user.

* If the upstream image and the cached version of the image are different:
    * The user is authenticated against the upstream registry and only the changed layers of `postgres:14` are pulled.
* The new layers are updated in cache and served to the user in parallel.
* This is depicted as below:
![](https://user-images.githubusercontent.com/11522230/145872216-31350e08-6746-4e34-aebf-e59a7bf6b372.png)
Design credits: @fmissi

3. If a user initiates a pull when the upstream registry is down:
* If the pull happens within the configured staleness period, the image stored in cache is served.
* If the pull happens after the configured staleness period, the error is propagated to the user.
* This is depicted as below:
![](https://user-images.githubusercontent.com/11522230/145878373-c23d094b-709d-4859-b875-013ea33e34f7.png)
Design credits: @fmissi
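
The three cases above can be collapsed into a single decision path. The sketch below is a simplified illustration with hypothetical helper objects (`cache`, `upstream`), not actual Quay code.

```python
import time


def pull_through_cache(repo, tag, cache, upstream, staleness_period_s):
    """Serve an image from the cache, going to the upstream registry only
    when the cached copy is missing or its digest has changed."""
    cached = cache.get(repo, tag)  # hypothetical cache lookup

    try:
        upstream_digest = upstream.head_digest(repo, tag)  # cheap digest check
    except ConnectionError:
        # Case 3: upstream is down. Serve the cached image only while it is
        # within the configured staleness period, otherwise surface the error.
        if cached and time.time() - cached.pulled_at < staleness_period_s:
            return cached
        raise

    # Case 2: cache hit and the upstream tag has not moved.
    if cached and cached.digest == upstream_digest:
        return cached

    # Case 1 (fresh pull) or an updated tag: fetch only the layers we lack,
    # saving them to the cache while streaming to the client.
    missing = [layer for layer in upstream.layers(repo, upstream_digest)
               if not cache.has_layer(layer)]
    upstream.fetch_layers(repo, missing, into=cache)
    return cache.store_manifest(repo, tag, upstream_digest)
```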

A Quay admin can leverage the configurable size limit of an organization to limit the cache size so that backend storage consumption remains predictable,
by discarding images from the cache on a least-recently-used basis.
This is depicted as below:
![](https://user-images.githubusercontent.com/11522230/145884935-df19297f-96b5-4c1c-9cdc-e199e04df176.png)
Design credits: @sdadi

A user initiates a pull of an image (say, a `postgres:14` image) from an upstream repository on Quay. If the storage consumption of the organization
is beyond the configured size limit, the least recently used images in the namespace are removed to make space for `postgres:14` to be cached.
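
A sketch of that eviction behaviour, assuming a simple per-organization ledger of image sizes and last-pull timestamps rather than Quay's actual data model:

```python
def evict_lru(images, quota_bytes, incoming_bytes):
    """Evict least-recently-used cached images until the incoming image fits
    under the organization's quota. `images` maps an image reference to
    (size_bytes, last_pulled_ts); returns the references that were evicted."""
    used = sum(size for size, _ in images.values())
    evicted = []
    # Walk images from oldest last-pull timestamp to newest.
    for ref, (size, _) in sorted(images.items(), key=lambda kv: kv[1][1]):
        if used + incoming_bytes <= quota_bytes:
            break
        del images[ref]
        used -= size
        evicted.append(ref)
    return evicted


# Example: caching a 300 MiB image into an org with a 500 MiB quota that
# already holds 450 MiB evicts the older of the two cached images.
cache = {"nginx:1.21": (200 * 2**20, 1640000000.0),
         "postgres:13": (250 * 2**20, 1620000000.0)}
print(evict_lru(cache, quota_bytes=500 * 2**20, incoming_bytes=300 * 2**20))
```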

### Constraints

* If a size limit is configured in a proxy cache organization, and say the org is set to a maximum size of 500 MB while the image the user wants to pull is 700 MB,
the pulled image will still be cached and will overflow beyond the configured limit.

### Risks and Mitigations

* The cached images should have all properties that images on a Quay repository would have.
* When the client wants to pull a new image, Quay should reuse already cached layers and download from the upstream repository only new ones.

### Implementation Plan

* Pull a fresh image and check that layers are saved to cache.
* Pull an outdated image and check that only changed layers are saved to cache.
* Implement configuring of a proxy cache organization.
* Implement quota (configurable size limit) management (blocked by the [quota management feature](https://issues.redhat.com/browse/PROJQUAY-302))

### Implementation History

* 2021-12-13 Initial proposal