Stable 2.12.5+plaid 1 :: [PATCH] Fix bug where topology routing would not disable while service was under load. #191

Draft · wants to merge 43 commits into base: main
Conversation

jandersen-plaid
Owner

github.com/linkerd#10925

Add support for enabling and disabling topology aware routing when hints are added/removed.

The testing setup is quite involved because there are so many moving parts:

  1. Set up a service which is layered over several availability zones.
    1a) The best way to do this is one Service object, with 3 ReplicaSets explicitly pinned to a specific AZ each.

  2. Add the `service.kubernetes.io/topology-aware-hints: Auto` annotation to the Service object

  3. Use a load tester like k6 to send meaningful traffic to your service, but only from one AZ

  4. Scale up your ReplicaSets until Kubernetes adds hints to your EndpointSlices

  5. Observe that traffic shifts to only hit pods in one AZ

  6. Scale down the ReplicaSets until Kubernetes removes the hints from your EndpointSlices

  7. Observe that traffic shifts back to all pods across all AZs.

Note: Patch applied on top of stable-2.12.5 with small adjustments

Opening as a PR to keep track of the branch.
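
For context, here is a minimal Go sketch of the decision this patch revolves around (types from `k8s.io/api/discovery/v1`; the helper names are hypothetical, not the actual patch): topology-aware filtering should only apply while every ready endpoint carries a zone hint, and must be disabled again as soon as Kubernetes strips the hints.

```go
package sketch

import (
	discoveryv1 "k8s.io/api/discovery/v1"
)

// shouldFilterByZone reports whether topology-aware filtering is safe to
// apply: every ready endpoint must carry a zone hint. If Kubernetes removes
// the hints (e.g. after scaling down), filtering must be disabled again.
func shouldFilterByZone(slices []*discoveryv1.EndpointSlice) bool {
	hinted := false
	for _, slice := range slices {
		for _, ep := range slice.Endpoints {
			if ep.Conditions.Ready != nil && !*ep.Conditions.Ready {
				continue // ignore endpoints that are not ready
			}
			if ep.Hints == nil || len(ep.Hints.ForZones) == 0 {
				return false // an unhinted endpoint disables filtering entirely
			}
			hinted = true
		}
	}
	return hinted
}

// filterForZone keeps only the endpoints hinted for the local zone; callers
// are expected to check shouldFilterByZone first.
func filterForZone(slices []*discoveryv1.EndpointSlice, zone string) []discoveryv1.Endpoint {
	var out []discoveryv1.Endpoint
	for _, slice := range slices {
		for _, ep := range slice.Endpoints {
			if ep.Hints == nil {
				continue
			}
			for _, fz := range ep.Hints.ForZones {
				if fz.Name == zone {
					out = append(out, ep)
					break
				}
			}
		}
	}
	return out
}
```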

alpeb and others added 30 commits October 19, 2022 11:54
## stable-2.12.2

This stable release fixes an issue with CNI chaining that was preventing the
Linkerd CNI plugin from working with other CNI plugins such as Cilium. It also
fixes some sections of the Viz dashboard appearing blank, and adds an optional
PodMonitor resource to the Helm chart to enable easier integration with the
Prometheus Operator. Several other fixes are included.

* Proxy
  * Fixed proxies emitting some duplicate inbound metrics

* Control Plane
  * Fixed handling of `.conf` files in the CNI plugin so that the Linkerd CNI
    plugin can be used alongside other CNI plugins such as Cilium
  * Added a noop init container to injected pods when the CNI plugin is enabled
    to prevent certain scenarios where a pod can get stuck without an IP address
  * Fixed the `NotIn` label selector operator in the policy resources being
    erroneously treated as `In`.
  * Fixed a bug where the `config.linkerd.io/proxy-version` annotation could be
    empty

* CLI
  * Added a `linkerd diagnostics policy` command to inspect Linkerd policy state
  * Added a check that ClusterIP services are in the cluster networks
  * Expanded the `linkerd authz` command to display AuthorizationPolicy
    resources that target namespaces (thanks @aatarasoff!)
  * Fixed warning logic in the "linkerd-viz ClusterRoles exist" and "linkerd-viz
    ClusterRoleBindings exist" checks in `linkerd viz check`
  * Fixed the CLI ignoring the `--api-addr` flag (thanks @mikutas!)

* Helm
  * Added an optional PodMonitor resource to the main Helm chart (thanks
    @jaygridley!)

* Dashboard
  * Fixed the dashboard sections Tap, Top, and Routes appearing blank (thanks
    @MoSattler!)
  * Updated Grafana dashboards to use variable duration parameter so that they
    can be used when Prometheus has a longer scrape interval (thanks @TarekAS)

Currently, the `noop` init container created by the Linkerd CNI plugin
causes issues when a workload with
```yaml
securityContext:
  runAsNonRoot: true
```
is injected, since it will add a container that runs as root to that
workload.

This branch resolves this issue by changing the Helm template for the
noop init container to use the same user as the `proxyInit` init
container. I've tested this by injecting a deployment with the above
`securityContext` configuration and verifying that the `noop` init
container is now allowed to run.

This PR is against the `release/stable-2.12` branch, as the `noop` init
container has been removed on the edge branch (as it was replaced with
the CNI validator init container).

Fixes linkerd#9671
… no ClusterIP (linkerd#9662)

Fixes linkerd#9661

This excludes any service with no ClusterIP from this check, which
includes the services of type ExternalName.
Signed-off-by: Steve Jenson <[email protected]>
When installing the multicluster extension through the CLI, the gateway's `pause` container `runAsUser` field is empty. K8s then uses the UID defined in the `pause` image, which is [65535](https://github.com/kubernetes/kubernetes/blob/master/build/pause/Dockerfile#L19).

The source of the problem is that the `gateway.UID` values.yaml entry isn't backed by an entry in the multicluster `values.go`'s `Gateway` struct.

How to repro:

```bash
# before the fix
$ linkerd mc install --ignore-cluster | grep runAsUser
            runAsUser:

# after the fix
$ linkerd mc install --ignore-cluster | grep runAsUser
            runAsUser: 2103
```
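
A minimal sketch of the shape of the fix (struct and field names assumed from the description above, not copied from the real diff): the `gateway.UID` value in values.yaml only takes effect if the Go values struct that the CLI renders from actually carries a matching field.

```go
package sketch

// Gateway mirrors the multicluster chart's values.yaml. Before the fix the
// UID entry existed in values.yaml but had no counterpart in the values
// struct, so the CLI rendered `runAsUser:` with an empty value.
type Gateway struct {
	Enabled bool   `json:"enabled"`
	Port    uint32 `json:"port"`
	UID     int64  `json:"UID"` // backs gateway.UID (2103 in values.yaml)
}
```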
…linkerd#9575)

Having the proxyProtocol listed as HTTP/1 in the multicluster gateway Server is confusing because this value is actually unused in the case of multicluster (since Linkerd wraps all multicluster traffic in its own opaque transport protocol). We delete the proxyProtocol line here altogether (`unknown` is the default) to minimize confusion.

Fixes linkerd#9574

Signed-off-by: Peter Smit <[email protected]>
https://github.com/linkerd/linkerd2/blob/main/web/app/index_bundle.js.lodash.tmpl#L4-L17

Some browser plugins will insert script tags into the HTML page, resulting in
wrong root paths.

Fixes: linkerd#9438

Signed-off-by: Ye Sijun <[email protected]>
Fix upgrade when using --from-manifests

When the `--from-manifests` flag is used to upgrade through the CLI,
the kube client used to fetch existing configuration (from the
ConfigMap) is a "fake" client. The fake client returns values from a
local source. The two clients are used interchangeably to perform the
upgrade; which one is initialized depends on whether a value has been
passed to `--from-manifests`.

Unfortunately, this breaks CLI upgrades to any stable-2.12.x version
when the flag is used. Since a fake client is used, the upgrade will
fail when checking for the existence of CRDs, even if the CRDs have been
previously installed in the cluster.

This change fixes the issue by first initializing an actual Kubernetes
client (that will be used to check for CRDs). If the values should be
read from a local source, the client is replaced with a fake one. Since
this takes place after the CRD check, the upgrade will not fail on the
CRD precondition.

Fixes linkerd#9788

Signed-off-by: Matei David <[email protected]>
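
A rough Go sketch of the ordering described above (names are illustrative, not the real CLI code): run the CRD precondition against a real client first, and only then swap in the fake client backed by the local manifests.

```go
package sketch

import "k8s.io/client-go/kubernetes"

// chooseUpgradeClient sketches the ordering described above. The CRD
// precondition is evaluated against the real cluster even when the upgrade
// values will be read from local manifests.
func chooseUpgradeClient(
	realClient kubernetes.Interface,
	checkCRDs func(kubernetes.Interface) error,
	fakeFromManifests func(path string) (kubernetes.Interface, error),
	fromManifests string,
) (kubernetes.Interface, error) {
	// 1. Run preconditions against the actual cluster.
	if err := checkCRDs(realClient); err != nil {
		return nil, err
	}
	// 2. Only then, if --from-manifests was given, swap in the fake client
	//    that serves configuration from the local source.
	if fromManifests != "" {
		return fakeFromManifests(fromManifests)
	}
	return realClient, nil
}
```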
When calling `linkerd upgrade`, if the `linkerd-config-overrides` Secret is not found then we ask the user to run `linkerd repair`, but that command has long been removed from the CLI.
Also removed a code comment, as the error is explicit enough.
* Use self-hosted runner for ARM64 integration tests

This refactors the "ARM64 integration tests" job in `relase.yaml` to
use an ARM self-hosted runner tagged with `[self-hosted, Linux, ARM64]`,
tied to the linkerd github org.

We no longer use a local (linux/x86_64) linkerd CLI that connects to an
existing k3s instance in the host. Instead, we run the CLI ARM64 binary
in the host itself, after creating the cluster with k3d (which always
gets torn down at the end of the tests, regardless of their success).

Please check the "ARM CI host at Equinix Metal" doc in Notion for the
host setup.

## Other Changes

- The cni test was removed.
- Replaced `"$bindir"/docker` with just `docker` in `bin/image-load` as
  we do elsewhere.
- Properly detect k3d arch in `bin/k3d`
The problem was our `TAG` environment variable (set to `edge-22.11.2`)
which was conflicting with an env var of the same name in the k3d
install.sh script.
…e network (linkerd#9819)

Maps the request port to the container's port if the request comes in from the node network and has a hostPort mapping.

Problem:

When a request for a container comes in from the node network, the node port is used, ignoring the hostPort mapping.

Solution:

When a request is seen coming from the node network, get the containerPort from the pod spec.

Validation:

Fixed an existing unit test and wrote a new one driving GetProfile specifically.

Fixes linkerd#9677 

Signed-off-by: Steve Jenson <[email protected]>
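
A minimal sketch of the lookup described above, with a simplified signature (the real GetProfile code differs): when a request arrives via the node network on a hostPort, translate it to the container's own port using the pod spec.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// containerPortForHostPort translates a request arriving on the node network:
// the caller dials the hostPort, but the proxy must be addressed on the
// container's own port, so we map via the pod spec.
func containerPortForHostPort(pod *corev1.Pod, hostPort int32) (int32, bool) {
	for _, c := range pod.Spec.Containers {
		for _, p := range c.Ports {
			if p.HostPort == hostPort {
				return p.ContainerPort, true
			}
		}
	}
	return 0, false // no mapping: fall back to the requested port
}
```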
This change aims to solve two distinct issues that have cropped up in
the proxy-init configuration.

First, it decouples `allowPrivilegeEscalation` from running proxy-init
as root. At the moment, whenever the container is run as root, privilege
escalation is also allowed. In more restrictive environments, this will
prevent the pod from coming up (e.g security policies may complain about
`allowPrivilegeEscalation=true`). Worth noting that privilege escalation
is not necessary in many scenarios since the capabilities are passed to
the iptables child process at build time.

Second, it introduces a new `privileged` value that will allow users to
run the proxy-init container without any restrictions (meaning all
capabilities are inherited). This is essentially the same as mapping
root on host to root in the container. This value may solve issues in
distributions that run security-enhanced Linux, since iptables will be
able to load kernel modules that it may otherwise not be able to load
(privileged mode allows the container nearly the same privileges as
processes running outside of a container on a host, this further allows
the container to set configurations in AppArmor or SELinux).

Privileged mode is independent from running the container as root. This
gives users more control over the security context in proxy-init. The
value may still be used with `runAsRoot: false`.

Fixes linkerd#9718

Signed-off-by: Matei David <[email protected]>
Fixes linkerd#9896

The maps in `endpointTranslator` weren't being guarded against
concurrent access, so we're adding locks in the `Add` and `Remove`
methods. These methods also ultimately call the `SendMsg` method on
the gRPC `stream`, which is not
["thread-safe"](https://github.com/grpc/grpc-go/blob/master/stream.go#L122-L126),
so we're guarding against other problems as well.
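
A minimal sketch of the guarding pattern (field and type names assumed, not the actual `endpointTranslator` code): a single mutex serializes both map mutation and the non-thread-safe send to the stream.

```go
package sketch

import "sync"

// translator sketches the locking pattern described above: one mutex guards
// both the internal address map and the (non-thread-safe) send to the stream.
type translator struct {
	mu        sync.Mutex
	available map[string]struct{}
	send      func(addresses []string) error // stands in for stream.SendMsg
}

func (t *translator) Add(addresses ...string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, a := range addresses {
		t.available[a] = struct{}{}
	}
	return t.sendFilteredUpdate()
}

func (t *translator) Remove(addresses ...string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, a := range addresses {
		delete(t.available, a)
	}
	return t.sendFilteredUpdate()
}

// sendFilteredUpdate is always called with t.mu held, so the send is never
// invoked from two goroutines at once.
func (t *translator) sendFilteredUpdate() error {
	out := make([]string, 0, len(t.available))
	for a := range t.available {
		out = append(out, a)
	}
	return t.send(out)
}
```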

A new unit test `TestConcurrency` was added that failed in the following
ways before this fix:

When running the test with the `-race` flag, we immediately get the data race warning:

```bash
$ go test ./controller/api/destination/... -run TestConcurrency -race
time="2022-11-25T16:48:52-05:00" level=info msg="waiting for caches to sync"
time="2022-11-25T16:48:52-05:00" level=info msg="caches synced"
==================
WARNING: DATA RACE
Read at 0x00c0000c0040 by goroutine 161:
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Add()
      /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator.go:80 +0x29c
  github.com/linkerd/linkerd2/controller/api/destination.TestConcurrency.func1()
      /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator_test.go:338 +0x92

Previous write at 0x00c0000c0040 by goroutine 162:
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).sendFilteredUpdate()
      /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator.go:95 +0x66
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Add()
      /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator.go:83 +0x330
  github.com/linkerd/linkerd2/controller/api/destination.TestConcurrency.func1()
      /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator_test.go:338 +0x92

Goroutine 161 (running) created at:
  github.com/linkerd/linkerd2/controller/api/destination.TestConcurrency()
      /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator_test.go:336 +0x6f
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1439 +0x213
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1486 +0x47

Goroutine 162 (running) created at:
  github.com/linkerd/linkerd2/controller/api/destination.TestConcurrency()
      /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator_test.go:336 +0x6f
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1439 +0x213
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1486 +0x47
```

If run without the `-race` flag, we get the `concurrent map writes` panic reported in linkerd#9896:

```bash
$ go test ./controller/api/destination/... -run TestConcurrency -count=1
time="2022-11-25T16:53:25-05:00" level=info msg="waiting for caches to sync"
time="2022-11-25T16:53:25-05:00" level=info msg="caches synced"
fatal error: concurrent map writes

goroutine 187 [running]:
runtime.throw({0x1b57bc4?, 0x500000000000000?})
        /usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc00013dc80 sp=0xc00013dc50 pc=0x43a5b1
runtime.mapassign(0xc00013dec8?, 0x2?, 0x0?)
        /usr/local/go/src/runtime/map.go:595 +0x4d6 fp=0xc00013dd00 sp=0xc00013dc80 pc=0x4113b6
github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Add(...)
        /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator.go:80
github.com/linkerd/linkerd2/controller/api/destination.TestConcurrency.func1()
        /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator_test.go:338 +0x1a8 fp=0xc00013dfe0 sp=0xc00013dd00 pc=0x16d1da8
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00013dfe8 sp=0xc00013dfe0 pc=0x46d721
created by github.com/linkerd/linkerd2/controller/api/destination.TestConcurrency
        /home/alpeb/pr/destination-panic/linkerd2/controller/api/destination/endpoint_translator_test.go:336 +0x3c
```
…nkerd#9918)

When performing the HostPort mapping introduced in linkerd#9819, the `containsIP` function iterates through the pod IPs searching for a match against `targetIP` using `ip.String()`, but that returns something like `&PodIP{IP: xxx}`. Fixed that to just use `ip.IP`, and also completed the test fixtures to include both `PodIP` and `PodIPs` in the pod manifests.

Note this wasn't affecting the end result, it was just producing an extra warning as shown below, that this change eliminates:

```bash
$ go test -v ./controller/api/destination/... -run TestGetProfiles
=== RUN   TestGetProfiles
...
=== RUN   TestGetProfiles/Return_profile_with_endpoint_when_using_pod_DNS
time="2022-11-29T09:38:48-05:00" level=info msg="waiting for caches to sync"
time="2022-11-29T09:38:49-05:00" level=info msg="caches synced"
time="2022-11-29T09:38:49-05:00" level=warning msg="unable to find container port as host (172.17.13.15) matches neither PodIP nor HostIP (&Pod{ObjectMeta:{pod-0  ns    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[linkerd.io/control-plane-ns:linkerd] map[] [] [] []},Spec:PodSpec{Volumes:[]Volume{},Containers:[]Container{},RestartPolicy:,TerminationGracePeriodSeconds:nil,ActiveDeadlineSeconds:nil,DNSPolicy:,NodeSelector:map[string]string{},ServiceAccountName:,DeprecatedServiceAccount:,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:nil,ImagePullSecrets:[]LocalObjectReference{},Hostname:,Subdomain:,Affinity:nil,SchedulerName:,InitContainers:[]Container{},AutomountServiceAccountToken:nil,Tolerations:[]Toleration{},HostAliases:[]HostAlias{},PriorityClassName:,Priority:nil,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[]PodReadinessGate{},RuntimeClassName:nil,EnableServiceLinks:nil,PreemptionPolicy:nil,Overhead:ResourceList{},TopologySpreadConstraints:[]TopologySpreadConstraint{},EphemeralContainers:[]EphemeralContainer{},SetHostnameAsFQDN:nil,OS:nil,HostUsers:nil,},Status:PodStatus{Phase:Running,Conditions:[]PodCondition{},Message:,Reason:,HostIP:,PodIP:172.17.13.15,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{},QOSClass:,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},},})" test=TestGetProfiles/Return_profile_with_endpoint_when_using_pod_DNS
```
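
A minimal sketch of the corrected comparison (simplified; not the actual `containsIP` code): the string to match against `targetIP` is the `IP` field of each `PodIP`, not its `String()` rendering.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// containsIP compares targetIP against the IP field of each PodIP (and the
// primary PodIP), never against PodIP.String(), which renders the whole
// struct (e.g. "&PodIP{IP: ...}") and therefore never matches.
func containsIP(pod *corev1.Pod, targetIP string) bool {
	if pod.Status.PodIP == targetIP {
		return true
	}
	for _, ip := range pod.Status.PodIPs {
		if ip.IP == targetIP {
			return true
		}
	}
	return false
}
```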
When CNI plugins run in ebpf mode, they may rewrite the packet
destination when doing socket-level load balancing (i.e in the
`connect()` call). In these cases, skipping `443` on the outbound side
for control plane components becomes redundant; the packet is re-written
to target the actual Kubernetes API Server backend (which typically
listens on port `6443`, but may be overridden when the cluster is
created).

This change adds port `6443` to the list of skipped ports for control
plane components. On the linkerd-cni plugin side, the ports are
non-configurable. Whenever a pod with the control plane component label
is handled by the plugin, we look up the `kubernetes` service in the
default namespace and append the port values (of both ClusterIP and
backend) to the list.

On the initContainer side, we make this value configurable in Helm and
provide a sensible default (`443,6443`). Users may override this value
if the ports do not correspond to what they have in their cluster. In
the CLI, if no override is given, we look up the service in the same way
that we do for linkerd-cni; if failures are encountered we fall back to
the default list of ports from the values file.

Closes linkerd#9817

Signed-off-by: Matei David <[email protected]>
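
A rough sketch of the lookup described above (error handling trimmed, helper names assumed): read the `kubernetes` Service in the `default` namespace and collect both the ClusterIP port and the backend target port, falling back to `443,6443` when the lookup fails.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// defaultSkipPorts is the fallback used when the API server can't be queried.
var defaultSkipPorts = []int32{443, 6443}

// apiServerPorts collects the ClusterIP port and the backend target port of
// the `kubernetes` Service in the default namespace.
func apiServerPorts(ctx context.Context, k8s kubernetes.Interface) []int32 {
	svc, err := k8s.CoreV1().Services("default").Get(ctx, "kubernetes", metav1.GetOptions{})
	if err != nil {
		return defaultSkipPorts
	}
	var ports []int32
	for _, p := range svc.Spec.Ports {
		ports = append(ports, p.Port)                         // ClusterIP port (usually 443)
		ports = append(ports, int32(p.TargetPort.IntValue())) // backend port (usually 6443)
	}
	return ports
}
```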
* build(deps): bump actions/checkout from 3.0.2 to 3.1.0 (linkerd/linkerd2-proxy#1951)
* build(deps): bump arbitrary from 1.1.4 to 1.1.7 (linkerd/linkerd2-proxy#1953)
* build(deps): bump tj-actions/changed-files from 29.0.9 to 32.0.0 (linkerd/linkerd2-proxy#1952)
* build(deps): bump tj-actions/changed-files from 32.0.0 to 32.1.2 (linkerd/linkerd2-proxy#1958)
* build(deps): bump tokio-stream from 0.1.9 to 0.1.11 (linkerd/linkerd2-proxy#1954)
* build(deps): bump libfuzzer-sys from 0.4.3 to 0.4.5 (linkerd/linkerd2-proxy#1960)
* build(deps): bump anyhow from 1.0.64 to 1.0.65 (linkerd/linkerd2-proxy#1955)
* dev: Update to dev:v32 with Rust 1.64 (linkerd/linkerd2-proxy#1961)
* build(deps): bump actions/download-artifact from 3.0.0 to 3.0.1 (linkerd/linkerd2-proxy#1962)
* build(deps): bump prettyplease from 0.1.19 to 0.1.21 (linkerd/linkerd2-proxy#1963)
* build(deps): bump bumpalo from 3.11.0 to 3.11.1 (linkerd/linkerd2-proxy#1965)
* build(deps): bump actions/checkout from 3.0.2 to 3.1.0 (linkerd/linkerd2-proxy#1968)
* build(deps): bump actions/upload-artifact from 3.1.0 to 3.1.1 (linkerd/linkerd2-proxy#1966)
* build(deps): bump tj-actions/changed-files from 32.1.2 to 34.1.1 (linkerd/linkerd2-proxy#1972)
* build(deps): bump lock_api from 0.4.8 to 0.4.9 (linkerd/linkerd2-proxy#1976)
* build(deps): bump unicode-normalization from 0.1.21 to 0.1.22 (linkerd/linkerd2-proxy#1977)
* build(deps): bump extractions/setup-just from 1.4.0 to 1.5.0 (linkerd/linkerd2-proxy#1974)
* build(deps): bump tj-actions/changed-files from 34.1.1 to 34.3.2 (linkerd/linkerd2-proxy#1975)
* build(deps): bump tj-actions/changed-files from 34.3.2 to 34.3.4 (linkerd/linkerd2-proxy#1978)
* build(deps): bump rustls from 0.20.6 to 0.20.7 (linkerd/linkerd2-proxy#1979)
* build(deps): bump tonic-build from 0.8.0 to 0.8.2 (linkerd/linkerd2-proxy#1980)
* build(deps): bump syn from 1.0.99 to 1.0.103 (linkerd/linkerd2-proxy#1981)
* build(deps): bump smallvec from 1.9.0 to 1.10.0 (linkerd/linkerd2-proxy#1982)
* Bump hyper & h2 (linkerd/linkerd2-proxy#1983)
* build(deps): bump arbitrary from 1.1.7 to 1.2.0 (linkerd/linkerd2-proxy#1984)
* build(deps): bump num_cpus from 1.13.1 to 1.14.0 (linkerd/linkerd2-proxy#1985)

Signed-off-by: Oliver Gould <[email protected]>
Closes linkerd#10162.

This adds resource limits to the `noop` initContainer which will allow users who
require resource quotas to have a more seamless upgrade experience for stable
2.12 patches.

I chose the current values by halving the current resource limits of the
`proxy-init` initContainer; the `noop` initContainer basically does nothing so
we shouldn't run into issues with those limits.

The `noop` initContainer is replaced by the proxy-validator container in the
current edge releases, so this is a temporary fix that will allow users to
upgrade through the stable 2.12 patches. For this reason, I didn't add
additional templating to make this configurable.

Signed-off-by: Kevin Leimkuhler <[email protected]>
The output of the `linkerd viz tap` CLI command has wrong values for the
latency/duration fields. This happens with the default output format and
with the `-o wide` option, but works correctly with `-o json`; the dashboard
also shows proper values.

The solution is to display duration with `AsDuration().Microseconds()`.

Updated an existing test and fixed a couple of golden files.

Fixes: linkerd#9878

Signed-off-by: Oleg Vorobev <[email protected]>
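
For reference, a small hedged example of the conversion the fix relies on, using the standard protobuf duration type (not the actual tap rendering code):

```go
package sketch

import (
	"fmt"

	"google.golang.org/protobuf/types/known/durationpb"
)

// formatLatency converts through AsDuration() so the value is interpreted as
// a time.Duration before taking microseconds, rather than misreading the raw
// proto fields.
func formatLatency(d *durationpb.Duration) string {
	return fmt.Sprintf("%dµs", d.AsDuration().Microseconds())
}
```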
…kerd#10013)

Fixes linkerd#10003

When endpoints are removed from an EndpointSlice resource, the destination controller builds a list of addresses to remove.  However, if any of the removed endpoints have a Pod as their targetRef, we will attempt to fetch that pod to build the address to remove.  If that pod has already been removed from the informer cache, this will fail and the endpoint will be skipped in the list of endpoints to be removed.  This results in stale endpoints being stuck in the address set and never being removed.

We update the endpoint watcher to construct only a list of endpoint IDs for endpoints to remove, rather than fetching the entire pod object.  Since we no longer attempt to fetch the pod, this operation is now infallible and endpoints will no longer be skipped during removal.

We also add a `TestEndpointSliceScaleDown` test to exercise this.

Signed-off-by: Alex Leong <[email protected]>
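
A minimal sketch of the change in approach (types simplified, helper name hypothetical): derive the IDs to remove directly from the EndpointSlice, so removal never depends on the targetRef pod still being present in the informer cache.

```go
package sketch

import discoveryv1 "k8s.io/api/discovery/v1"

// endpointIDsToRemove builds the removal set straight from the EndpointSlice
// being deleted instead of fetching each targetRef pod, which may already be
// gone from the informer cache. This makes removal infallible.
func endpointIDsToRemove(slice *discoveryv1.EndpointSlice) []string {
	var ids []string
	for _, ep := range slice.Endpoints {
		for _, addr := range ep.Addresses {
			ids = append(ids, addr) // the address alone identifies the entry to drop
		}
	}
	return ids
}
```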
…d#10071)

Helm chart has `identity.externalCA` value.
CLI code sets `identity.issuer.externalCA` and fails to produce the desired configuration. This change aligns everything to `identity.externalCA`.

Signed-off-by: Dmitry Mikhaylov <[email protected]>
Removed old `replace` directives in `go.mod` that are no longer
required, and updated the entry for `containerd` to address
[CVE-2022-23471](https://github.com/linkerd/linkerd2/security/dependabot/37).
Fixes linkerd#10164

The version of go-restful that we depend on has been flagged as a security vulnerability.  Even though this vulnerability does not affect Linkerd, we upgrade this dependency to silence security warnings.

Signed-off-by: Alex Leong <[email protected]>
adleong and others added 13 commits February 6, 2023 10:02
…0235)

Fixes linkerd#10138 

Evaluating Helm expressions like `.cpu.limit` will fail with a nil pointer dereference error if `.cpu` is nil.  If, for example, `.memory` is set but `.cpu` is not, the resources template will be executed but will fail.

We add parentheses to cause these expressions to be evaluated as a pipeline.  If the input to a pipeline stage is an empty value (such as nil), no output will be emitted to the next stage of the pipeline.  This allows for more graceful dereference chaining.  For example, when evaluating `(.cpu).limit`, if `(.cpu)` is nil, the rendering engine will not try to evaluate `nil.limit` but instead will emit no output for this expression.

Signed-off-by: Alex Leong <[email protected]>
Fixes linkerd#10036

The Linkerd control plane components written in Go serve liveness and readiness probe endpoints on their admin server.  However, the admin server is not started until the k8s informer caches are synced, which can take a long time on large clusters.  This means that liveness checks can time out, causing the controller to be restarted.

We start the admin server before attempting to sync caches so that we can respond to liveness checks immediately.  We fail readiness probes until the caches are synced.

Signed-off-by: Alex Leong <[email protected]>
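
A minimal sketch of the startup ordering described above (hypothetical names, not the actual admin server code): liveness is served as soon as the process starts, while readiness stays false until the informer caches have synced.

```go
package sketch

import (
	"net/http"
	"sync/atomic"
)

// startAdmin serves liveness immediately and flips readiness only after the
// caches-synced signal fires.
func startAdmin(addr string, cachesSynced <-chan struct{}) {
	var ready atomic.Bool

	mux := http.NewServeMux()
	mux.HandleFunc("/live", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/ready", func(w http.ResponseWriter, _ *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// Serve immediately so liveness probes succeed even while the informer
	// caches are still syncing (which can take a long time on large clusters).
	go http.ListenAndServe(addr, mux)

	go func() {
		<-cachesSynced // e.g. closed once cache sync completes
		ready.Store(true)
	}()
}
```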
Fixes linkerd#8270

When a listener unsubscribes from port updates in Servers, we were
removing the listener from the `ServerWatcher.subscriptions` map, leaving
the map's key (`podPort`, which holds the pod object and port) with an
empty value. In clusters where there's a lot of pod churn, those keys
with empty values were accumulating, so this change cleans that
up.

The repro (basically constantly rolling emojivoto) is described in
linkerd#9947.

A follow-up will be posted shortly adding metrics to track these subscriptions,
along with similar missing metrics from other parts of Destination.
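
A minimal sketch of the cleanup (map shape assumed, not the actual `ServerWatcher` code): when the last listener for a pod/port key unsubscribes, delete the key itself rather than leaving an empty entry behind.

```go
package sketch

type podPort struct {
	pod  string
	port uint32
}

type listener interface{ Update(ports []uint32) }

type serverWatcher struct {
	subscriptions map[podPort]map[listener]struct{}
}

// Unsubscribe removes the listener and, if it was the last one for this
// pod/port key, removes the key as well so the map stays bounded under
// heavy pod churn.
func (s *serverWatcher) Unsubscribe(key podPort, l listener) {
	listeners, ok := s.subscriptions[key]
	if !ok {
		return
	}
	delete(listeners, l)
	if len(listeners) == 0 {
		delete(s.subscriptions, key)
	}
}
```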
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.24.2 to 1.25.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/commits/tokio-1.25.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…nkerd#10225)

GitHub Actions has upgraded from `docker buildx 0.9.1+azure-2` to `buildx 0.10.0+azure-1`, which by default adds provenance attestations to manifests (https://github.com/docker/buildx/releases/tag/v0.10.0).  This means that our platform-specific images now contain multiple manifests because the attestation counts as a manifest:

```console
> docker buildx imagetools inspect ghcr.io/linkerd/policy-controller:edge-23.1.2-amd64 --format "{{ json .Manifest.Manifests }}"
[
  {
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "digest": "sha256:1abeb519e76c71c7285b4435a3f85dd73f9c1982905a5a2ca59e0abb279f09aa",
    "size": 1055,
    "platform": {
      "architecture": "amd64",
      "os": "linux"
    }
  },
  {
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "digest": "sha256:1254d52f1bd4ffd1c17688f39e01a6796b85bbcaf07bf83bdeb1c88ebe5b4657",
    "size": 566,
    "annotations": {
      "vnd.docker.reference.digest": "sha256:1abeb519e76c71c7285b4435a3f85dd73f9c1982905a5a2ca59e0abb279f09aa",
      "vnd.docker.reference.type": "attestation-manifest"
    },
    "platform": {
      "architecture": "unknown",
      "os": "unknown"
    }
  }
]
```

This causes the creation of our multi-arch image to fail because the `docker manifest create` command expects each of the constituent images to contain a single manifest each.

We set `--provenance=false` to skip adding the attestation manifest.

Signed-off-by: Alex Leong <[email protected]>
A missing `\` caused the `--provenance` flag to be interpreted as a separate command rather than a continuation of the previous one.  This caused the build action to fail.

Add the missing `\` character.

Signed-off-by: Alex Leong <[email protected]>
)

* increase memory limit for no-op container to 25 megabytes

Signed-off-by: Steve Jenson <[email protected]>

* go test ./... -update

Signed-off-by: Steve Jenson <[email protected]>

---------

Signed-off-by: Steve Jenson <[email protected]>
## stable-2.12.5

This stable release fixes an incompatibility issue with the AWS CNI addon in EKS
that was preventing pods from acquiring networking after scaling up nodes (thanks
@frimik!). It also includes security updates for dependencies.

* Detached the linkerd-cni plugin's version from linkerd's and bumped to v1.1.1
  to fix incompatibility with EKS' AWS CNI addon
* Bumped the memory limit for the no-op init container to 25Mi to address issues
  on OKE environments
* Updated `h2` dependency in the policy controller to include a patch for a
  theoretical denial-of-service vulnerability discovered in CVE-2023-26964
* Updated `openssl` dependency in the policy controller, addressing
  RUSTSEC-2023-0022, RUSTSEC-2023-0023 and RUSTSEC-2023-0024
…le service was under load. (github.com/linkerd#10925)

Add support for enabling and disabling topology aware routing when hints are added/removed.

The testing setup is quite involved because there are so many moving parts:

1) Set up a service which is layered over several availability zones.
1a) The best way to do this is one Service object, with 3 ReplicaSets explicitly pinned to a specific AZ each.
2) Add the `service.kubernetes.io/topology-aware-hints: Auto` annotation to the Service object
3) Use a load tester like k6 to send meaningful traffic to your service, but only from one AZ
4) Scale up your ReplicaSets until Kubernetes adds hints to your EndpointSlices
5) Observe that traffic shifts to only hit pods in one AZ
6) Scale down the ReplicaSets until Kubernetes removes the hints from your EndpointSlices
7) Observe that traffic shifts back to all pods across all AZs.

Note: Patch applied on top of stable-2.12.5 with small adjustments