diff --git a/design-proposals/migration-target.md b/design-proposals/migration-target.md
new file mode 100644
index 00000000..bc8a8c8d
--- /dev/null
+++ b/design-proposals/migration-target.md
@@ -0,0 +1,151 @@
+# Overview
+We are getting requests from multiple cluster admins who would like to explicitly specify the "destination" of the VM when doing a live migration.
+While this may be less important in a cloud-native environment,
+we get this request from many users coming from other virtualization solutions, where this is a common practice.
+The same result can already be achieved today with a few steps; this proposal is only about simplifying it with a single direct API on the `VirtualMachineInstanceMigration` object, without the need to alter the VM spec.
+
+## Motivation
+In the ideal cloud-native design, the scheduler is supposed to always be able to correctly identify
+the best node to run a pod on (here, the target pod for the VMI after the live migration).
+In the real world, we still see specific use cases where the flexibility to explicitly and directly define the target node for a live migration is a relevant nice-to-have:
+- Experienced admins are used to controlling where their critical workloads are moved to
+- Workload balancing solutions don't always work as expected
+- Troubleshooting a node
+- Validating a new node by migrating a specific VM there
+
+Such a capability is expected from traditional virtualization solutions but, with certain limitations, it is also pretty common across the most popular cloud providers (at least when using dedicated rather than shared nodes).
+- For instance, on Amazon EC2 the user can already live-migrate a `Dedicated Instance` from one `Dedicated Host` to another `Dedicated Host`, explicitly choosing it from the EC2 console, see: https://repost.aws/knowledge-center/migrate-dedicated-different-host
+- Also, on Google Cloud Platform Compute Engine the user can easily and directly live-migrate a VM from one `sole-tenancy` node to another via CLI or REST API, see: https://cloud.google.com/compute/docs/nodes/manually-live-migrate#gcloud
+
+On the technical side, something like this can already be achieved indirectly by playing with node labels and affinity. However, nodeSelector and affinity are defined as VM properties that are going to stay, while here we are focusing just on setting the desired target of a one-off migration attempt without any future side effect on the VM.
+The motivation is to better define the boundary between what is an absolute and long-lasting property of a VM (like affinity) and what is just an optional property of a single migration attempt.
+This could also be relevant in terms of personas: we could have a VM owner/developer who specifies a long-lasting affinity for a VM that is part of an application composed of different VMs and pods, and a cluster admin/operator who needs to temporarily override that for maintenance reasons.
+On the other hand, the VM owner is not required or supposed to be aware of node names.
+
+## Goals
+- A user allowed to trigger a live migration of a VM and to list the nodes in the cluster is able to rely on a simple and direct API to try to live migrate a VM to a specific node.
+- The explicit migration target overrules a nodeSelector or affinity and anti-affinity rules defined by the VM owner (still debated, see the `How to propagate the named node to the target virt-launcher pod` section).
+- The live migration can then complete successfully or fail for various reasons, exactly as it can succeed or fail today for other reasons.
+- The target node that is explicitly requested for the current live migration attempt should not influence future live migrations or the placement in case the VM is restarted. For long-lasting placement, nodeSelectors or affinity/anti-affinity rules are the only way to go.
+
+## Non Goals
+- This proposal is not defining a custom scheduler plugin, nor suggesting to alter how the default k8s scheduler works with `nodeName`, `nodeSelector` and `affinity/anti-affinity` rules. See https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ for the relevant documentation.
+
+## Definition of Users
+- VM owner: the user who owns a VM in their namespace on a Kubernetes cluster with KubeVirt
+- Cluster-admin: the administrator of the cluster
+
+## User Stories
+- As a cluster admin I want to be able to try to live-migrate a VM to a specific node for various possible reasons such as:
+  - I just added a new powerful node to the cluster and I want to migrate a selected VM there without trying more than once according to scheduler decisions
+  - I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations
+  - Foreseeing a peak in application load (e.g. new product announcement), I'd like to balance my cluster in advance according to my expectations and not to current observations
+  - During a planned maintenance window, I'm planning to drain more than one node in sequence, so I want to be sure that the VM is going to land on a node that is not going to be drained in the near future (which would then require a second migration), and I'm not interested in cordoning that node for other pods as well
+  - I just added a new node and I want to validate it by trying to live migrate a specific VM there
+> [!NOTE]
+> Technically all of this can already be achieved by manipulating the node affinity rules on the VM object, but as a cluster admin I want to keep a clear boundary between what is a long-lasting setting for a VM, defined by the VM owner, and what is a single-shot requirement for a one-off migration
+- As a VM owner I don't want to see my VM object getting amended by another user just for maintenance reasons
+
+## Repos
+- https://github.com/kubevirt/kubevirt
+
+# Design
+## Proposed design
+We are going to add a new optional `nodeName` string field on the `VirtualMachineInstanceMigration` object.
+We are not going to alter by any means the `spec` stanza of the VM or the VMI objects, so future migrations or the node placement after a restart of the VM are not going to be affected by a `nodeName` set on a specific `VirtualMachineInstanceMigration` object.
+If the target pod fails to start, the `VirtualMachineInstanceMigration` object will be marked as failed, as can already happen today for other reasons.
+
+## How to propagate the named node to the target virt-launcher pod
+We have two alternative approaches to propagate it to the target pod. The choice determines what happens when the named node in the migration request conflicts with other constraints set by the VM owner on the VM.
+
+### A. directly setting `spec.nodeName` on the target virt-launcher pod (bypassing the k8s scheduler)
+If the `nodeName` field is not empty, the migration controller will explicitly set `nodeName` on the virt-launcher pod that is going to be used as the target endpoint for the live migration.
+In that case the k8s scheduler will ignore the Pod that is going to be used as the target for the migration, and the kubelet on the named node will directly try to place the Pod on that node.
+
+Normally, when a pod is going to be executed, the scheduler checks it and, according to available cluster resources, nodeSelectors, weighted affinity and anti-affinity rules and so on,
+selects a node and writes its name to `spec.nodeName` on the pod object. At this point the kubelet on the named node will try to execute the Pod on that node.
+
+If `spec.nodeName` is already set on a pod object, as in this approach, the scheduler is not going to be involved in the process, since the pod is basically already scheduled for that node and only for that node. The kubelet on that node will directly try to execute it there, possibly failing.
+Using `spec.nodeName` overrules nodeSelector or affinity and anti-affinity rules defined on the VM.
+Taints with `NoSchedule` and `PreferNoSchedule` effects are also bypassed; taints with `NoExecute` effect are still effective if not explicitly tolerated.
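+
+As a minimal sketch of approach A, and not the final implementation, the target virt-launcher pod would be created with the node already filled in. The pod name below is a hypothetical placeholder, the node name is reused from the API example in this proposal, and all unrelated pod fields are omitted:
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  # hypothetical name, for illustration only
+  name: virt-launcher-vmi-fedora-target
+spec:
+  # assumed to be copied from the nodeName of the VirtualMachineInstanceMigration;
+  # with nodeName already set, the k8s scheduler skips this pod
+  nodeName: my-new-target-node
+```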
+
+#### pro
+- the kubelet on the named node will always try to execute a virt-launcher pod to be used as the target for the live migration, regardless of how the VM is actually configured by the VM owner
+- the cluster admin can always try to live migrate a VM to any node, regardless of how the VM is actually configured by the VM owner
+- the cluster admin is never required to amend affinity/anti-affinity rules set by the VM owner on the VM, which would break the boundary between a long-lasting configuration and a single live migration attempt
+
+#### cons
+- the cluster admin can easily bypass/break useful or even potentially critical affinity/anti-affinity rules set by the VM owner for application-specific needs (e.g. two VMs of an application-level cluster spread over two different nodes for HA reasons)
+- taints with `NoSchedule` and `PreferNoSchedule` effect are also going to be ignored, with potentially unexpected behaviour
+
+### B. appending/merging an additional `nodeAffinity` rule on the target virt-launcher pod (merging it with VM owner set affinity/anti-affinity rules)
+An additional affinity rule like:
+```yaml
+spec:
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+        - matchFields:
+          - key: metadata.name
+            operator: In
+            values:
+            -
+```
+is appended/merged to the list of affinity/anti-affinity rules.
+
+The additional affinity rule will cause the target virt-launcher pod to be schedulable only on the named node; if other pre-existing rules prevent that, the target pod will not be schedulable and the migration will fail.
+
+#### pro
+- user-set affinity/anti-affinity rules are still enforced
+- additional constraints like `NoSchedule` taints are still enforced if not explicitly tolerated at VM level.
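+
+One detail matters for the first point, given here as a sketch under stated assumptions rather than as the final implementation: in Kubernetes, the entries of `nodeSelectorTerms` are ORed, while `matchExpressions` and `matchFields` inside a single term are ANDed. For the rules set by the VM owner to stay enforced, the `metadata.name` requirement would therefore have to be added to every existing term instead of being appended as a new term. The `topology.kubernetes.io/zone` rule and its value below are hypothetical and stand in for a rule set by the VM owner; the node name is reused from the API example in this proposal:
+```yaml
+spec:
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+        # hypothetical pre-existing rule set by the VM owner
+        - matchExpressions:
+          - key: topology.kubernetes.io/zone
+            operator: In
+            values:
+            - zone-a
+          # added for this migration only; ANDed with the rule above
+          matchFields:
+          - key: metadata.name
+            operator: In
+            values:
+            - my-new-target-node
+```
+With this shape, the pod is schedulable only if the named node also satisfies the rule of the VM owner; otherwise the target pod stays unschedulable and the migration fails, as described above.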
+
+#### cons
+- if at least one of the pre-existing scheduling constraints set by the VM owner on the VM prevents it from being scheduled on the named node, the only remaining option for the cluster admin to live migrate it to the named node is still to directly amend the conflicting rule on the VM object.
+
+## Alternative design
+One of the main reasons behind this proposal is to improve the UX, making it simpler and better defining the boundary between what is a long-term placement requirement and what should simply be tried for this specific migration attempt.
+According to:
+https://kubevirt.io/user-guide/compute/node_assignment/#live-update
+changes to a VM's node selector or affinities for a VM with LiveUpdate rollout strategy are now dynamically propagated to the VMI.
+
+This means that, only for VMs with LiveUpdate rollout strategy, we can already force the target for a live migration with something like:
+- set a (temporary?) nodeSelector/affinity on the VM
+- wait for it to be propagated to the VMI due to LiveUpdate rollout strategy
+- trigger a live migration with existing APIs (no need for any code change)
+- wait for the migration to complete
+- (finally) remove the (temporary?) nodeSelector to let the VM be freely migrated to any node in the future
+
+Such a flow can already be implemented today with a pipeline or directly from a client like `virtctl` without any backend change.
+The drawback of that strategy is that we would have to tolerate having the spec of the VM amended twice, with an unclear boundary between what was asked by the VM owner for long-lasting application-specific reasons and what is required by a maintenance operator just for a specific migration attempt.
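+
+For illustration only, the first step of the flow above could look like the following fragment of the VM object. The VM name is a hypothetical placeholder, and the fragment assumes that the well-known `kubernetes.io/hostname` node label carries the node name, which is usually but not necessarily the case:
+```yaml
+apiVersion: kubevirt.io/v1
+kind: VirtualMachine
+metadata:
+  # hypothetical VM name
+  name: vm-fedora
+spec:
+  template:
+    spec:
+      # temporary constraint, to be removed again once the migration completes
+      nodeSelector:
+        kubernetes.io/hostname: my-new-target-node
+```
+The migration itself would then be triggered with a regular `VirtualMachineInstanceMigration` that carries no target node at all.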
+
+## API Examples
+```yaml
+apiVersion: kubevirt.io/v1
+kind: VirtualMachineInstanceMigration
+metadata:
+  name: migration-job
+spec:
+  vmiName: vmi-fedora
+  nodeName: my-new-target-node
+```
+
+## Scalability
+Forcing a `nodeName` on `VirtualMachineInstanceMigration` will cause it to be propagated to the destination virt-launcher pod. Having a `nodeName` on a pod will bypass the k8s scheduler, and this could potentially lead to an unbalanced placement across the nodes.
+But the same result can already be achieved today by specifying a `nodeSelector` or `affinity` and `anti-affinity` rules on a VM.
+
+## Update/Rollback Compatibility
+`nodeName` on `VirtualMachineInstanceMigration` will only be an optional field, so there is no impact in terms of update compatibility.
+
+## Functional Testing Approach
+- positive test 1: a VirtualMachineInstanceMigration with an explicit nodeName pointing to a node able to accommodate the VM should succeed
+- positive test 2: a VirtualMachineInstanceMigration with an explicit nodeName pointing to a node able to accommodate the VM but not matching a nodeSelector already present on the VM should succeed
+- negative test 1: a VirtualMachineInstanceMigration with an explicit nodeName should be refused if the required node doesn't exist
+- negative test 2: a VirtualMachineInstanceMigration with an explicit nodeName should be refused if the VM is already running on the requested node
+- negative test 3: a VirtualMachineInstanceMigration with an explicit nodeName should be refused if the user is not allowed to list nodes in the cluster
+- negative test 4: a VirtualMachineInstanceMigration with an explicit nodeName should fail if the selected target node is not able to accommodate the additional pod for virt-launcher
+
+# Implementation Phases
+A very similar attempt was already made in the past with https://github.com/kubevirt/kubevirt/pull/10712 but the PR got some pushback because it was not clear why a new API for one-off migration is needed.
+Here we give a better explanation of why this one-off migration destination request is necessary.
+Once this design proposal is agreed on, a similar PR should be reopened and refined, and functional tests should be implemented.