network-binding-plugin: add plugin for vhostuser interfaces. #294
# Overview
`vhostuser` interfaces are supported by QEMU but not implemented in KubeVirt. The Network Binding Plugin mechanism is a good framework to add support for `vhostuser` interfaces to KubeVirt.

## Motivation
`vhostuser` interfaces are required to attach VMs to a userspace dataplane such as OVS-DPDK or VPP, and to achieve a fast datapath from the VM to the physical NIC.
This is a mandatory feature for networking VMs such as vRouter, IPsec gateway, firewall or SD-WAN VNFs, which usually bind their network interfaces with DPDK. The performance expected with DPDK can only be met if the whole datapath stays in userspace and does not go through kernel interfaces, as it does with the usual bridge interfaces.

## Goals
Be able to add `vhostuser` secondary interfaces to the VM definition in KubeVirt.

## Non Goals
The configuration of the `vhostuser` secondary interfaces in the dataplane is the responsibility of Multus and of the CNI, such as the `userspace CNI`.

## Definition of Users
Users of the feature are anyone who deploys a VM.

## User Stories
- As a user, I want to create a VM with one or several `vhostuser` interfaces attached to a userspace dataplane.
- As a user, I want the `vhostuser` interface to be configured with a specific MAC address.
- As a user, I want to enable multi-queue on the `vhostuser` interface.
- As a Network Binding Plugin developer, I want the shared socket path to be accessible to the virt-launcher pod.
- As a CNI developer, I want to access the shared vhostuser sockets from the Multus pod.

## Repos
The KubeVirt repository ([kubevirt/kubevirt](https://github.com/kubevirt/kubevirt)), and more specifically `cmd/sidecars`.

## Design
This proposal leverages the KubeVirt Network Binding Plugin sidecar framework to implement a new `network-vhostuser-binding`. The plugin is designed to support secondary networks only.

The role of `network-vhostuser-binding` is to modify the domain XML according to the VMI definition passed to its gRPC service by the `virt-launcher` pod.

`vhostuser` interfaces are defined in the VMI under `spec/domain/devices/interfaces` using the binding name `vhostuser`:
```yaml
spec:
  domain:
    devices:
      networkInterfaceMultiqueue: true
      interfaces:
      - name: default
        masquerade: {}
      - name: net1
        binding:
          name: vhostuser
        macAddress: ca:fe:ca:fe:42:42
```
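
For completeness, the interfaces above need matching entries under `spec.networks`. A minimal sketch, assuming the secondary network is delivered through a Multus `NetworkAttachmentDefinition` named `vhostuser-net1` (hypothetical name) handled by the userspace CNI:

```yaml
spec:
  networks:
  - name: default
    pod: {}
  - name: net1
    multus:
      networkName: vhostuser-net1  # hypothetical NetworkAttachmentDefinition backed by the userspace CNI
```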

`network-vhostuser-binding` translates the VMI definition into libvirt domain XML modifications:
1. Create a new interface with `type='vhostuser'`
2. Set the MAC address if specified in the VMI spec
3. If `networkInterfaceMultiqueue` is set to `true`, add the number of queues, calculated from the number of cores of the VMI
4. Add `memAccess='shared'` to all NUMA cell elements
5. Define the device name according to the KubeVirt naming schema
6. Define the `vhostuser` socket path

> **Reviewer:** It is worth adding a note that the change needs to be idempotent: the hook is called multiple times and the result needs to be consistent between the first and the following calls. Do not assume how many times it is called; in a flow with failures, reconciliation may cause it to be called several times, and in the future other hook points may trigger it. Also, please share how the `memAccess='shared'` marking will influence the VM.

Below is an example of the modified domain XML:

```xml
<cpu mode="host-model">
  <topology sockets="2" cores="8" threads="1"></topology>
  <numa>
    <cell id="0" cpus="0-7" memory="2097152" unit="KiB" memAccess="shared"/>
    <cell id="1" cpus="8-15" memory="2097152" unit="KiB" memAccess="shared"/>
  </numa>
</cpu>
<interface type='vhostuser'>
  <source type='unix' path='/var/run/kubevirt/sockets/poda08a0fcbdea' mode='server'/>
  <target dev='poda08a0fcbdea'/>
  <model type='virtio-non-transitional'/>
  <mac address='ca:fe:ca:fe:42:42'/>
  <driver name='vhost' queues='8' rx_queue_size='1024' tx_queue_size='1024'/>
  <alias name='ua-net1'/>
</interface>
```

> **Reviewer:** Is the socket created by the CNI inside the virt-launcher pod? If so, the socket path can be reflected in the Multus `network-status` annotation under the `device-info` element, see the [device-info spec](https://github.com/k8snetworkplumbingwg/device-info-spec/blob/main/SPEC.md#315-vhost-user). The vhost-user device-info can then be exposed using the network binding plugin API's downwardAPI (see the vDPA example for reference on how the downwardAPI can be used).
> **Author:** Thank you for the hints, the new network binding plugin downwardAPI will indeed be helpful!
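
To illustrate the reviewer's suggestion, here is a sketch (not part of the current design) of what the Multus `network-status` annotation carrying vhost-user `device-info` could look like on the virt-launcher pod. The network name is hypothetical, and the exact content would be produced by the CNI and Multus:

```yaml
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |
      [{
        "name": "default/vhostuser-net1",
        "interface": "net1",
        "device-info": {
          "type": "vhost-user",
          "version": "1.1.0",
          "vhost-user": {
            "mode": "server",
            "path": "/var/run/kubevirt/sockets/poda08a0fcbdea"
          }
        }
      }]
```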

This design leverages the existing `sockets` emptyDir mounted in `/var/run/kubevirt/sockets`. This allows the CNI to bind mount the socket emptyDir (`/var/lib/kubelet/pods/<pod uid>/volumes/kubernetes.io~empty-dir/sockets`) to a host directory available to the dataplane pod through a hostPath mount.

> **Review discussion:** The CNI indeed needs access to the API server. Multus 3 allowed that through its own `kubeconfig.multus`; with Multus 4 in thick mode this is no longer possible, and a CNI requiring API server access needs to handle its own credentials creation. Sharing the socket directory works as far as the filesystem permissions and SELinux allow it. The `vhostuser` sockets are created in server mode and consumed in client mode: usually QEMU is the server and the dataplane the client (the other way around is being deprecated in OVS-DPDK, for example). If the mount path is known in advance, alternatives can be found.

However, this assumes that:
- the `sockets` emptyDir can be used for such a purpose
- the CNI has access to the `sockets` emptyDir of the `virt-launcher` pod and can bind mount it to a path available to the dataplane. This can be tricky, especially with Multus 4 in thick plugin mode, where the CNI is executed by the Multus pod. The Multus thick plugin daemonset usually defines a `/hostroot` hostPath volume mount with the `mountPropagation: HostToContainer` option; a `mountPropagation: Bidirectional` option is needed for the bind mount to be propagated back to the host and to the dataplane pod (see the sketch after this list)
- the dataplane pod has the privilege to mount a hostPath
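
As a minimal sketch of the second assumption (an illustrative excerpt of a Multus thick plugin daemonset, not the actual upstream manifest), the host root mount would need bidirectional propagation for a bind mount done by the CNI to reach the host and the dataplane pod:

```yaml
# excerpt from a Multus thick plugin daemonset (illustrative)
containers:
- name: kube-multus
  volumeMounts:
  - name: hostroot
    mountPath: /hostroot
    mountPropagation: Bidirectional  # the usual default is HostToContainer
volumes:
- name: hostroot
  hostPath:
    path: /
```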

Here is a diagram showing the socket sharing mechanisms between the `virt-launcher` pod, the `userspace CNI` and the dataplane pod.

![kubevirt-vhostuser-shared-sockets](kubevirt-vhostuser-shared-sockets.png)

Sharing the `vhostuser` sockets between the `virt-launcher` pods and the dataplane pod is something to be enhanced, in order to limit the usage of hostPath volumes and bind mounts.

## Alternative designs

Some alternative designs were discussed on the [kubevirt-dev mailing list](https://groups.google.com/g/kubevirt-dev/c/3w_WStrJfZw/m/yWSBpDAKAQAJ).

> **Reviewer:** Please summarize them in an appendix so we can see the full picture in one place.

### Expose a virt-launcher pod directory to binding plugin

This requires implementing a new network binding plugin mechanism through which the content of a `virt-launcher` directory could be exposed to an external plugin.
The plugin registration in the KubeVirt resource would define the target directory on the node where the directory should be exposed.
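
Purely as an illustration of this alternative (the `exposedDirectory` field below does not exist in KubeVirt today; its name is invented for this sketch), the registration could carry the exposed directory alongside the existing `sidecarImage` field:

```yaml
spec:
  configuration:
    network:
      binding:
        vhostuser:
          sidecarImage: registry.example.org/network-vhostuser-binding:latest  # placeholder image
          # hypothetical new parameter proposed by this alternative:
          # virt-launcher pod directory to expose on the node
          exposedDirectory: /var/run/kubevirt/sockets
```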

This diagram explains the mechanism.

![kubevirt-plugin-extension](kubevirt-plugin-extension.png)

> **Review discussion:** The respective roles of virt-launcher and virt-handler need to be clarified when it comes to network binding plugins; the exposed directory would be a new network binding plugin spec parameter. The CNI needs to know where the socket is located in order to configure the dataplane with it; in the current userspace CNI implementation, the CNI does a bind mount of the directory where the socket will be created into the dataplane filesystem namespace.

The advantages of such an approach are:
- it is generic and could potentially be reused by other device types
- it hides the KubeVirt implementation details: currently one needs to know where the KubeVirt sockets are located in the virt-launcher filesystem, and changing the socket directory path would break the CNI plugin
- it can isolate the resources dedicated to that particular plugin

The drawback is that the `virt-launcher` pod would need to run with `privileged: true` in order to do the bind mount.

> **Reviewer:** I don't think this is true: the directory can be exposed by virt-handler, hence no need for virt-launcher to be privileged. It would also be unfeasible, since virt-launcher is untrusted.
> **Author:** I don't see very well the interaction between virt-handler, virt-launcher and the exposed directories; this needs to be checked.

### Device plugin for `vhostuser sockets` resource

Device plugins have the ability to add hostPath mounts to pods when they request managed resources.
We could implement a vhostuser device plugin that would manage two kinds of resources:
- `dataplane`: 1
  This single resource is requested by the userspace dataplane, and adds a `/var/run/vhost_sockets` mount to the dataplane pod.
- `vhostuser sockets`: n
  These resources, as many as we want to handle, are requested by the `virt-launcher` pod using the vhostuser plugin. This makes the device plugin create a per-pod directory like `/var/run/vhost_sockets/<launcher-id>` and mount it into the `virt-launcher` pod.

> **Reviewer:** Have you tried this? I still worry that the directory created by the device plugin will still have the original issue with the `HostToContainer` mount propagation.
> **Author:** The only bind mount needed is the one the device plugin will push through kubelet. As there will be no further bind mount inside the pods, and as it will not be necessary for the CNI to do one either, there should be no issue with mount propagation.
> **Reviewer:** Still, I'm not sure how kubelet mounts the directory inside the pod. It might be done with `HostToContainer`, and the socket won't be visible in the host directory. I would really like to check it with real code.
> **Author:** I understand your doubts ;) So I tested the scenario with generic-device-plugin, and the following resources definition:
>
> ```yaml
> - --domain
> - dataplane.io
> - --device
> - |
>   name: dataplane
>   groups:
>     - count: 1
>       paths:
>         - path: /var/lib/sockets
>           mountPath: /var/lib/sockets
>           type: Mount
>           permissions: mrw
> - --device
> - |
>   name: sockets
>   groups:
>     - paths:
>         - path: /var/lib/sockets/pod*
>           mountPath: /var/lib/socket
>           type: Mount
>           permissions: mrw
> ```
>
> I tested with unprivileged pods only. By the way, I checked the propagation option of the target mount, it's the default `private`:
>
> ```
> / # findmnt -o TARGET,PROPAGATION /var/lib/one-socket
> TARGET              PROPAGATION
> /var/lib/one-socket private
> ```
>
> But as far as we don't create a new mount in that target, there is no propagation issue.
> **Reviewer:** Is the pod ID known by the CNI plugin or outside KubeVirt?
> **Author:** The device plugin can push an annotation to the pod with an …
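
For illustration only (the resource names follow the `dataplane.io` domain used in the test above and are not an existing convention), this is how pods would request the two device plugin resources, which generic-device-plugin advertises as `<domain>/<name>` extended resources:

```yaml
# dataplane pod (e.g. OVS-DPDK): requests the single "dataplane" resource,
# which mounts the shared socket directory into the pod
resources:
  limits:
    dataplane.io/dataplane: "1"
---
# virt-launcher pod: requests one "sockets" resource,
# which mounts a per-pod socket directory such as /var/lib/sockets/pod<id>
resources:
  limits:
    dataplane.io/sockets: "1"
```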

This solution allows both the dataplane and the VM pods to share the vhostuser sockets without requiring them to be privileged, and without the need for the CNI to do bind mounts, thus avoiding the constraints on the Multus mountPropagation option.

We still have to take care of the directory and socket permissions (and SELinux categories?).

## API Examples

> **Reviewer:** Please still add the examples here; it helps with the reviews and once this proposal is merged.
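
The main user-facing API is the VMI interface definition with `binding: {name: vhostuser}` shown in the Design section. In addition, here is a sketch of how the binding could be registered in the KubeVirt custom resource, assuming the current network binding plugin registration API; the image reference is a placeholder, and the `downwardAPI: device-info` line only applies if the device-info approach discussed above is adopted:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
      - NetworkBindingPlugins
    network:
      binding:
        vhostuser:
          sidecarImage: registry.example.org/network-vhostuser-binding:latest  # placeholder image
          downwardAPI: device-info  # optional, see the downwardAPI discussion above
```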

## Scalability
(overview of how the design scales)

## Update/Rollback Compatibility
The KubeVirt Network Binding Plugin relies on the `hooks/v1alpha3` API for a clean termination of the `network-vhostuser-binding` container in the virt-launcher pod.

## Functional Testing Approach
Create a VM with several `vhostuser` interfaces, then:
- check that the generated domain XML contains all the interfaces with the appropriate configuration
- check that the vhostuser sockets are created in the expected directory of the virt-launcher pod
- check that the vhostuser sockets are available to the dataplane pod
- check that the VM is running

# Implementation Phases
1. First implementation done
2. Iterate on design issues regarding socket sharing
3. Upstream `network-vhostuser-binding`