This project fork from openshift/sriov-network-operator, we made some optimization to adapt rdma and kube-ovn's ovs offload function, and modified some bugs in the production environment
The Sriov Network Operator is designed to help the user to provision and configure SR-IOV nic and Device plugin in kubernetes cluster.
SR-IOV network is an optional feature of an kubernetes cluster. To make it work, it requires different components to be provisioned and configured accordingly. It makes sense to have one operator to coordinate those relevant components in one place, instead of having them managed by different operators. And also, to hide the complexity, we should provide an elegant user interface to simplify the process of enabling SR-IOV.
- Provision configurations in sriov-configmap for each node to claim the expected SR-IOV information about the node, such as the NIC name, NIC type, expected VF quantity
- Run sriov-netowork-operator on each node as daemonset to automatically configure SR-IOV information for that node according to configmap
- Enable IOMMU kernel parameters, and load the VFIO PCI kernel driver
- List/Watch sriov-configmap and load the configurations
- Generate/Execute configuration scripts to create resources, such as vf
- Evict Pod and Restart node to take effect the SR-IOV configurations
For kube-ovn offload more detail on installing this operator, refer to the quick-start guide.
The SR-IOV network operator introduces following new CRDs:
-
SriovNetwork
-
SriovNetworkNodeState
-
SriovNetworkNodePolicy
The custom resource to represent the SR-IOV interface states of each host, which should only be managed by the operator itself.
- The ‘spec’ of this CR represents the desired configuration which should be applied to the interfaces and SR-IOV device plugin.
- The ‘status’ contains current states of those PFs (baremetal only), and the states of the VFs. It helps user to discover SR-IOV network hardware on node, or attached VFs in the case of a virtual deployment.
The spec is rendered by sriov-policy-controller, and consumed by sriov-config-daemon. Sriov-config-daemon is responsible for updating the ‘status’ field to reflect the latest status, this information can be used as input to create SriovNetworkNodePolicy CR.
An example of SriovNetworkNodeState CR:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
name: node1
namespace: kube-system
spec:
interfaces:
- eSwitchMode: switchdev
name: ens41f0np0
numVfs: 3
pciAddress: 0000:5f:00.0
vfGroups:
- policyName: policy
vfRange: 0-2
resourceName: cx_sriov_switchdev
- eSwitchMode: switchdev
name: ens41f1np1
numVfs: 3
pciAddress: 0000:5f:00.1
vfGroups:
- policyName: policy
vfRange: 0-2
resourceName: cx_sriov_switchdev
status:
interfaces
- Vfs:
- deviceID: 1018
driver: mlx5_core
pciAddress: 0000:5f:00.2
vendor: "15b3"
- deviceID: 1018
driver: mlx5_core
pciAddress: 0000:5f:00.3
vendor: "15b3"
- deviceID: 1018
driver: mlx5_core
pciAddress: 0000:5f:00.4
vendor: "15b3"
deviceID: "1017"
driver: mlx5_core
linkSeed: 25000Mb/s
linkType: ETH
mac: 08:c0:eb:f4:85:ab
mtu: 1500
name: ens41f0np0
numVfs: 3
pciAddress: 0000:5f:00.0
totalvfs: 3
vendor: "15b3"
- Vfs:
- deviceID: 1018
driver: mlx5_core
pciAddress: 0000:5f:00.5
vendor: "15b3"
- deviceID: 1018
driver: mlx5_core
pciAddress: 0000:5f:00.6
vendor: "15b3"
- deviceID: 1018
driver: mlx5_core
pciAddress: 0000:5f:00.7
vendor: "15b3"
deviceID: "1017"
driver: mlx5_core
linkSeed: 25000Mb/s
linkType: ETH
mac: 08:c0:eb:f4:85:bb
mtu: 1500
name: ens41f1np1
numVfs: 3
pciAddress: 0000:5f:00.1
totalvfs: 3
vendor: "15b3"
From this example, in status field, the user can find out there are 2 SRIOV capable NICs on node 'node1'; in spec field, user can learn what the expected configure is generated from the combination of SriovNetworkNodePolicy CRs. In the virtual deployment case, a single VF will be associated with each device.
This CRD is the key of SR-IOV network operator. This custom resource should be managed by cluster admin, to instruct the operator to:
- Render the spec of SriovNetworkNodeState CR for selected node, to configure the SR-IOV interfaces. In virtual deployment, the VF interface is read-only.
- Generate the configuration of SR-IOV device plugin.
An example of SriovNetworkNodePolicy CR:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: policy
namespace: kube-system
spec:
nodeSelector:
feature.node.kubernetes.io/network-sriov.capable: "true"
eSwitchMode: switchdev
numVfs: 3
nicSelector:
pfNames:
- ens41f0np0
- ens41f1np1
resourceName: cx_sriov_switchdev
In this example, user selected the nic ens41f0np0 and ens41f1np1, on nodes labeled with 'network-sriov.capable' equals 'true'. Then for those PFs, load 3 VFs each to those virtual functions.
In a virtual deployment:
- The mtu of the PF is set by the underlying virtualization platform and cannot be changed by the sriov-network-operator.
- The numVfs parameter has no effect as there is always 1 VF
- The deviceType field depends upon whether the underlying device/driver is native-bifurcating or non-bifurcating For example, the supported Mellanox devices support native-bifurcating drivers and therefore deviceType should be netdevice (default).
When multiple SriovNetworkNodeConfigPolicy CRs are present, the priority
field
(0 is the highest priority) is used to resolve any conflicts. Conflicts occur
only when same PF is referenced by multiple policies. The final desired
configuration is saved in SriovNetworkNodeState.spec.interfaces
.
Policies processing order is based on priority (lowest first), followed by name
field (starting from a
). Policies with same priority or non-overlapping
VF groups (when #-notation is used in pfName field) are merged, otherwise only
the highest priority policy is applied. In case of same-priority policies and
overlapping VF groups, only the last processed policy is applied.
This operator is split into 2 components:
- controller
- sriov-config-daemon
The controller is responsible for:
- Read the SriovNetworkNodePolicy CRs and SriovNetwork CRs as input.
- Render the spec of SriovNetworkNodeState CR for each node.
The sriov-config-daemon is responsible for:
- Discover the SRIOV NICs on each node, then sync the status of SriovNetworkNodeState CR.
- Take the spec of SriovNetworkNodeState CR as input to configure those NICs.