KEP-4962: Standardizing the Representation of Cluster Switch Network Topology #4965
Conversation
Welcome @dmitsh!
Hi @dmitsh. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc @aojea
@dmitsh: GitHub didn't allow me to request PR reviews from the following users: tardieu, arsenetar, brianhammons. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>`
- `<QoS>`: A JSON object where each key is a switch name (matching the network topology label) with a value containing:
So this object contains N items (of below structure), where N is the number of predefined topology units (accelerator, block, datacenter, zone), right?
What I want to ensure is whether those need to be changed/updated when the cluster grows/shrinks (i.e., that they don't define distances between nodes themselves). But given that for a given node its contents don't depend on other nodes (rather on placement in the physical network), this seems to be fine.
@wojtek-t , that's correct, each node will contain QoS metrics between the node and every reachable switch.
/assign @johnbelamaric for sig-architecture
- `<switch-name>`: Unique identifier for the switch

### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>`
This is not really the switch, right? It is the interface in the node that connects to the switch ... and we already have properties to define attributes on the network interfaces with DRA, see slide 14 https://docs.google.com/presentation/d/1Vdr7BhbYXeWjwmLjGmqnUkvJr_eOUdU0x-JxfXWxUT8/edit#slide=id.g2f750386db2_5_0 so I don't feel we need this additional annotation here.
It is not a NIC on the node. These are QoS metrics from the node to every reachable switch. Also, the "switch" in this context could be a physical network device, or an aggregated entity defined by a CSP. For example, AWS returns 3 levels of switches per node, but the actual number of physical switches is unknown.
> These are QoS metrics from the node to every reachable switch

How is the Node connected to the first switch, then? :)
```yaml
network.qos.kubernetes.io/switches: {
  "nvl10": {
    "latency": "2us",
    "bandwidth": "100Gbps"
  },
  "sw11": {
    "latency": "50us",
    "bandwidth": "40Gbps"
  },
  "sw21": {
    "latency": "500us",
    "bandwidth": "20Gbps"
  },
  "sw31": {
    "latency": "1ms",
    "bandwidth": "10Gbps"
  }
}
```
These are network interfaces on the node; I think we should better model them with DRA (https://github.com/kubernetes/enhancements/pull/4965/files#r1846865095), which also allows us to provide dynamic capabilities to the interfaces.
Again, as mentioned in the earlier comment, these QoS numbers represent node-to-reachable-switch metrics. They are not per NIC.
Does it mean that some switches may not be connected directly?
How then are the latency and bandwidth obtained to guarantee those values?
Format: `network.topology.kubernetes.io/<nw-switch-type>: <switch-name>`
- `<nw-switch-type>`: Logical type of the network switch (can be one of the reserved names or a custom name)
- Reserved names: `accelerator`, `block`, `datacenter`, `zone`
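For illustration, a node labeled with all four reserved types might look like the sketch below. The switch names reuse identifiers from examples elsewhere in this proposal; pairing `accelerator` with `nvl10` is an assumption made here purely for illustration.

```yaml
# Hypothetical node labels using the reserved network types
network.topology.kubernetes.io/accelerator: nvl10
network.topology.kubernetes.io/block: sw13
network.topology.kubernetes.io/datacenter: sw22
network.topology.kubernetes.io/zone: sw31
```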
This is the part where we need to loop in SIG Architecture. I briefly touched on this with @thockin, and it seems it took some time to settle on region/zone.
So my understanding is that we need to model a hierarchy. This KEP suggests using nested structures: zone > datacenter > block > accelerator, but we should at least describe in the alternatives why weights are not better than this. It seems to me that weights are easier to standardize across different topologies, as you only focus on the distance provided between layers, and they can support different architectures with multiple layers.
This KEP proposes using reserved network types for typical network architectures, while allowing the network topology to be extended using custom network types.
We are providing means for a weighted approach by specifying distance and/or bandwidth, latency, or other metrics.
These are actual measurable physical characteristics of the network, and will be more accurate than specifying static weights.
Once again, we are providing QoS between a node and a switch, so the distance is the number of hops between the node and the switch. The same goes for bandwidth/latency.
This proposal is designed with extensibility in mind, enabling the use of custom network types. This ensures that the standard can adapt to future advancements in cluster networking without requiring significant overhauls.

For custom network types, Network QoS Annotations are required, with distance being the minimum mandatory metric. Specifying latency and bandwidth is optional, but including them can offer a more detailed view of link performance, enabling more efficient scheduling decisions. |
IIUIC, the use of custom network types means that environments that use them may not be compatible with other environments, which practically removes all the benefits of standardization; another area where I think a weighted/distance model can help better.
this was already addressed.
The same network topology depicted in Example 2 can be represented using custom network types.

Let's use `tor` for top-of-rack switches, `area` for the second level of switches, and `center` for the third level. |
After seeing this example I'm definitely not in favor of custom types, as it is impossible for a generic tool to infer the distance between these custom types ... it will also create fragmentation and incompatibility, as multiple tools can define the same name with different meanings ... the example also talks about levels, which reinforces my idea of weights, so something like `network.topology.kubernetes.io/tier: 1`.
In this example, `network.topology.kubernetes.io/tor: sw13` will be:
- Node:

```yaml
network.topology.kubernetes.io/tier: 1
```

- ResourceSlice:

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
…
spec:
  devices:
  - basic:
      attributes:
        tier:
          int: 1
        type:
          string: nvlink
        ip:
          string: 10.1.1.1/24
        latency:
          string: "50us"
        bandwidth:
          string: "100Gbps"
```
When using custom network types, it is mandatory to define distance in the QoS annotation.
We explicitly expressed that in the KEP.
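To make that concrete, here is a minimal sketch (not taken from the KEP text) of the `tor`/`area`/`center` example combined with the mandatory `distance` metric; the switch names and hop counts are illustrative assumptions.

```yaml
# Hypothetical node labels using custom network types
network.topology.kubernetes.io/tor: sw13
network.topology.kubernetes.io/area: sw22
network.topology.kubernetes.io/center: sw31

# Matching QoS annotation: "distance" (hop count) is mandatory for custom types,
# while latency and bandwidth remain optional
network.qos.kubernetes.io/switches: {
  "sw13": { "distance": 1 },
  "sw22": { "distance": 2 },
  "sw31": { "distance": 3 }
}
```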
- Gang-scheduling auto-scaler
- DRA scheduler plugin

### Goals
Are there existing topology formats or consumers that we want Kubernetes to integrate with? If so, are these integrations goals or non-goals?
To the best of my knowledge, only EKS exposes their custom node labels for network layers.
The goal of this KEP is to create a standard way of describing switch network topology.
Ultimately, we want to use this standard in the development of a Kubernetes-native, network-aware scheduler plugin for multi-node workloads. We listed that task as a non-goal, as it would be an independent effort.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: dmitsh. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
I'm afraid that if I encountered this in my professional work (cloud consultancy), I'd recommend blocking the change.
Overall, I propose adding these annotations and other details to a new external project, using Kubernetes extensibility mechanisms.
If it's a really useful project, build it and donate it to the CNCF. We like working with other CNCF projects; it can be really positive to do that.
I don't see the case for the annotations, labels, and conventions being part of Kubernetes and maintained as part of Kubernetes.
```yaml
network.qos.kubernetes.io/switches: {
  "nvl10": {
    "latency": "2us",
```
Not keen on "us" here. Either use quantities, or actually write μs, or use integer nanoseconds, or something.
"us" is a term that is hard for people to recognize and easy to confuse (it can be mistaken for the English word "us").
Agree. This is mostly for an example of using arbitrary QoS metrics.
network.topology.kubernetes.io/block: sw13
network.topology.kubernetes.io/datacenter: sw22
network.topology.kubernetes.io/zone: sw31 |
Should these be FQDNs maybe?
The switch ID could be a network abstraction, as in the case of AWS.
What is currently missing is a common and standard way to convey the switch network topology information to the Kubernetes environment.

Currently, there is no standardized way to represent this information in Kubernetes, making it challenging to develop control plane components and applications that can leverage network topology awareness. |
OK, but: why does this standardised way need to be part of Kubernetes itself? You need to explain that as well.
A viable alternative could be making something like NVIDIA topograph but vendor neutral, that isn't part of Kubernetes itself.
I just updated the "Motivation" part. Hope it will bring more clarity.
The end-goal is to use these labels in Kubernetes-native scheduler plugins.
Hence the proposal to make them consistent across CSPs and on-prem.
OK, so that needs to go into the KEP text. Without that detail, the motivation isn't obvious.
(I reread it, and that detail is there, but it wasn't obvious to me that the key motivation was to enable the scheduling work). Maybe it's worth capturing that a bit more. I agree that Kubernetes' own scheduler should be cautious about depending on the meaning of external or vendor-specific labels.
Signed-off-by: Dmitry Shmulevich <[email protected]>
Currently, Kubernetes lacks a unified standard for representing network topology. This gap creates challenges for developing control plane components and applications that could leverage topology-aware features.

For example, AWS has begun addressing this by introducing `topology.k8s.aws/network-node-layer-N` node labels to represent its 3-tier networking structure. However, this solution is cloud-specific and does not address broader use cases.

In this KEP, we propose establishing a standardized representation of switch network topology within Kubernetes clusters. |
The motivation here could also serve for a vendor-neutral alternative, not part of Kubernetes. For example, you could be proposing that https://opennetworking.org/ be the umbrella organization, not Kubernetes.
The purpose of label standardization is to serve as a foundational step toward developing a network-aware gang scheduler for Kubernetes. By establishing a consistent way to represent cluster network topology, this initiative lays the groundwork for advanced scheduling capabilities that take network performance into account. Since this work directly enhances Kubernetes functionality, we submitted it specifically as a KEP (Kubernetes Enhancement Proposal).
Let me write my thoughts on this: I see value in a predefined set of labels that describe topology, similar to what we have with [1]. It can be other names, but having a number of predefined levels standard across environments will benefit the projects that implement scheduling based on network topology; I think 4 levels should be OK for most environments. And now the friction points:
OK, but responding to #4965 (comment): if the labels (and annotations?) were inside Kubernetes, what benefit would that bring? If we can't articulate that, we should have pause. Maybe some other project should define the standard (not Kubernetes).
<!--
Why should this KEP _not_ be implemented?
-->
How will we describe multihomed nodes?
For example, a node directly connects to two neighbours, and to two top of rack switches. Pod to pod communication can use either best-effort routing or computed paths with source route metadata added to packets on egress.
Kubernetes doesn't say you can't do that, but this standard implies a hierarchical network layout. Are we tacitly backing a particular network management paradigm?
<!--
Why should this KEP _not_ be implemented?
-->
This proposal assumes that the network topology is hierarchical. |
This is not accurate.
You can easily describe a node connected to multiple switches.
A simple example, where we assume that the performance of the switches is comparable:

Labels:
```yaml
network.topology.kubernetes.io/level_a: sw01
network.topology.kubernetes.io/level_b: sw02
```

Annotations:
```yaml
network.qos.kubernetes.io/switches: {
  "sw01": {
    "distance": 1
  },
  "sw02": {
    "distance": 1
  }
}
```

An example where we provide performance metrics:

Labels:
```yaml
network.topology.kubernetes.io/level_a: sw01
network.topology.kubernetes.io/level_b: sw02
```

Annotations:
```yaml
network.qos.kubernetes.io/switches: {
  "sw01": {
    "distance": 1,
    "bandwidth": "40Gbps"
  },
  "sw02": {
    "distance": 1,
    "bandwidth": "60Gbps"
  }
}
```
I do see value in the existing topology labels, so I think that aiming for something similar for the new type of workloads that require a more granular definition of network topology makes sense ... I really think we should work on what the new labels should be.
Following up #4965 (comment), I still have concerns; I'll explain more. We can encourage cloud providers to define their own labels as well. I agree about the value. So this KEP wouldn't persuade providers and/or hardware vendors to define their own parallel, provider-specific labels - but encouraging that kind of guidance would bring something on top of a common denominator of labels that Kubernetes does define. I remain worried about tacitly promoting one network paradigm (a switching hierarchy), especially for a project that might be around another 10 years. This KEP should articulate its alternatives and make clear why we picked the approach we select.
Agree, but I see this as being about network hierarchy; a switching hierarchy is just one implementation of a network hierarchy. Level 1: there are things that are closest; this can be a switch, a rack, or the PCI bus, and the implementer decides this.
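One hypothetical shape of that idea (purely an illustration of the "levels" concept, not something proposed in this thread; the label keys and values are invented): each node carries one label per level, where the level number alone conveys relative closeness and the implementer decides what each level physically maps to.

```yaml
# Hypothetical level-based labels; level 1 is whatever is "closest"
# (an NVLink domain, a rack, a PCI bus), as decided by the implementer
network.topology.kubernetes.io/level-1: nvl10
network.topology.kubernetes.io/level-2: sw13
network.topology.kubernetes.io/level-3: sw22
```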
I guess as an actual alternative, we could define an API for representing network topology. For example, custom resources:
Maybe also:
Remember, alternatives in KEP feedback don't have to be the choice we select. They can instead be viable options we rule out but still list as alternatives we considered.
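As a purely hypothetical sketch of that alternative (the API group, kind, and fields below are invented for illustration and are not part of this KEP or any existing API), a cluster-scoped custom resource describing the switch hierarchy could look roughly like this:

```yaml
apiVersion: networktopology.example.io/v1alpha1  # hypothetical API group
kind: NetworkTopology
metadata:
  name: cluster-switch-topology
spec:
  switches:
  - name: sw31
    type: zone
  - name: sw22
    type: datacenter
    parent: sw31
  - name: sw13
    type: block
    parent: sw22
  nodes:
  - name: node-1
    nearestSwitch: sw13   # full path is derived by following parent links
```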
/ok-to-test
@dmitsh: The following tests failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
I want to stress that we are not advocating for or favoring any specific network hierarchy. On the contrary, our goal is to make the system flexible and adaptable to any type of network. To achieve this, we propose using reserved types for common configurations, enabling simplicity and brevity. Additionally, we recommend supporting custom types as a more generic solution, where key details like the number of hops (as a minimum input) and additional performance metrics can be specified. This information will be crucial for optimally scheduling multi-node training jobs or sets of interdependent, data-intensive services. It ensures that network-aware decisions can be made to enhance performance and efficiency.
Standardizing Cluster Network Topology Representation