-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dev port disaggregation #320
Conversation
linked document is not accessible to public |
@parthpower I have shared the document. |
flow/metrics.yaml
Outdated
@@ -30,6 +37,12 @@ components: | |||
Latency metrics. | |||
$ref: '#/components/schemas/Flow.Latency.Metrics' | |||
x-field-uid: 4 | |||
rx_per_port: | |||
description: |- | |||
Enables Rx port level disaggregation with metrics tag name set as "rx_port_name". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... metrics tag name automatically set by implementation as "rx_port_name" in the flow metric_response ( use exact attribute names from metric response )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also , do we need updates in otg gnmi model to specify that response can be filtered for specific port by specifing the filter for rx_port_name: in this exact format . Otherwise, how will gnmi user know what filter to use or expect which has to be parsed to set expectation in tests when checking stats using gnmi ( or otg metrics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For gNMI query path: “/flows/flow[name=f1]/”
expected gNMI response (for the metric response example in PR description):
"updates": [
{
"Path": "flows/flow[name=f1]",
"values": {
"flows/flow": {
"open-traffic-generator-flow:name": "f1",
"open-traffic-generator-flow:state": { // Contains the aggregated per-flow stats
"counters": {
"in-pkts": "1000",
"in-octets": "8000",
"out-pkts": "1000",
"out-octets": "8000"
},
"in-frame-rate": "QKAAAA==",
"name": "f1",
"out-frame-rate": "QCAAAA==",
"transmit": true
},
"open-traffic-generator-flow:tag-metrics": { // Contains the disaggregated per-flow stats
"tag-metric": [
{
"name-value": "rx_name=p2",
"state": {
"counters": {
"in-octets": "4000",
"in-pkts": "500"
},
"name-value": "rx_name=p2",
"tags": [
{
"tag-name": "rx_name",
"tag-value": {
"value-as-string": "p2",
"value-type": "STRING"
}
}
]
}
},
{
"name-value": "rx_name=p3",
"state": {
"counters": {
"in-octets": "4000",
"in-pkts": "500"
},
"name-value": "rx_name=p3",
"tags": [
{
"tag-name": "rx_name",
"tag-value": {
"value-as-string": "p3",
"value-type": "STRING"
}
}
]
}
}
]
}
}
}
}
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description modified to reflect Rx port name location in tagged response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a more general question. A device's ethernet
can have a connection
to a port
, lag
or vxlan
tunnel. As a result, a flow that is received by multiple devices can come through all of these types of connections. Shall we disaggregate Rx metrics accordingly, per lag
or vxlan
when configured as such, instead of port-level disaggregation?
flow/metrics.yaml
Outdated
@@ -30,6 +35,13 @@ components: | |||
Latency metrics. | |||
$ref: '#/components/schemas/Flow.Latency.Metrics' | |||
x-field-uid: 4 | |||
rx_per_port: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also have rx_per_lag
and rx_per_vxlan
? Or maybe if we want to support those in future, shall we do rx_per_connection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we do not have support for drill down stats for lag or vxlan in TE. So we will support ports level disaggregation for now. But the model is very extensible and when we have the underlying support for those, we can extend the model without any breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you give an example of how the model would be extended? it would be good to see how that would be aligned with rx_per_port
parameter. i'm thinking maybe we should use choice
here and set it as port
first and then extend to lag
and vxlan
later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have restructured this part, now all additional tracking will come under "predefined_metric_tags" and we can add new properties under this. Current property "rx_name" will apply for both raw port and lag.
flow/metrics.yaml
Outdated
@@ -69,3 +81,44 @@ components: | |||
x-field-uid: 1 | |||
cut_through: | |||
x-field-uid: 2 | |||
Flow.TxRxReplication: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't Replication
imply additional copies? Meaning, if we have Replication = 1
it means we expect to receive twice more packets that we sent? Maybe we could use TxRxRatio
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I have updated it to TxRxRatio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we change the reference as well from $ref: '#/components/schemas/Flow.TxRxReplication'
to $ref: '#/components/schemas/Flow.TxRxRatio'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor typo.
flow/metrics.yaml
Outdated
x-field-uid: 3 | ||
Flow.RxTxRatio.RxCount: | ||
description: |- | ||
This is for cases where one copy each Tx packet is received on all Rx ports and so the sum total of Rx packets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one copy of
flow/endpoint.yaml
Outdated
type: array | ||
items: | ||
description: |- | ||
The unique name of a port that is the intended receive port. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we correct 'port' to 'port or lag'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not seeing the updated change?
flow/metrics.yaml
Outdated
x-field-uid: 2 | ||
value: | ||
description: |- | ||
Should be a positive, non-zero value. A custom integer value (>1) can be specified for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to explain the scenario with the default value (i.e. == 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added description for default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still thinking about GNMI outputs a bit here, but I'm wondering if we should just use the port name directly as the tag ("" vs. "rx_port_name=".) (It should be a unique name in the config, but I'm also not sure if there are any other future 'built-in' ones that could not be specified this way.)
Unrelated clarification question: These tags will be included in disaggregated GNMI stats in the same way/alongside the user-defined ones correct?
properties: | ||
rx_name: | ||
description: |- | ||
Enables Rx port or lag level disaggregation with predefined metrics tag name set as "rx_name". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need an update from rx_name->rx_port_name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both otg and gnmi tag name is going to be rx_name (rx_port_name was the earlier proposal). The rx_name is generic tag name, referring to both port and lag as rx endpoints.
So if we have a flow configured from lag L1 to lag L2 and a separate port P5, the metrics response will look like:
F1: L1 => L2, P5
Flow Frames Tx Bytes Tx Frames Rx Bytes Rx Tags
f1 1000 64000 1000 64000
f1 400 25600 rx_name: p5
f1 600 38400 rx_name: l2
There can be other built-in tag names in future, like "mem_name" (as an example), if we want to disaggregate based on lag member ports also.
And yes, these tag names will be alongside user defined ones.
format: float | ||
default: 1.0 | ||
x-field-uid: 3 | ||
Flow.RxTxRatio.RxCount: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my current understanding, I'm wondering if we need this option. I'm not sure it's more convenient/explicit for test writer to use vs. just setting the port count explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User can use RxCount, so even if the rx port count changes in config, user does not have to update it.
But I leave it to you, if you feel otherwise.
flow/endpoint.yaml
Outdated
type: array | ||
items: | ||
description: |- | ||
The unique name of a port that is the intended receive port. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not seeing the updated change?
Support to address issue #319 (Metrics response disaggregation based on Rx ports).
This is Rx port level disaggregation of metrics response required for multi Rx port device (and raw) traffic. Also addressed is support for specifying Rx/Tx frames ratio.
Example code snippet
For control plane BGP sessions configured on P1, P2 and P3 with device names
p1d1peer1rrv4, p2d1peer1rrv4 and p3d1peer1rrv4 respectively.
Following is the data plane configuration
We are expecting Tx and Rx packets to match in this example and hence the Tx Rx ratio is set to "one"
Query for metrics results
Note: The “rx_name” would be in-built tag name for Rx ports and cannot be used for any egress tracking name.