Optimize counter polling interval by making it more accurate #1457

stephenxs · 2024-11-08T14:37:44Z

What I did

Improve the accuracy of the counter-polling interval with the following two optimizations.

1. Enable bulk counter polling to run with a smaller chunk size
There is one counter-polling thread per counter group. These threads compete for critical sections at the vendor SAI level, so a counter-polling thread must wait whenever another thread already holds the critical section it needs, which adds latency for the waiting counter group.
An example is the contention between the PFC watchdog and port counter groups.
The port counter group contains many counters and is polled in bulk mode, which takes a relatively long time. The PFC watchdog counter group contains only a few counters but is polled at a short interval. Sometimes the PFC watchdog thread has to wait before it can poll, which makes its polling interval inaccurate and can prevent a PFC storm from being detected in time.
To resolve this issue, we can reduce the chunk size of the port counter group. By default, the port counter group polls the counters of all ports in a single bulk operation. With a smaller chunk size, it polls the counters in several bulk operations, each covering a subset of the ports whose size is at most the chunk size.
As a result, the port counter group stays in the critical section for a shorter time per operation, and the PFC watchdog is more likely to be scheduled to poll its counters and detect a PFC storm in time.
2. Collect the timestamp immediately after the vendor SAI API returns
Currently, many counter groups run a Lua plugin at each polling interval to calculate rates, detect certain events, and so on.
For example, the PFC watchdog counter group uses a plugin to detect PFC storms. In this case, the polling interval is calculated as the difference between the timestamps of the current and the previous poll, to avoid deviation due to scheduling latency. However, the timestamp is collected in the Lua plugin, which runs several steps after the SAI API returns and executes in a different context (redis-server). Both factors introduce even larger deviations. To overcome this, we collect the timestamp immediately after the SAI API returns.
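
The two optimizations above can be illustrated with a minimal sketch. This is not the code from this PR; it is written in Python for brevity. `sai_lock`, `chunked`, `poll_group`, and the `bulk_get_stats` callback are hypothetical names standing in for the vendor SAI critical section and the bulk statistics call, and the nanosecond monotonic clock is an assumption.

```python
import threading
import time
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

# Hypothetical stand-in for the critical section inside the vendor SAI.
sai_lock = threading.Lock()


def chunked(items: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def poll_group(
    ports: Sequence[str],
    chunk_size: int,
    bulk_get_stats: Callable[[Sequence[str]], Dict[str, int]],
    clock: Callable[[], int] = time.monotonic_ns,
) -> List[Tuple[int, Dict[str, int]]]:
    """Poll all ports in bulk operations of at most chunk_size ports each.

    Returns one (timestamp, counters) pair per bulk operation.
    """
    results = []
    for chunk in chunked(ports, chunk_size):
        with sai_lock:  # held for a single chunk only
            counters = bulk_get_stats(chunk)
            timestamp = clock()  # taken right after the call returns
        results.append((timestamp, counters))
        # The lock is released here, so the thread of another counter
        # group (for example the PFC watchdog) can enter between chunks.
    return results


def fake_bulk_get_stats(chunk: Sequence[str]) -> Dict[str, int]:
    """Illustrative replacement for the real bulk statistics call."""
    return {port: 0 for port in chunk}


ports = [f"Ethernet{i}" for i in range(10)]
results = poll_group(ports, chunk_size=4, bulk_get_stats=fake_bulk_get_stats)
print([len(counters) for _, counters in results])  # [4, 4, 2]
```

With 10 ports and a chunk size of 4, the group enters the critical section three times for short periods instead of once for a long one, and each bulk operation carries a timestamp taken as close to the call's return as possible.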

Why I did it

How I verified it

Ran the regression tests and observed the counter-polling performance.

A comparison test shows very good results when either or both of the above optimizations are applied.

Details if related

For optimization 2 (timestamp collection): each counter group contains more than one counter context, depending on the type of object. A counter context is keyed by (group, object type). However, counters fetched by different counter groups are pushed into the same entry when they belong to the same object.
E.g., the PFC_WD group contains counters of ports and queues, the PORT group contains counters of ports, and the QUEUE_STAT group contains counters of queues.
Both the PFC_WD and PORT groups push counter data into the item representing a port. However, each counter group has its own polling interval, which means counter IDs polled by different counter groups can be polled at different times and therefore carry different timestamps.
We use the name of a counter group to identify the timestamp of that group.
E.g., in a port counter entry, PORT_timestamp represents the last time the port counter group polled the counters, and PFC_WD_timestamp represents the last time the PFC watchdog counter group polled the counters.
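
The per-group timestamp naming can be sketched as follows. This is an illustration in Python, not the PR's code: the entry is modelled as a plain dict, `timestamp_field` and `store_poll` are hypothetical helpers, the counter IDs are only examples, and the integer timestamps are arbitrary units chosen for the example.

```python
from typing import Dict, Optional

# A shared per-object entry, modelled as a plain dict of field -> value.
Entry = Dict[str, int]


def timestamp_field(group: str) -> str:
    """Field holding the last poll time of a group, e.g. PORT_timestamp."""
    return f"{group}_timestamp"


def store_poll(entry: Entry, group: str, counters: Dict[str, int],
               timestamp: int) -> Optional[int]:
    """Write a group's counters and its own timestamp into the shared entry.

    Returns the interval since that group's previous poll, or None on the
    group's first poll.
    """
    field = timestamp_field(group)
    previous = entry.get(field)
    entry.update(counters)
    entry[field] = timestamp
    return None if previous is None else timestamp - previous


port_entry: Entry = {}
store_poll(port_entry, "PORT", {"SAI_PORT_STAT_IF_IN_OCTETS": 100}, 1_000)
store_poll(port_entry, "PFC_WD", {"SAI_PORT_STAT_PFC_3_RX_PKTS": 5}, 1_200)
interval = store_poll(port_entry, "PFC_WD",
                      {"SAI_PORT_STAT_PFC_3_RX_PKTS": 9}, 1_400)
print(interval)  # 200
print(sorted(k for k in port_entry if k.endswith("_timestamp")))
# ['PFC_WD_timestamp', 'PORT_timestamp']
```

Because each group writes only its own timestamp field, the interval seen by a consumer of PFC_WD counters is unaffected by when the PORT group last updated the same entry.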

stephenxs · 2024-11-08T14:38:50Z

This PR requires swss to be updated correspondingly. The swss PR will be opened soon.

stephenxs · 2024-11-11T07:45:37Z

Depends on sonic-net/sonic-swss-common#950

stephenxs · 2024-12-10T23:32:27Z

HLD sonic-net/SONiC#1864

mssonicbld · 2024-12-13T06:54:21Z

/azp run

azure-pipelines · 2024-12-13T06:54:33Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2024-12-13T09:26:43Z

/azp run

azure-pipelines · 2024-12-13T09:26:55Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2024-12-18T10:33:50Z

/azp run

azure-pipelines · 2024-12-18T10:34:01Z

Azure Pipelines successfully started running 1 pipeline(s).

kcudnik · 2024-12-24T13:18:40Z

/azp run

azure-pipelines · 2024-12-24T13:18:51Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2024-12-29T02:15:18Z

/azp run

azure-pipelines · 2024-12-29T02:15:29Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2024-12-29T02:16:48Z

/azp run

azure-pipelines · 2024-12-29T02:16:58Z

Azure Pipelines successfully started running 1 pipeline(s).

syncd/FlexCounter.cpp

mssonicbld · 2024-12-31T00:24:05Z

/azp run

azure-pipelines · 2024-12-31T00:24:16Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2025-01-02T08:28:08Z

/azp run

azure-pipelines · 2025-01-02T08:28:19Z

Azure Pipelines successfully started running 1 pipeline(s).

stephenxs · 2025-01-06T21:56:37Z

@kcudnik Would you help to review the PR?
Thanks

kcudnik · 2025-01-08T09:48:35Z

please fix build

stephenxs · 2025-01-08T09:57:10Z

> please fix build

I triggered a build a week ago, but it fetched swss-common from build #728172, which was built more than two weeks ago and did not include my commit. Retriggering it.

==============================================================================
Download from the specified build: #728172
Download artifact to: /home/vsts/work/1/a/download
Using default max parallelism.

stephenxs · 2025-01-08T09:57:31Z

/azpw run

mssonicbld · 2025-01-08T09:57:34Z

/AzurePipelines run

azure-pipelines · 2025-01-08T09:57:44Z

Azure Pipelines successfully started running 1 pipeline(s).

kcudnik · 2025-01-08T10:04:41Z

maybe it would require some empty commit to add to this PR to trigger fetch new swss common

stephenxs · 2025-01-08T10:06:28Z

> maybe it would require some empty commit to add to this PR to trigger fetch new swss common

will do a force push

mssonicbld · 2025-01-08T10:14:36Z

/azp run

azure-pipelines · 2025-01-08T10:14:48Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2025-01-09T07:03:02Z

/azp run

azure-pipelines · 2025-01-09T07:03:14Z

Azure Pipelines successfully started running 1 pipeline(s).

stephenxs force-pushed the counter-optimization-all-in-one branch from c14fd22 to 6b362f6 on November 18, 2024 02:15

stephenxs marked this pull request as ready for review November 25, 2024 12:06

stephenxs force-pushed the counter-optimization-all-in-one branch from 6b362f6 to b82e233 on November 25, 2024 12:06

stephenxs force-pushed the counter-optimization-all-in-one branch from b82e233 to 5442b37 on December 13, 2024 06:54

stephenxs added the "Request for 202405 Branch" and "Request for 202411 Branch" labels Dec 13, 2024

stephenxs force-pushed the counter-optimization-all-in-one branch from 5442b37 to 6c1086a on December 13, 2024 09:26

stephenxs mentioned this pull request Dec 13, 2024

Make the counter polling interval accurate by setting bulk counter poll chunk size per group or per counter sonic-net/SONiC#1864

Open

mssonicbld added the Cherry Pick Conflict_202405 label Dec 18, 2024

stephenxs requested a review from kcudnik December 20, 2024 22:26

liat-grozovik removed the "Cherry Pick Conflict_202405" and "Request for 202405 Branch" labels Dec 24, 2024

r12f added the Request for msft-202412 Branch label Dec 27, 2024

stephenxs force-pushed the counter-optimization-all-in-one branch from 9624365 to 677bfc2 on December 29, 2024 02:15

github-advanced-security bot found potential problems Dec 29, 2024

syncd/FlexCounter.cpp (fixed)

Optimize bulk counter to make the polling interval accurate

44a3c65

Signed-off-by: Stephen Sun <[email protected]>

stephenxs force-pushed the counter-optimization-all-in-one branch from 57c81d5 to 44a3c65 on January 8, 2025 10:13

Update log severity before and after bulk counter polling to info

b128298

Signed-off-by: Stephen Sun <[email protected]>
