
[dualtor] Improve mux_simulator #16164

Merged
merged 3 commits into sonic-net:master on Dec 25, 2024
Conversation

lolyu
Contributor

@lolyu lolyu commented Dec 19, 2024

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

The dualtor nightly suffers from mux simulator timeouts; HTTP timeout failures are consistently observed.
This PR improves the mux simulator's performance in two ways:

  1. improve the performance of toggling all mux ports.
  2. improve the mux simulator's read/write throughput.

PR 1522 addressed the issue, but only as a temporary quick fix.

How did you do it?

  1. Run the mux simulator with gunicorn instead of Flask's built-in HTTP server.
    The mux simulator previously ran on Flask's built-in HTTP server in single-threaded mode, which limited its performance and throughput, and it was observed getting stuck reading from dead connections. PR 1522 proposed a temporary fix by running the mux simulator in threaded mode. The threaded approach improved throughput, but the built-in server caps the TCP listen backlog at 128, and it is designed for development/testing and is not recommended for production deployment (see Flask's deployment docs).
    So let's run the mux simulator with gunicorn instead:
  • better performance/throughput with a customized worker count
  • an increased TCP listen backlog
  2. Use a thread pool to parallelize toggle requests.
    The mux simulator handles a toggle-all request by toggling each mux port one by one; let's use a thread pool to run those toggles in parallel and further decrease the response time.
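The gunicorn deployment in step 1 can be expressed as a standard `gunicorn.conf.py`. The values below are illustrative assumptions for the settings the description mentions (worker count, listen backlog), not the PR's actual configuration, which lives in the mux-simulator service template:

```python
# gunicorn.conf.py -- illustrative sketch only; actual values are set
# in ansible/roles/vm_set/templates/mux-simulator.service.j2.
bind = "0.0.0.0:8080"   # hypothetical port; the real port comes from the testbed config
workers = 4             # customized worker count for better performance/throughput
threads = 2             # threads per worker
backlog = 2048          # raise the TCP listen backlog above the dev server's 128
timeout = 60            # recycle workers stuck on dead connections
```

With such a file in place, the server could be launched as `gunicorn -c gunicorn.conf.py mux_simulator:app` (the `mux_simulator:app` module:attribute name is an assumption).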
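Step 2 can be sketched as below. This is a minimal illustration of the thread-pool fan-out, not the PR's actual code; `toggle_mux` and the port names are hypothetical stand-ins for the real per-port toggle logic in `mux_simulator.py`:

```python
from concurrent.futures import ThreadPoolExecutor

def toggle_mux(port):
    # Stand-in for the real per-port toggle operation.
    return port, "toggled"

def toggle_all(ports, max_workers=16):
    # Fan the per-port toggles out across a thread pool instead of
    # toggling sequentially, so the toggle-all latency is bounded by
    # the slowest single toggle rather than the sum of all toggles.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(toggle_mux, ports))

results = toggle_all(["Ethernet0", "Ethernet4", "Ethernet8"])
```

`pool.map` preserves input order and propagates any exception raised by a toggle, so error handling stays equivalent to the sequential loop.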

How did you verify/test it?

Run the following benchmarks on a dualtor-120 testbed and compare the performance of:

  • A: the original mux simulator, with Flask's built-in server in single-threaded mode.
  • B: the mux simulator with Flask's built-in server in threaded mode.
  • C: the mux simulator with this PR.

  1. Toggle mux status for all mux ports (one request toggles one mux port):
  • 20 concurrent users, repeated 2000 times

mux simulator version | A   | B   | C
elapsed time          | 96s | 37s | 36s

  2. Toggle mux status for all mux ports (one request toggles all mux ports):
  • 1 user, repeated 1 time

mux simulator version | A   | B   | C
elapsed time          | 16s | 16s | 7s

To summarize, the mux simulator with this PR has the best toggle performance.
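A harness for the "20 concurrent users, repeated 2000 times" scenario could look like the following sketch; `toggle_once` is a stand-in for the real HTTP request to the mux simulator, and the worker/request counts are parameters, not the benchmark's fixed values:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def toggle_once(i):
    # Stand-in for one HTTP POST toggling a single mux port.
    return i

def benchmark(users=20, repeats=2000):
    # Issue `repeats` toggle requests from `users` concurrent workers
    # and report the total wall-clock time.
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=users) as pool:
        list(pool.map(toggle_once, range(repeats)))
    return time.monotonic() - start

elapsed = benchmark(users=4, repeats=100)
```

The elapsed wall-clock time under each server configuration (A, B, C) is what the tables above compare.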

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

Signed-off-by: Longxiang Lyu <[email protected]>
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu requested a review from Copilot December 19, 2024 12:40


Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.

Files not reviewed (1)
  • ansible/roles/vm_set/templates/mux-simulator.service.j2: Language not supported
Comments suppressed due to low confidence (2)

ansible/roles/vm_set/files/mux_simulator.py:955

  • The variable 'default_handler' is not defined, which will raise a NameError. Define 'default_handler' before using it.
app.logger.removeHandler(default_handler)

ansible/roles/vm_set/files/mux_simulator.py:947

  • The new behavior introduced in the 'setup_mux_simulator' function is not covered by tests. Add tests to cover this new behavior.
def setup_mux_simulator(http_port, vm_set, verbose):
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu requested a review from Copilot December 20, 2024 07:25

@Copilot Copilot AI left a comment


Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • ansible/roles/vm_set/templates/mux-simulator.service.j2: Language not supported

@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu marked this pull request as ready for review December 23, 2024 01:55
@lolyu lolyu requested review from yxieca and wangxin December 23, 2024 03:15
@wangxin wangxin merged commit 9f2412d into sonic-net:master Dec 25, 2024
19 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Jan 7, 2025
@mssonicbld
Collaborator

Cherry-pick PR to 202405: #16369

mssonicbld pushed a commit that referenced this pull request Jan 7, 2025