[action] [PR:16164] [dualtor] Improve mux_simulator
#16369
Merged
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
The dualtor nightly suffers from mux simulator timeouts, and HTTP timeout failures are regularly observed.
This PR improves the mux simulator's performance.
PR 1522 was a quick fix for the issue, but it was only a temporary workaround.
How did you do it?
The mux simulator currently runs on Flask's own built-in HTTP server. Previously, the mux simulator ran in single-threaded mode, which limits its performance and throughput, and it was observed getting stuck reading from dead connections. PR 1522 proposed a temporary fix by running the mux simulator in threaded mode. Throughput improved with the threaded approach, but the built-in server caps the TCP listen backlog at 128, and it is designed for development/test purposes and is not recommended for deployment (see Flask's deployment docs). So let's run the mux simulator with gunicorn instead of Flask's built-in HTTP server.
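For illustration, a minimal sketch of what the gunicorn deployment could look like, assuming the Flask application object is importable as `mux_simulator:app`; the module path and the worker/thread/backlog values here are assumptions, not the PR's actual settings:

```python
# gunicorn.conf.py -- illustrative only; the PR's actual values may differ.
bind = "0.0.0.0:8080"     # assumed listen address/port for the mux simulator
worker_class = "gthread"  # threaded workers replace Flask's dev server
workers = 2               # a couple of processes for fault isolation
threads = 16              # concurrent requests handled per worker
backlog = 2048            # far above the dev server's fixed 128 listen backlog
timeout = 60              # reap workers stuck on dead connections

# Started with, for example:
#   gunicorn -c gunicorn.conf.py mux_simulator:app
```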
The mux simulator handles the toggle-all request by toggling each mux port one by one; let's use a thread pool to run those toggles in parallel and further reduce the response time.
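As a rough sketch of the thread-pool approach, assuming a per-port toggle routine exists (`toggle_port` below is a hypothetical stand-in for the simulator's real per-port toggle logic):

```python
# Sketch of parallelizing toggle-all with a thread pool; toggle_port is a
# hypothetical placeholder for the mux simulator's real per-port toggle routine.
from concurrent.futures import ThreadPoolExecutor

_POOL = ThreadPoolExecutor(max_workers=32)  # shared pool reused across requests

def toggle_port(port, target_side):
    """Placeholder: flip a single mux port to target_side."""
    return {"port": port, "active_side": target_side}

def toggle_all(ports, target_side):
    """Toggle every mux port concurrently instead of one by one."""
    futures = [_POOL.submit(toggle_port, port, target_side) for port in ports]
    # Collect results; result() re-raises any exception from a failed toggle.
    return [future.result() for future in futures]
```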
How did you verify/test it?
Ran the following benchmarks on a dualtor-120 testbed and compared this PR's performance against the Flask built-in server in single-threaded mode and the Flask built-in server in threaded mode. To summarize, the mux simulator with this PR has the best toggle performance.
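For reference, a timing sketch of the kind of toggle benchmark described above; the URL, path, and payload are assumptions for illustration, not the mux simulator's documented API:

```python
# Hypothetical toggle-all timing probe; endpoint and payload are assumed.
import time
import requests

MUX_SIMULATOR_URL = "http://localhost:8080/mux/vms-example"  # assumed base path

def time_toggle_all(side):
    """Return how long one toggle-all request takes, in seconds."""
    start = time.monotonic()
    resp = requests.post(MUX_SIMULATOR_URL, json={"active_side": side}, timeout=60)
    resp.raise_for_status()
    return time.monotonic() - start

print("toggle-all to upper_tor took %.2fs" % time_toggle_all("upper_tor"))
```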
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation