[dualtor][mux_simulator] Fix mux simulator stuck #15226

Merged Oct 31, 2024 (2 commits)

@lolyu lolyu commented Oct 29, 2024

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

An active-standby dualtor testbed fails to talk to mux_simulator; the connection attempt hangs:

# curl -v http://10.64.246.154:8082/mux/vms24-7/24
*   Trying 10.64.246.154:8082...
  • On the test server, the TCP SYN drop counters keep increasing:
# netstat -s | grep -i listen
    1531500 times the listen queue of a socket overflowed
    1531501 SYNs to LISTEN sockets dropped
  • The mux simulator's listen queue is overflowing (Recv-Q exceeds the backlog shown in Send-Q):
# ss -lnt
State                     Recv-Q                     Send-Q                                          Local Address:Port                                         Peer Address:Port
LISTEN                    129                          128                                                   0.0.0.0:8082                                              0.0.0.0:*
  • mux_simulator appears to be stuck in recvfrom:
# strace -p 21315
strace: Process 21315 attached
recvfrom(6,
  • There is no existing TCP connection on the test server or the DUT for fd 6.

mux_simulator is blocked reading from an already closed TCP connection, so it cannot handle subsequent HTTP requests, which causes the TCP listen queue to overflow.
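The `ss -lnt` check above can be automated. The following is a minimal sketch, not part of this PR: the helper name `overflowing_listeners` is hypothetical, and it assumes the Linux convention that, for a LISTEN socket, Recv-Q is the number of connections waiting to be accepted and Send-Q is the configured backlog.

```python
def overflowing_listeners(ss_output):
    """Return (local_address, recv_q, send_q) for each LISTEN socket whose
    pending-connection count (Recv-Q) exceeds its backlog (Send-Q).

    Hypothetical helper for illustration; expects the text printed by `ss -lnt`.
    """
    overflowing = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row, blank lines, and anything that is not a listener.
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        if recv_q > send_q:
            overflowing.append((fields[3], recv_q, send_q))
    return overflowing
```

Fed the output shown above, this would flag the listener on 0.0.0.0:8082, where Recv-Q is 129 against a backlog of 128.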

How did you do it?

  1. Run mux_simulator in threaded mode, so that one blocked request no longer stalls all other requests.
  2. Set the socket timeout to 60s. If a worker thread gets stuck in recvfrom as shown above, the timeout ensures the thread exits after 60s, so no resources are leaked.
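The two steps above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the change made to mux_simulator.py: it uses `http.server` instead of the simulator's actual server code, and `make_server`, `demo`, and the `/mux/demo` path are hypothetical names.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_server(host="127.0.0.1", port=0, timeout=60):
    """Build a threaded HTTP server whose connections time out when idle."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"ok\n"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    # StreamRequestHandler.setup() applies this value to every accepted
    # connection with settimeout(), so a read that never completes raises
    # after `timeout` seconds and the worker thread can exit.
    Handler.timeout = timeout
    # ThreadingHTTPServer handles each connection in its own thread.
    return ThreadingHTTPServer((host, port), Handler)


def demo(timeout=1):
    """Hold one silent connection open and check a second request is served."""
    server = make_server(timeout=timeout)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address[:2]
    idle = socket.create_connection((host, port))  # connects, never sends
    try:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", "/mux/demo")  # hypothetical path
        status = conn.getresponse().status
        conn.close()
    finally:
        idle.close()
        server.shutdown()
        server.server_close()
    return status
```

Calling `demo()` returns the HTTP status of the second request. With a single-threaded server and no timeout, that request would wait behind the silent connection, which is the behavior described in the motivation.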

How did you verify/test it?

Run mux_simulator with the change.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

wangxin previously approved these changes Oct 29, 2024
@mssonicbld

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is mandatory; please fix the detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing ansible/roles/vm_set/files/mux_simulator.py

check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Passed
flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped

To run the pre-commit checks locally, follow the steps below:

  1. Ensure that the default python is python3. In the sonic-mgmt docker container, the default python is python2. You can run
    the check by activating the python3 virtual environment in the sonic-mgmt docker container or outside of the sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  3. Go to the repository root folder.
  4. Install the pre-commit hooks:
pre-commit install
  5. Use pre-commit to check staged files:
pre-commit
  6. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

Signed-off-by: Longxiang Lyu <[email protected]>
@lolyu lolyu force-pushed the fix_mux_simulator_stuck branch from 4d468b4 to 0d23f53 Compare October 30, 2024 03:27
@wangxin wangxin merged commit 13920a3 into sonic-net:master Oct 31, 2024
15 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Oct 31, 2024
@mssonicbld

Cherry-pick PR to 202405: #15294

mssonicbld pushed a commit that referenced this pull request Oct 31, 2024
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Nov 5, 2024
@mssonicbld

Cherry-pick PR to 202311: #15354

mssonicbld pushed a commit that referenced this pull request Nov 5, 2024
sreejithsreekumaran pushed a commit to sreejithsreekumaran/sonic-mgmt that referenced this pull request Nov 15, 2024
yutongzhang-microsoft pushed a commit to yutongzhang-microsoft/sonic-mgmt that referenced this pull request Nov 21, 2024