[T2][Chassis] tsa-tsb: Add additional timer check before checking tsa-tsb service status #15649

Merged on Dec 14, 2024 (1 commit)

Conversation

sanjair-git

Description of PR

Summary:
Fixes # (issue)

  • This PR fixes a corner case in the 'test_user_init_tsb_on_sup_while_service_run_on_dut' test in 'test_startup_tsa_tsb_service.py' by adding an extra check when fetching the 'tsa-tsb' service status on the line cards after 'TSB' is applied on the supervisor card.
  • With the fix, the service is treated as running only if it is in the 'active (running)' state and its uptime is less than the configured timer value.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

  • The test 'test_user_init_tsb_on_sup_while_service_run_on_dut' fails on one of the line cards with the following error:
            # Issue user initiated TSB on the supervisor
            suphost.shell('TSB')
    
            for linecard in duthosts.frontend_nodes:
                if get_tsa_tsb_service_status(linecard, 'running'):
                    # Verify DUT continues to be in maintenance state if the timer is running.
>                   pytest_assert(TS_MAINTENANCE == get_traffic_shift_state(linecard, cmd='TSC no-stats'),
                                  "DUT is not in maintenance state when startup_tsa_tsb service is running")
E                                 Failed: DUT is not in maintenance state when startup_tsa_tsb service is running
  • This happens because the line cards are checked one after another. By the time the service status is checked on the second line card, after the check on the first line card has completed, the service can still show as running with an uptime equal to the configured timer value (900 seconds = 15 min), while the 'TSC' command already reports the state as NORMAL.

**"stdout": " Active: active (running) since Sat 2024-11-16 20:19:21 UTC; 15min ago",**

AnsibleModule::shell, args=["sudo systemctl status startup_tsa_tsb.service | grep 'Active'"], kwargs={}
AnsibleModule::shell Result => {"changed": true, "stdout": "     Active: active (running) since Sat 2024-11-16 20:19:21 UTC; 15min ago", "stderr": "", "rc": 0, "cmd": "sudo systemctl status startup_tsa_tsb.service | grep 'Active'", "start": "2024-11-16 20:34:32.082092", "end": "2024-11-16 20:34:32.107746", "delta": "0:00:00.025654", "msg": "", "invocation": {"module_args": {"_raw_params": "sudo systemctl status startup_tsa_tsb.service | grep 'Active'", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "executable": null, "creates": null, "removes": null, "stdin": null}}, "stdout_lines": ["     Active: active (running) since Sat 2024-11-16 20:19:21 UTC; 15min ago"], "stderr_lines": [], "_ansible_no_log": null, "failed": false}
AnsibleModule::shell, args=["TSC no-stats"], kwargs={}
AnsibleModule::shell Result => {"changed": true, "stdout": "BGP0 : System Mode: Normal\nBGP1 : System Mode: Normal", "stderr": "", "rc": 0, "cmd": "TSC no-stats", "start": "2024-11-16 20:34:33.002669", "end": "2024-11-16 20:34:39.089304", "delta": "0:00:06.086635", "msg": "", "invocation": {"module_args": {"_raw_params": "TSC no-stats", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "executable": null, "creates": null, "removes": null, "stdin": null}}, "stdout_lines": ["BGP0 : System Mode: Normal", "BGP1 : System Mode: Normal"], "stderr_lines": [], "_ansible_no_log": null, "failed": false}
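The 'TSC no-stats' output in the log above reports one "System Mode" line per BGP instance. As a rough illustration of how such output could be reduced to a single state, here is a minimal sketch. The function name and the state labels are placeholders for this example and are not the sonic-mgmt implementation; the "Maintenance" mode string is an assumption, since only "Normal" appears in the log.

```python
import re

# Placeholder state labels for this sketch; the real constants live in the
# test framework and may hold different values.
TS_NORMAL = "TS_NORMAL"
TS_MAINTENANCE = "TS_MAINTENANCE"
TS_INCONSISTENT = "TS_INCONSISTENT"


def classify_tsc_output(stdout):
    """Reduce 'TSC no-stats' stdout to one placeholder state label.

    Hypothetical helper: collects the mode reported on every
    'System Mode: <mode>' line and requires all of them to agree.
    """
    modes = set(re.findall(r"System Mode:\s*(\w+)", stdout))
    if modes == {"Normal"}:
        return TS_NORMAL
    if modes == {"Maintenance"}:  # assumed wording, not shown in the log
        return TS_MAINTENANCE
    # Mixed modes, unknown modes, or no matching lines at all.
    return TS_INCONSISTENT
```

Applied to the stdout captured above, both BGP instances report "Normal", so the sketch returns the normal state even though the service status line still reads "active (running)".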

Note: Only this test case requires this change; other tests do not hit this corner case.

How did you do it?

  • Add a timer check on top of the existing status check: the service counts as running only if its runtime is less than the configured timer value.
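A minimal sketch of such a combined check, assuming the "Active:" line format seen in the log in the motivation section. The helper name, its parameters, and the regular expression are hypothetical and are not taken from the actual change. The caller is assumed to pass the current time in the same timezone as the timestamp in the status line.

```python
import re
from datetime import datetime

# Matches e.g. "Active: active (running) since Sat 2024-11-16 20:19:21 UTC; 15min ago"
ACTIVE_RUNNING_RE = re.compile(
    r"Active:\s+active \(running\) since \w+ (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def service_running_within_timer(active_line, now, timer_seconds):
    """Return True only if the service is 'active (running)' AND its
    uptime is still below the configured timer value.

    Hypothetical helper: `active_line` is the 'Active:' line from
    `systemctl status`, `now` is a naive datetime in the same timezone
    as the timestamp in that line.
    """
    match = ACTIVE_RUNNING_RE.search(active_line)
    if match is None:
        # Not in the 'active (running)' state, or an unexpected format.
        return False
    started = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    uptime_seconds = (now - started).total_seconds()
    return uptime_seconds < timer_seconds
```

With the values from the failing run (service started at 20:19:21, status checked at 20:34:32, timer 900 seconds), the uptime is 911 seconds, so this sketch would no longer treat the service as running and the maintenance-state assertion would be skipped for that line card.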

How did you verify/test it?

  • Ran the above-mentioned test case on a T2 chassis and made sure the test passed without any issues.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@Javier-Tan

Hi Sanjair, sorry, I didn't quite understand. Are you saying it's possible that the TSA TSB startup service is still running (get_tsa_tsb_service_status(linecard, 'running')) and TSC shows maintenance mode (TS_MAINTENANCE == get_traffic_shift_state(linecard, cmd='TSC no-stats')) at the same time?

@sanjair-git

Hi @Javier-Tan,

Let me describe the test scenario in detail:

  1. Initially, the supervisor and all line cards are in the TS_NORMAL state.
  2. Reboot the supervisor card, which in turn reboots all line cards.
  3. When the line cards come up, the 'tsa-tsb' service starts running and applies the 'TSA' config, which puts the traffic_shift_state of the line cards into 'TS_MAINTENANCE'.
  4. While the service is running, the test case applies the 'TSB' command on the supervisor. This sets the 'CHASSIS_APP_DB' tsa_enabled config to 'false' for all cards (supervisor and line cards). During this time, the line card's CONFIG_DB value is still 'true' because the tsa-tsb service is running.

During Step 4 above, the test case handles two cases: either the line card's tsa-tsb service is still running, or the service has already finished and exited after timer expiry.

  • tsa-tsb service running case: we make sure traffic_shift_state is TS_MAINTENANCE
  • tsa-tsb service exited case: we make sure traffic_shift_state is TS_NORMAL
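The two cases, combined with the uptime condition this PR adds, can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, its parameters, and the string labels are hypothetical, and the 900-second default mirrors the timer value mentioned in this PR.

```python
# Timer value mentioned in this PR (15 min); used here as a default only.
TIMER_SECONDS = 900


def expected_traffic_shift_state(service_active, uptime_seconds,
                                 timer_seconds=TIMER_SECONDS):
    """Return the state the test would expect on a line card after
    'TSB' has been applied on the supervisor.

    Hypothetical sketch: the service only counts as running while its
    uptime is below the timer; at or past the timer it is treated like
    the exited case, even if systemd still reports it as active.
    """
    if service_active and uptime_seconds < timer_seconds:
        return "TS_MAINTENANCE"  # placeholder label for the running case
    return "TS_NORMAL"  # placeholder label for the exited case
```

For example, a line card whose service reports active with 339 seconds of uptime would be expected in maintenance, while one reporting active at 911 seconds would be expected in the normal state.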

As part of the above check, when we fetch the tsa-tsb service status, the second or third line card may show a status like the one below once the check for the first line card has completed (15 min = 900 seconds is the maximum configured value for the tsa-tsb service on the line card).

DEBUG tests.common.devices.base:base.py:108 /data/tests/common/devices/multi_asic.py::_run_on_asics#135: [ixre-egl-board7] AnsibleModule::shell Result => {"changed": true, "stdout": " **Active: active (running)** since Sat 2024-11-16 20:19:21 UTC; **15min ago**",

At the 15th minute, the tsa-tsb service applies 'TSB' on the line card and changes the traffic-shift state to 'TS_NORMAL' in the background, even though the service status still shows 'active (running)'. To handle this, I have added the above code change for this particular test case.

@Javier-Tan

Hi @sanjair-git, thanks for the clarification. I think this is a safe fix. I wonder whether the service staying active after it has finished its duty should be investigated; perhaps it is on purpose, for cleanup.

@sanjair-git

@Javier-Tan, from what I noticed, yes, it is intentional, for cleanup. Can you please approve if you are OK with the fix?

@rlhui rlhui merged commit 47ef91c into sonic-net:master Dec 14, 2024
19 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Dec 17, 2024
@mssonicbld

Cherry-pick PR to 202405: #16101

mssonicbld pushed a commit that referenced this pull request Dec 17, 2024