
redis omem leaking issue on T2 supervisor #20680

Closed
sdszhang opened this issue Nov 4, 2024 · 9 comments
Labels: chassis-packet, P0 (Priority of the issue), Triaged (this issue has been triaged)

sdszhang commented Nov 4, 2024

Description

We are seeing a memory leak issue on the T2 Supervisor when running the nightly test: redis memory keeps increasing until it fails the sanity_check in sonic-mgmt.

Following is one of the logs where db memory exceeded the sanity_check threshold.

06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0303 INFO   | asic0 db memory over the threshold 
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0304 INFO   | asic0 db memory omem non-zero output: 
id=1367 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=262 name= age=36559 idle=20085 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=29 omem=594616 tot-mem=597464 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1368 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=263 name= age=36559 idle=19157 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=15 omem=307560 tot-mem=310408 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1369 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=264 name= age=36559 idle=20717 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=57 omem=1168728 tot-mem=1171576 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1370 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=265 name= age=36559 idle=21043 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=464 omem=9513856 tot-mem=9516704 events=rw cmd=psubscribe user=default redir=-1 resp=2
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0307 INFO   | Done checking database memory on svcstr2-8800-sup-1

06/10/2024 05:26:45 parallel.parallel_run                    L0221 INFO   | Completed running processes for target "_check_dbmemory_on_dut" in 0:00:02.809825 seconds
06/10/2024 05:26:45 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': True, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 11584760}]
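
For reference, the sanity check boils down to summing the omem field across redis clients. Below is a minimal sketch of that check, assuming redis-py is available on the DUT and using the asic0 redis unix socket path shown in the log (other namespaces use different paths); the actual sonic-mgmt check parses the same CLIENT LIST output.

import redis

SOCK = "/var/run/redis/redis.sock"  # asic0 socket path from the log above; an assumption for other namespaces

def total_omem(sock_path=SOCK):
    r = redis.Redis(unix_socket_path=sock_path, decode_responses=True)
    # Each entry mirrors one CLIENT LIST line; omem is the per-client output buffer memory.
    return sum(int(c.get("omem", 0)) for c in r.client_list())

if __name__ == "__main__":
    print("total omem:", total_omem())  # sonic-mgmt fails the check when this exceeds its threshold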

The memory leak was seen after running one of the following 3 modules. Once total_omem becomes non-zero, it keeps increasing until it exceeds the threshold.

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

Steps to reproduce the issue:

  1. Run full nightly test on a T2 testbed.

Describe the results you received:

The testbed fails the sanity check due to omem exceeding the threshold after running the nightly test on a T2 testbed.

Describe the results you expected:

redis omem should be released after usage and should not keep increasing.

Output of show version:

admin@svcstr2-8800-sup-1:~$ show version

SONiC Software Version: SONiC.jianquan.cicso.202405.08
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: b60548f2f6
Build date: Fri Nov  1 11:20:02 UTC 2024
Built by: azureuser@00df58e3c000000

Platform: x86_64-8800_rp-r0
HwSKU: Cisco-8800-RP
ASIC: cisco-8000
ASIC Count: 10
Serial Number: FOC2545N2CA
Model Number: 8800-RP
Hardware Revision: 1.0
Uptime: 00:50:48 up 14:22,  3 users,  load average: 13.28, 12.07, 11.65
Date: Mon 04 Nov 2024 00:50:48

Output of show techsupport:

When running system_health/test_system_health.py test:
At the beginning of the test:

05/10/2024 23:23:32 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 0}]

At the end of the test:

06/10/2024 00:02:03 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 861168}]

This symptom is observed for all 3 test cases so far:

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

Additional information you deem important (e.g. issue happens only occasionally):

sdszhang changed the title from "redis memory leaking issue on T2 supervisor" to "redis omem leaking issue on T2 supervisor" on Nov 4, 2024

arlakshm commented Nov 6, 2024

@anamehra, @abdosi, can you please help triage this issue?

arlakshm added the Triaged (this issue has been triaged) label on Nov 6, 2024
anamehra commented:

The issue is not seen in the last few runs on the Cisco and MSFT testbeds.
It looks like some redis client of the global database docker on the Supervisor fails to read its buffer from redis, and this causes omem to increase. The platform code does not have any redis client for the global database, so it could be some SONiC infra client. Needs a repro to debug further.

Is there a way to map this client data to the client process? The id/fd from here were not very helpful in pinpointing the client.

id=1367 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=262 name= age=36559 idle=20085 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=29 omem=594616 tot-mem=597464 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1368 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=263 name= age=36559 idle=19157 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=15 omem=307560 tot-mem=310408 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1369 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=264 name= age=36559 idle=20717 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=57 omem=1168728 tot-mem=1171576 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1370 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=265 name= age=36559 idle=21043 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=464 omem=9513856 tot-mem=9516704 events=rw cmd=psubscribe user=default redir=-1 resp=2
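
CLIENT LIST does not expose a peer PID for unix-socket clients (and the name= field is empty here, so CLIENT SETNAME was not used), so there is no direct id-to-process mapping. One way to narrow it down is to rank the clients by omem and look at the db/cmd/idle fields: in this output all offenders are psubscribe clients on db=6. A minimal triage sketch, assuming redis-py and the same socket path as above:

import redis

def worst_clients(sock_path="/var/run/redis/redis.sock", top=5):
    r = redis.Redis(unix_socket_path=sock_path, decode_responses=True)
    clients = sorted(r.client_list(), key=lambda c: int(c.get("omem", 0)), reverse=True)
    # db/cmd/idle hint at what the client is doing even when its PID is unknown
    return [(c["id"], c["db"], c["cmd"], c["idle"], c["omem"]) for c in clients[:top]]

if __name__ == "__main__":
    for row in worst_clients():
        print(row)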

anamehra commented:

Quick update: the client connections that are leaking memory are from the snmp docker. I see 100+ client connections from snmp, and restarting a process like thermalctld in pmon causes the omem increase on the snmp connections.
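
A minimal repro sketch based on this observation, assuming the thermalctld restart is issued from the supervisor host and omem is read from the same redis socket as above (the socket path and the 60-second wait are illustrative assumptions):

import subprocess, time
import redis

SOCK = "/var/run/redis/redis.sock"

def psubscribe_omem(sock_path=SOCK):
    r = redis.Redis(unix_socket_path=sock_path, decode_responses=True)
    return {c["id"]: int(c["omem"]) for c in r.client_list() if c.get("cmd") == "psubscribe"}

before = psubscribe_omem()
subprocess.run(["docker", "exec", "-i", "pmon", "bash", "-c",
                "supervisorctl restart thermalctld"], check=True)
time.sleep(60)  # give the subscribers time to fall behind
after = psubscribe_omem()
for cid, omem in after.items():
    if omem > before.get(cid, 0):
        print(f"client id={cid} omem grew to {omem}")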


abdosi commented Nov 22, 2024

@SuvarnaMeenakshi: can you help look into this?


anamehra commented Dec 5, 2024

Hi @SuvarnaMeenakshi, did you get a chance to look into this issue? Thanks

rlhui added the P0 (Priority of the issue) and chassis-packet labels on Dec 6, 2024
yejianquan commented:

I have confirmed the RCA of the memory leak and am drafting the fix PR.
@sdszhang @cyw233
You shared 3 test modules that reproduce this memory leak:

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

and narrowed it down to the command

docker exec -i pmon bash -c 'supervisorctl restart thermalctld'

Can you confirm this command is the only shared trigger across the 3 test modules?
I can confirm there is a bug behind it, but I want to confirm whether there are other ways to reproduce it, because that could mean different bugs are involved.

yejianquan commented:

Offline synced with Chenyang and Shawn: docker exec -i pmon bash -c 'supervisorctl restart thermalctld' is the shared trigger of the memory leak.


yejianquan commented Dec 17, 2024

The redis memory leak is caused by 2 issues; more details are in the 2 linked issues and fix PRs below (an illustrative sketch of the leak mechanism follows the fix links).

snmpagent

snmpagent has a memory leak issue; it is triggered when an exception happens from which it never auto-recovers.

Issue: Redis memory leak risk in PhysicalEntityCacheUpdater #342
Fix: Fix redis memory leak issue in PhysicalEntityCacheUpdater #343

pmon

pmon on chassis enters a wrong state that won't auto-recover, which triggers the memory leak.

Issue: [chassis] PSU keys(generated by psud) got removed by the restart of thermalctld and won't auto recover. #575
Fix: [chassis][psud] Move the PSU parent information generation to the loop run function from the initialization function #576
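
For context, the general mechanism behind the leak (not the snmpagent code itself, just an illustration): a keyspace-notification subscriber that stops consuming messages while its connection stays open leaves redis queuing the notifications in that client's output buffer, which is exactly the omem counter growing in the CLIENT LIST output above. A hedged sketch, assuming redis-py, the same socket path as above, and keyspace notifications enabled on db 6 (STATE_DB in the default SONiC database map):

import redis

r = redis.Redis(unix_socket_path="/var/run/redis/redis.sock", decode_responses=True)
p = r.pubsub()
p.psubscribe("__keyspace@6__:*")  # matches the cmd=psubscribe, db=6 clients seen above

def handle(msg):
    # stand-in for the real update logic; if an exception here stops the consume
    # loop in a long-lived process while the subscription connection stays open,
    # redis keeps queuing notifications for it and this client's omem grows
    print(msg["channel"], msg["data"])

for msg in p.listen():
    handle(msg)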

yejianquan commented:

The issue can be closed after the 2 fix PRs are merged.
