
redis omem leaking issue on T2 supervisor #20680

Closed
sdszhang opened this issue Nov 4, 2024 · 9 comments
Labels: chassis-packet, P0 (Priority of the issue), Triaged (this issue has been triaged)

sdszhang commented Nov 4, 2024

Description

We are seeing a memory leak issue on the T2 Supervisor when running the nightly test: redis memory keeps increasing until it fails the sanity_check in sonic-mgmt.

Following is one of the logs where db memory exceeded the sanity_check threshold.

06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0303 INFO   | asic0 db memory over the threshold 
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0304 INFO   | asic0 db memory omem non-zero output: 
id=1367 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=262 name= age=36559 idle=20085 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=29 omem=594616 tot-mem=597464 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1368 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=263 name= age=36559 idle=19157 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=15 omem=307560 tot-mem=310408 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1369 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=264 name= age=36559 idle=20717 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=57 omem=1168728 tot-mem=1171576 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1370 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=265 name= age=36559 idle=21043 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=464 omem=9513856 tot-mem=9516704 events=rw cmd=psubscribe user=default redir=-1 resp=2
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0307 INFO   | Done checking database memory on svcstr2-8800-sup-1

06/10/2024 05:26:45 parallel.parallel_run                    L0221 INFO   | Completed running processes for target "_check_dbmemory_on_dut" in 0:00:02.809825 seconds
06/10/2024 05:26:45 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': True, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 11584760}]
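
For reference, the sanity check boils down to summing the omem field across redis clients. Below is a minimal sketch of that check, assuming redis-py is available on the DUT and using the asic0 redis unix socket path shown in the log (other namespaces use different paths); the actual sonic-mgmt check parses the same CLIENT LIST output.

import redis

SOCK = "/var/run/redis/redis.sock"  # asic0 socket path from the log above; an assumption for other namespaces

def total_omem(sock_path=SOCK):
    r = redis.Redis(unix_socket_path=sock_path, decode_responses=True)
    # Each entry mirrors one CLIENT LIST line; omem is the per-client output buffer memory.
    return sum(int(c.get("omem", 0)) for c in r.client_list())

if __name__ == "__main__":
    print("total omem:", total_omem())  # sonic-mgmt fails the check when this exceeds its threshold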

The memory leak was seen after running one of the following 3 modules. Once total_omem becomes non-zero, it keeps increasing until it exceeds the threshold.

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

Steps to reproduce the issue:

  1. Run full nightly test on a T2 testbed.

Describe the results you received:

The testbed fails the sanity check due to omem exceeding the threshold after running the nightly test on a T2 testbed.

Describe the results you expected:

redis omem should be released after usage and should not keep increasing.

Output of show version:

admin@svcstr2-8800-sup-1:~$ show version

SONiC Software Version: SONiC.jianquan.cicso.202405.08
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: b60548f2f6
Build date: Fri Nov  1 11:20:02 UTC 2024
Built by: azureuser@00df58e3c000000

Platform: x86_64-8800_rp-r0
HwSKU: Cisco-8800-RP
ASIC: cisco-8000
ASIC Count: 10
Serial Number: FOC2545N2CA
Model Number: 8800-RP
Hardware Revision: 1.0
Uptime: 00:50:48 up 14:22,  3 users,  load average: 13.28, 12.07, 11.65
Date: Mon 04 Nov 2024 00:50:48

Output of show techsupport:

When running system_health/test_system_health.py test:
At the beginning of the test:

05/10/2024 23:23:32 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 0}]

At the end of the test:

06/10/2024 00:02:03 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 861168}]

This symptom is observed for all 3 test cases so far:

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

Additional information you deem important (e.g. issue happens only occasionally):

sdszhang changed the title from "redis memory leaking issue on T2 supervisor" to "redis omem leaking issue on T2 supervisor" on Nov 4, 2024

arlakshm commented Nov 6, 2024

@anamehra, @abdosi, can you please help triage this issue?

arlakshm added the Triaged (this issue has been triaged) label on Nov 6, 2024
anamehra commented:

The issue is not seen in the last few runs on the Cisco and MSFT testbeds.
It looks like some redis client of the global database docker on the Supervisor fails to read its buffer from redis, and this causes omem to increase. The platform code does not have any redis client for the global database, so it could be some SONiC infra client. Needs a repro to debug further.

Is there a way to map this client data to the client process? The id/fd from here were not very helpful in pinpointing the client.

id=1367 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=262 name= age=36559 idle=20085 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=29 omem=594616 tot-mem=597464 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1368 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=263 name= age=36559 idle=19157 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=15 omem=307560 tot-mem=310408 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1369 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=264 name= age=36559 idle=20717 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=57 omem=1168728 tot-mem=1171576 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1370 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=265 name= age=36559 idle=21043 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=464 omem=9513856 tot-mem=9516704 events=rw cmd=psubscribe user=default redir=-1 resp=2
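
CLIENT LIST does not expose a peer PID for unix-socket clients (and the name= field is empty here, so CLIENT SETNAME was not used), so there is no direct id-to-process mapping. One way to narrow it down is to rank the clients by omem and look at the db/cmd/idle fields: in this output all offenders are psubscribe clients on db=6. A minimal triage sketch, assuming redis-py and the same socket path as above:

import redis

def worst_clients(sock_path="/var/run/redis/redis.sock", top=5):
    r = redis.Redis(unix_socket_path=sock_path, decode_responses=True)
    clients = sorted(r.client_list(), key=lambda c: int(c.get("omem", 0)), reverse=True)
    # db/cmd/idle hint at what the client is doing even when its PID is unknown
    return [(c["id"], c["db"], c["cmd"], c["idle"], c["omem"]) for c in clients[:top]]

if __name__ == "__main__":
    for row in worst_clients():
        print(row)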

anamehra commented:

Quick update: the client connections that are leaking memory are from the snmp docker. I see 100+ client connections from snmp, and restarting a process like thermalctld in pmon causes the omem increase on the snmp connections.
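
A minimal repro sketch based on this observation, assuming the thermalctld restart is issued from the supervisor host and omem is read from the same redis socket as above (the socket path and the 60-second wait are illustrative assumptions):

import subprocess, time
import redis

SOCK = "/var/run/redis/redis.sock"

def psubscribe_omem(sock_path=SOCK):
    r = redis.Redis(unix_socket_path=sock_path, decode_responses=True)
    return {c["id"]: int(c["omem"]) for c in r.client_list() if c.get("cmd") == "psubscribe"}

before = psubscribe_omem()
subprocess.run(["docker", "exec", "-i", "pmon", "bash", "-c",
                "supervisorctl restart thermalctld"], check=True)
time.sleep(60)  # give the subscribers time to fall behind
after = psubscribe_omem()
for cid, omem in after.items():
    if omem > before.get(cid, 0):
        print(f"client id={cid} omem grew to {omem}")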


abdosi commented Nov 22, 2024

@SuvarnaMeenakshi: can you help look into this?


anamehra commented Dec 5, 2024

Hi @SuvarnaMeenakshi, did you get a chance to look into this issue? Thanks

rlhui added the P0 (Priority of the issue) and chassis-packet labels on Dec 6, 2024
yejianquan commented:

I have confirmed the RCA of the memory leak and am drafting the fix PR.
@sdszhang @cyw233
You shared 3 test modules that reproduce this memory leak:

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

and narrowed it down to the command

docker exec -i pmon bash -c 'supervisorctl restart thermalctld'

Can you confirm this command is the only shared trigger across the 3 test modules?
I can confirm there is a bug behind it, but I want to confirm whether there are other ways to reproduce it, because that could mean different bugs are involved.

yejianquan commented:

Offline synced with Chenyang and Shawn: docker exec -i pmon bash -c 'supervisorctl restart thermalctld' is the shared trigger of the memory leak.


yejianquan commented Dec 17, 2024

The redis memory leak is caused by 2 issues; more details are in the 2 linked issues and fix PRs below (an illustrative sketch of the leak mechanism follows the fix links).

snmpagent

snmpagent has a memory leak issue; it is triggered when an exception happens from which it never auto-recovers.

Issue: Redis memory leak risk in PhysicalEntityCacheUpdater #342
Fix: Fix redis memory leak issue in PhysicalEntityCacheUpdater #343

pmon

pmon on chassis enters a wrong state that won't auto-recover, which triggers the memory leak.

Issue: [chassis] PSU keys(generated by psud) got removed by the restart of thermalctld and won't auto recover. #575
Fix: [chassis][psud] Move the PSU parent information generation to the loop run function from the initialization function #576
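
For context, the general mechanism behind the leak (not the snmpagent code itself, just an illustration): a keyspace-notification subscriber that stops consuming messages while its connection stays open leaves redis queuing the notifications in that client's output buffer, which is exactly the omem counter growing in the CLIENT LIST output above. A hedged sketch, assuming redis-py, the same socket path as above, and keyspace notifications enabled on db 6 (STATE_DB in the default SONiC database map):

import redis

r = redis.Redis(unix_socket_path="/var/run/redis/redis.sock", decode_responses=True)
p = r.pubsub()
p.psubscribe("__keyspace@6__:*")  # matches the cmd=psubscribe, db=6 clients seen above

def handle(msg):
    # stand-in for the real update logic; if an exception here stops the consume
    # loop in a long-lived process while the subscription connection stays open,
    # redis keeps queuing notifications for it and this client's omem grows
    print(msg["channel"], msg["data"])

for msg in p.listen():
    handle(msg)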

yejianquan commented:

The issue can be closed after the 2 fix PRs are merged.
