
Redis memory leak risk in PhysicalEntityCacheUpdater #342

Open
yejianquan opened this issue Dec 16, 2024 · 2 comments

@yejianquan
Contributor

Description

Steps to reproduce the issue:

  1. Start the snmp docker container
  2. Trigger an error in any of the physical_entity_updater classes that is never auto-recovered. For example, remove the key PHYSICAL_ENTITY_INFO|PSU 2 from STATE_DB; this causes a ValueError because the updater tries to cast an empty string to an int:
```
2024 Dec 13 08:32:42.100531 x-sup-2 ERR snmp#snmp-subagent [ax_interface] ERROR: MIBUpdater.start() caught an unexpected exception during update_data()
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/ax_interface/mib.py", line 45, in start
    self.reinit_data()
  File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 323, in reinit_data
    updater.reinit_data()
  File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 664, in reinit_data
    self._update_entity_cache(name)
  File "/usr/local/lib/python3.11/dist-packages/sonic_ax_impl/mibs/ietf/rfc2737.py", line 865, in _update_entity_cache
    psu_position = int(psu_position)
```
  3. The omem of the Redis client starts to increase slowly. Some output from my test script (a sketch of such a monitor follows the log below):
```
Iteration: 26
ASIC: 0, total_omem: 0, result: False
ASIC: 1, total_omem: 0, result: False
ASIC: 4, total_omem: 0, result: False
ASIC: 5, total_omem: 0, result: False
ASIC: 8, total_omem: 0, result: False
ASIC: 9, total_omem: 0, result: False
ASIC: 10, total_omem: 0, result: False
ASIC: 11, total_omem: 0, result: False
ASIC: 12, total_omem: 0, result: False
ASIC: 13, total_omem: 0, result: False
Sleep 5 seconds
---------------------------------------------------------------------------------
Iteration: 27
ASIC: 0, total_omem: 20504, result: False
ASIC: 1, total_omem: 20504, result: False
ASIC: 4, total_omem: 20504, result: False
ASIC: 5, total_omem: 20504, result: False
ASIC: 8, total_omem: 20504, result: False
ASIC: 9, total_omem: 20504, result: False
ASIC: 10, total_omem: 20504, result: False
ASIC: 11, total_omem: 20504, result: False
ASIC: 12, total_omem: 20504, result: False
ASIC: 13, total_omem: 20504, result: False
Sleep 5 seconds
---------------------------------------------------------------------------------

...

---------------------------------------------------------------------------------
Iteration: 83
ASIC: 0, total_omem: 225544, result: False
ASIC: 1, total_omem: 225544, result: False
ASIC: 4, total_omem: 225544, result: False
ASIC: 5, total_omem: 225544, result: False
ASIC: 8, total_omem: 225544, result: False
ASIC: 9, total_omem: 225544, result: False
ASIC: 10, total_omem: 225544, result: False
ASIC: 11, total_omem: 225544, result: False
ASIC: 12, total_omem: 225544, result: False
ASIC: 13, total_omem: 225544, result: False
Sleep 5 seconds
---------------------------------------------------------------------------------
Iteration: 84
ASIC: 0, total_omem: 225544, result: False
ASIC: 1, total_omem: 225544, result: False
ASIC: 4, total_omem: 225544, result: False
ASIC: 5, total_omem: 225544, result: False
ASIC: 8, total_omem: 225544, result: False
ASIC: 9, total_omem: 225544, result: False
ASIC: 10, total_omem: 225544, result: False
ASIC: 11, total_omem: 225544, result: False
ASIC: 12, total_omem: 225544, result: False
ASIC: 13, total_omem: 225544, result: False
Sleep 5 seconds
---------------------------------------------------------------------------------
```
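
For reference, here is a minimal sketch of a monitor in the spirit of the test script above: it sums the client output-buffer memory (omem) reported by CLIENT LIST for each ASIC's Redis instance. The per-ASIC unix-socket paths are assumptions and must be adapted to the platform.

```python
import time
import redis

# Assumed per-ASIC Redis socket layout; adjust to the actual multi-ASIC platform.
ASIC_SOCKETS = {
    0: '/var/run/redis0/redis.sock',
    1: '/var/run/redis1/redis.sock',
}

def total_omem(client):
    # CLIENT LIST reports the output-buffer memory (omem) per connected client.
    return sum(int(entry.get('omem', 0)) for entry in client.client_list())

iteration = 0
while True:
    iteration += 1
    print(f'Iteration: {iteration}')
    for asic, sock in ASIC_SOCKETS.items():
        r = redis.Redis(unix_socket_path=sock)
        print(f'ASIC: {asic}, total_omem: {total_omem(r)}')
    print('Sleep 5 seconds')
    time.sleep(5)
```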

Describe the results you received:
The Redis memory gradually increases; the snmp docker container has to be restarted to clear it.

Describe the results you expected:
Redis memory should not leak.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
@yejianquan yejianquan added the Bug label Dec 16, 2024
@yejianquan
Contributor Author

One of the root causes of sonic-net/sonic-buildimage#20680.

@yejianquan
Contributor Author

A detailed RCA of this issue is in the comments of #343.

SuvarnaMeenakshi pushed a commit that referenced this issue Dec 18, 2024
- What I did
Fixes the Redis memory leak bug: #342
There is a chance that the physical_entity_updater creates subscriptions to Redis and never consumes the messages because of exceptions. The output buffer (omem) of the Redis client then grows without bound, and Redis memory leaks.
The root cause: all 5 physical entity cache updaters inherit from PhysicalEntityCacheUpdater. On the first update_data call, each of them creates a pattern subscription to the Redis database:
```
self.pub_sub_dict[db_index] = mibs.get_redis_pubsub(db, db.STATE_DB, self.get_key_pattern())
```
Every time update_data is called again, it reads messages from that subscription and processes them:
```
msg = pubsub.get_message()
```
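
Purely as an illustration of that subscribe-once / poll-per-update pattern, assuming get_redis_pubsub returns a redis-py PubSub listening to STATE_DB keyspace events (the database index 6 and the key pattern are assumptions here; the real pattern comes from get_key_pattern()):

```python
import redis

# Assumption: STATE_DB is Redis database 6; the real key pattern comes from
# the updater's get_key_pattern().
r = redis.Redis(host='127.0.0.1', port=6379, db=6)

# First update_data(): create the pattern subscription once and cache it.
pubsub = r.pubsub()
pubsub.psubscribe('__keyspace@6__:PHYSICAL_ENTITY_INFO|*')

# Every later update_data(): poll the subscription and process pending events.
msg = pubsub.get_message(ignore_subscribe_messages=True)
while msg is not None:
    changed_key = msg['channel']   # which STATE_DB key changed
    # ... refresh the cached entity data for changed_key ...
    msg = pubsub.get_message(ignore_subscribe_messages=True)
```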
Outside, in the MIBUpdater logic, update_data is called much more frequently than reinit_data. A side effect is that if reinit_data keeps failing, the update counter is never reset, so update_data is never called again:
```
self.update_counter = 0
```
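
A simplified sketch of that scheduling behaviour (names taken from the traceback above; the real MIBUpdater.start() in ax_interface/mib.py catches the exception, logs it, and keeps looping):

```python
class MIBUpdaterSketch:
    """Illustrative only: reinit_data() runs every `frequency` update cycles."""

    frequency = 10  # assumed value

    def __init__(self):
        # Larger than frequency so the first cycle performs a reinit.
        self.update_counter = self.frequency + 1

    def run_one_cycle(self):
        if self.update_counter > self.frequency:
            # If reinit_data() raises, update_counter is never reset, so every
            # later cycle retries (and fails) this branch and update_data()
            # below is never reached again.
            self.reinit_data()
            self.update_counter = 0
        else:
            self.update_counter += 1
        self.update_data()

    def reinit_data(self): ...
    def update_data(self): ...
```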
So the problem is: the psubscription is created on the first update_data and everything works well, until an unrecoverable issue occurs:
PHYSICAL_ENTITY_INFO|PSU * is missing from the database (a pmon issue).
This makes both reinit_data and update_data fail, because both eventually call _update_per_namespace_data, which tries to cast an empty string '' to int and raises ValueError.
update_data is then never called again, because reinit_data never succeeds.
But the previously established psubscription is still there and nothing consumes its messages (update_data is blocked), so the Redis database memory slowly leaks.

- How I did it

Catch exceptions inside the reinit_data loop, so that reinit_data is still called for every physical_entity_updater.
Clear pending messages and cancel the subscription in reinit_data, so messages do not accumulate in the Redis subscription queue.
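
A minimal sketch of the two mitigations (illustrative only: pub_sub_dict comes from the snippet above, while release_pub_subs, reinit_all, and physical_entity_updaters are assumed names; get_message() and punsubscribe() are standard redis-py PubSub calls):

```python
import logging

logger = logging.getLogger(__name__)

def release_pub_subs(pub_sub_dict):
    """Drain and cancel cached pattern subscriptions so unread messages can no
    longer accumulate in the Redis client's output buffer (omem)."""
    for pubsub in pub_sub_dict.values():
        while pubsub.get_message():   # non-blocking; returns None when empty
            pass
        pubsub.punsubscribe()
    pub_sub_dict.clear()

def reinit_all(physical_entity_updaters):
    """Reinit every updater even if one of them fails, so a single bad updater
    cannot block the others or prevent the scheduler's counter reset."""
    for updater in physical_entity_updaters:
        try:
            updater.reinit_data()
        except Exception as exc:
            logger.warning('reinit_data failed for %s: %s',
                           type(updater).__name__, exc)
```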
- How to verify it
Tested on a Cisco chassis; the memory no longer leaks.
mssonicbld pushed a commit to mssonicbld/sonic-snmpagent that referenced this issue Dec 18, 2024
mssonicbld pushed a commit that referenced this issue Dec 18, 2024