
Fix redis memory leak issue in PhysicalEntityCacheUpdater #343

Merged

Conversation

yejianquan
Contributor

@yejianquan yejianquan commented Dec 16, 2024

- What I did
Fixes the Redis memory leak bug:
#342
There is a chance that the physical_entity_updater creates subscriptions to Redis and never consumes the messages due to exceptions.
The output memory buffer (omem) of the Redis client then grows without bound, and Redis memory leaks.

The root cause is that all five physical entity cache updaters inherit from PhysicalEntityCacheUpdater.
In the first update_data call, they initialize the pattern subscription (psubscribe) to the Redis database:

self.pub_sub_dict[db_index] = mibs.get_redis_pubsub(db, db.STATE_DB, self.get_key_pattern())

Every time update_data is called again, it fetches a message from that subscription and processes it:
msg = pubsub.get_message()
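
Putting the two excerpts together, the per-namespace update path follows roughly the pattern sketched below. This is a simplified illustration of the behaviour described above, not the actual PhysicalEntityCacheUpdater code; only pub_sub_dict, get_redis_pubsub and get_message come from the excerpts, the rest is assumed.

    # Illustrative sketch of the subscribe-then-poll pattern described above
    # (not the actual PhysicalEntityCacheUpdater implementation).
    def _update_per_namespace_data(self, db_index, db):
        # Lazily create the pattern subscription on the first update.
        if db_index not in self.pub_sub_dict:
            self.pub_sub_dict[db_index] = mibs.get_redis_pubsub(
                db, db.STATE_DB, self.get_key_pattern())

        pubsub = self.pub_sub_dict[db_index]
        msg = pubsub.get_message()   # non-blocking read of one notification
        while msg:
            # ...parse the changed key and refresh the cached entity data;
            # if this processing raises, the loop is never reached again and
            # the subscription is never drained.
            msg = pubsub.get_message()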

Around that, the MIBUpdater logic calls update_data more frequently than reinit_data.
A side effect is that if reinit_data keeps failing, update_counter is never reset, so update_data is never called again:

self.update_counter = 0
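
A rough sketch of that scheduling, with assumed attribute names apart from update_counter itself (the real loop lives in the MIBUpdater base class and differs in detail):

    import asyncio

    # Assumed-shape sketch of the MIBUpdater scheduling described above;
    # only update_counter is taken from the excerpt, other names are invented.
    class MIBUpdaterSketch:
        def __init__(self, frequency=5, reinit_rate=10):
            self.frequency = frequency        # seconds between ticks (assumed)
            self.reinit_rate = reinit_rate    # ticks between reinit_data calls (assumed)
            self.update_counter = 0

        def reinit_data(self): ...
        def update_data(self): ...

        async def start(self):
            while True:
                try:
                    if self.update_counter > self.reinit_rate:
                        self.reinit_data()        # raises on every tick once the bug hits
                        self.update_counter = 0   # never reached, so the counter never resets
                    else:
                        self.update_counter += 1
                        self.update_data()        # consequently never called again
                except Exception:
                    pass  # the agent keeps running, but nothing drains the subscription
                await asyncio.sleep(self.frequency)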

So the problem is: the subscription is created on the first update_data call and everything works well, until an unrecoverable issue occurs,
such as PHYSICAL_ENTITY_INFO|PSU * keys missing from the database (a pmon issue).
This makes both reinit_data and update_data fail, because both ultimately call _update_per_namespace_data, which tries to cast an empty string '' to int and raises ValueError.
From then on update_data is never called, because reinit_data never succeeds.
But the previously established psubscription is still there with nothing consuming it (update_data is blocked), so Redis memory starts to leak slowly.
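
For concreteness, the exception is just Python refusing to convert the empty field value; a minimal reproduction of that step (not the actual code path):

    # Minimal reproduction: the missing PHYSICAL_ENTITY_INFO field comes back
    # as an empty string, and casting it to int raises ValueError.
    position = ''
    int(position)   # ValueError: invalid literal for int() with base 10: ''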

- How I did it

  1. Catch exceptions raised inside the reinit_data loop, so that reinit_data of every physical_entity_updater still gets called.
  2. Clear pending messages and cancel the subscription in reinit_data, so that messages do not accumulate in the Redis subscription buffer (see the sketch after this list).
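
A minimal sketch of those two changes, assuming a redis-py style pubsub object and the pub_sub_dict field shown above (illustrative class names with a Sketch suffix; not the literal diff in this PR):

    import logging

    logger = logging.getLogger(__name__)

    # Illustrative sketch of the two changes (not the literal diff in this PR).
    class PhysicalTableMIBUpdaterSketch:
        """Stand-in for the table updater that owns the per-entity updaters."""

        def __init__(self, physical_entity_updaters):
            self.physical_entity_updaters = physical_entity_updaters

        def reinit_data(self):
            # 1. Catch exceptions per updater so one failure does not prevent
            #    the other updaters (or a later retry) from reinitializing.
            for updater in self.physical_entity_updaters:
                try:
                    updater.reinit_data()
                except Exception as e:
                    logger.warning('reinit_data failed for %s: %s',
                                   type(updater).__name__, e)

    class PhysicalEntityCacheUpdaterSketch:
        """Stand-in for a single physical entity cache updater."""

        def __init__(self):
            self.pub_sub_dict = {}

        def reinit_data(self):
            # 2. Drain anything already queued and cancel the subscription so
            #    messages cannot pile up in the Redis client's output buffer
            #    (omem); the cache is then rebuilt from the database as before.
            for pubsub in self.pub_sub_dict.values():
                while pubsub.get_message():
                    pass                  # discard queued notifications
                pubsub.punsubscribe()     # cancel the pattern subscription
            self.pub_sub_dict.clear()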

- How to verify it
Tested on a Cisco chassis; the memory is no longer leaking.
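
Independently of the test run, the leak (or its absence) can be observed by watching the omem column of CLIENT LIST on the Redis instance, for example with redis-py; the host and port below are assumptions for illustration:

    # Illustrative check: print per-client output-buffer memory (omem); a
    # leaking subscription shows omem growing steadily over time.
    # Host/port are assumptions; adjust for the target device.
    import redis

    client = redis.Redis(host='127.0.0.1', port=6379)
    for info in client.client_list():
        if int(info.get('omem', 0)) > 0:
            print(info.get('addr'), info.get('cmd'), 'omem =', info['omem'])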

- Description for the changelog

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@yejianquan yejianquan assigned abdosi and unassigned abdosi Dec 16, 2024
@yejianquan yejianquan requested a review from abdosi December 16, 2024 11:23
@yejianquan yejianquan requested a review from liuh-80 December 16, 2024 11:47
@yejianquan yejianquan force-pushed the jianquanye/fix_memory_leak branch from 5463f95 to a8c0cfb Compare December 17, 2024 04:19
@yejianquan yejianquan force-pushed the jianquanye/fix_memory_leak branch from d77c535 to 1aff8bf Compare December 17, 2024 08:38
@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@sonic-net sonic-net deleted multiple comments from mssonicbld and the azure-pipelines bot Dec 17, 2024
@yejianquan
Contributor Author

yejianquan commented Dec 17, 2024

Removed all the sonicbld comments because new test cases were being added.
The new test cases have now been added appropriately.

@yejianquan
Contributor Author

Hi @Junchao-Mellanox @liuh-80 @SuvarnaMeenakshi, could you help review this PR?

@yejianquan
Contributor Author

@qiluo-msft for viz

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@SuvarnaMeenakshi SuvarnaMeenakshi merged commit 6a5c96d into sonic-net:master Dec 18, 2024
5 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-snmpagent that referenced this pull request Dec 18, 2024
@mssonicbld
Collaborator

Cherry-pick PR to 202405: #344

mssonicbld pushed a commit that referenced this pull request Dec 18, 2024