avoid some probably unnecessary watchdog-reboots with pacemaker_remot… #14

Open · wants to merge 1 commit into main

Conversation

wenningerk commented:

…e by using the knowledge of the cib most recently received before a connection-loss

This is related to and actually needs:
ClusterLabs/pacemaker#1130

I've observed a lot of cases where using pacemaker-remote together with a watchdog
(without a shared block-device) led to a watchdog-reboot on the remote-node.
But in cases where there are no active resources on the remote-node, running into
a suicide is probably not needed.
I put together a couple of test-cases and tried to drive each to an at least improved
outcome, seen from my pov:

without Cluster-Watcher:

sbd is satisfied as long as pacemaker_remoted is running --> we need to enable the
Pacemaker-Watcher (see the config sketch below)
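
A minimal sketch of what enabling the Pacemaker-Watcher looks like, assuming the usual sysconfig location; the timeout value is just an example, only SBD_PACEMAKER is the option that actually enables the watcher:

```
# /etc/sysconfig/sbd (location may vary by distribution)
SBD_PACEMAKER=yes          # enable the Pacemaker-Watcher
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5     # seconds; example value
```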

Behaviour with Pacemaker-Watcher before the change in sbd and pacemaker:

1. the node running the remote-node-resource gets lost (e.g. virsh destroy) --> Watchdog-Reboot
    (the timeout on the proxy connection that just died takes way too long before a new connection is retried)
2. graceful shutdown of pacemaker on the cluster-nodes one by one --> Watchdog-Reboot
    (when the last node goes down, although resources were shut down in a clean way)
3. pcs resource disable {remote_node} --> Watchdog-Reboot
    (loses the cib connection although all resources were actually shut down in a clean way)
4. all cluster nodes are lost at once --> Watchdog-Reboot
    (yes, that is the one we want to happen)
5. all cluster-nodes but the one running the remote-node-resource are lost --> Watchdog-Reboot
    (I would expect a graceful shutdown of the resources running on the partial cluster without quorum)

Behaviour with the Pacemaker-Patch setting TCP_USER_TIMEOUT to 1/2 of the SBD-Watchdog-Timeout (see the sketch below):

1. fixed, as long as the remote-node-resource is taken over by another cluster-node quickly enough
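
A minimal sketch of the socket option involved, assuming `fd` is the connected remote-proxy TCP socket; the function name and where Pacemaker applies it are assumptions here, only setsockopt() and TCP_USER_TIMEOUT (Linux-specific) are givens:

```c
#include <netinet/in.h>    /* IPPROTO_TCP */
#include <netinet/tcp.h>   /* TCP_USER_TIMEOUT (Linux >= 2.6.37) */
#include <sys/socket.h>    /* setsockopt() */

/* Sketch, not the actual Pacemaker diff: cap how long data may remain
 * unacknowledged on the remote-proxy connection, so a dead peer is
 * noticed well before the sbd watchdog fires. */
static int
set_proxy_user_timeout(int fd, unsigned int sbd_watchdog_timeout_sec)
{
    /* TCP_USER_TIMEOUT expects milliseconds; use half the watchdog timeout */
    unsigned int timeout_ms = sbd_watchdog_timeout_sec * 1000U / 2U;

    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}
```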

Behaviour with the Pacemaker-Patch + an SBD-Patch that, on cib-connection-loss, checks the
remaining cib-info for resources running on the remote-node (a sketch of the idea follows this list):

1. fixed as above
2. fixed, as by the time the cib-connection is finally lost, all resources have been brought
    down gracefully on the remote-node as well
3. fixed, as resources on the remote-node are brought down gracefully before the connection is cut
4. still a wanted Watchdog-Reboot, as the cib-connection is cut while resources are running on the
    remote-node and no other cluster node is taking over
5. fixed, as long as the sbd-watchdog-timeout is long enough that the remote-node-resource is
    shut down properly before the watchdog fires
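
A rough sketch of the idea, not the actual diff: copy_xml(), free_xml(), get_xpath_object() and crm_strdup_printf() come from Pacemaker's headers, while the function name, the xpath and the decision logic here are simplified illustrations:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <syslog.h>          /* LOG_DEBUG */
#include <crm/common/xml.h>  /* copy_xml(), free_xml(), get_xpath_object() */
#include <crm/common/util.h> /* crm_strdup_printf() */

/* Sketch only: on losing the cib connection, consult the most recently
 * received CIB.  If it shows no resource active on this remote-node, a
 * suicide buys nothing and can be skipped; with no CIB knowledge at all
 * we err towards self-fencing. */
static bool
suicide_needed_on_cib_loss(xmlNode *current_cib, const char *node_name)
{
    bool needed = true;

    if (current_cib == NULL) {
        return true;            /* no knowledge -> stay on the safe side */
    }

    xmlNode *cib_copy = copy_xml(current_cib);

    /* Deliberately simplified: look for any successful start of a resource
     * on this node, without checking for a later stop. */
    char *xpath = crm_strdup_printf(
        "//node_state[@uname='%s']"
        "//lrm_rsc_op[@operation='start'][@rc-code='0']",
        node_name);

    if (get_xpath_object(xpath, cib_copy, LOG_DEBUG) == NULL) {
        needed = false;         /* nothing was running here */
    }

    free(xpath);
    free_xml(cib_copy);
    return needed;
}
```

Note that the xpath above ignores subsequent stop operations; a real check has to walk the lrm history of the node more carefully.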

The review comment below refers to this hunk of the commit:

```c
 * on a remote-node without active resources
 */
if (current_cib) {
    cib_copy = copy_xml(current_cib);
```
Member:
Why make a copy if you're going to free the original at the end anyway?
