avoid some probably unnecessary watchdog-reboots with pacemaker_remot… #14

Open · wants to merge 1 commit into main

Conversation

wenningerk commented:

…e by using the knowledge of the cib most recently received before a connection-loss

This is related to and actually needs:
ClusterLabs/pacemaker#1130

I've observed a lot of cases where using pacemaker-remote together with a watchdog
(without a shared block-device) led to a watchdog-reboot on the remote-node.
But in cases where there are no active resources on the remote-node, running into
a suicide is probably not needed.
I put together a couple of test-cases and tried to drive each to an at least improved
outcome, seen from my pov:

without Cluster-Watcher:

sbd is satisfied as long as pacemaker_remoted is running --> we need to enable the
Pacemaker-Watcher (see the config sketch below)
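
A minimal sketch of what enabling the Pacemaker-Watcher looks like, assuming the usual sysconfig location; the timeout value is just an example, only SBD_PACEMAKER is the option that actually enables the watcher:

```
# /etc/sysconfig/sbd (location may vary by distribution)
SBD_PACEMAKER=yes          # enable the Pacemaker-Watcher
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5     # seconds; example value
```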

Behaviour with Pacemaker-Watcher before the change in sbd and pacemaker:

1. the node running the remote-node-resource gets lost (e.g. virsh destroy) --> Watchdog-Reboot
    (the timeout on the proxy connection that just died takes way too long before a new connection is retried)
2. graceful shutdown of pacemaker on the cluster-nodes one by one --> Watchdog-Reboot
    (when the last node goes down, although resources were shut down in a clean way)
3. pcs resource disable {remote_node} --> Watchdog-Reboot
    (loses the cib connection although all resources were actually shut down in a clean way)
4. all cluster nodes are lost at once --> Watchdog-Reboot
    (yes, that is the one we want to happen)
5. all cluster-nodes but the one running the remote-node-resource are lost --> Watchdog-Reboot
    (I would expect a graceful shutdown of the resources running on the partial cluster without quorum)

Behaviour with the Pacemaker-Patch setting TCP_USER_TIMEOUT to 1/2 of the SBD-Watchdog-Timeout (see the sketch below):

1. fixed, as long as the remote-node-resource is taken over by another cluster-node quickly enough
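
A minimal sketch of the socket option involved, assuming `fd` is the connected remote-proxy TCP socket; the function name and where Pacemaker applies it are assumptions here, only setsockopt() and TCP_USER_TIMEOUT (Linux-specific) are givens:

```c
#include <netinet/in.h>    /* IPPROTO_TCP */
#include <netinet/tcp.h>   /* TCP_USER_TIMEOUT (Linux >= 2.6.37) */
#include <sys/socket.h>    /* setsockopt() */

/* Sketch, not the actual Pacemaker diff: cap how long data may remain
 * unacknowledged on the remote-proxy connection, so a dead peer is
 * noticed well before the sbd watchdog fires. */
static int
set_proxy_user_timeout(int fd, unsigned int sbd_watchdog_timeout_sec)
{
    /* TCP_USER_TIMEOUT expects milliseconds; use half the watchdog timeout */
    unsigned int timeout_ms = sbd_watchdog_timeout_sec * 1000U / 2U;

    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}
```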

Behaviour with the Pacemaker-Patch + an SBD-Patch that, on cib-connection-loss, checks the
remaining cib-info for resources running on the remote-node (a sketch of the idea follows this list):

1. fixed as above
2. fixed, as by the time the cib-connection is finally lost, all resources have been brought
    down gracefully on the remote-node as well
3. fixed, as resources on the remote-node are brought down gracefully before the connection is cut
4. still a wanted Watchdog-Reboot, as the cib-connection is cut while resources are running on the
    remote-node and no other cluster node is taking over
5. fixed, as long as the sbd-watchdog-timeout is long enough that the remote-node-resource is
    shut down properly before the watchdog fires
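
A rough sketch of the idea, not the actual diff: copy_xml(), free_xml(), get_xpath_object() and crm_strdup_printf() come from Pacemaker's headers, while the function name, the xpath and the decision logic here are simplified illustrations:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <syslog.h>          /* LOG_DEBUG */
#include <crm/common/xml.h>  /* copy_xml(), free_xml(), get_xpath_object() */
#include <crm/common/util.h> /* crm_strdup_printf() */

/* Sketch only: on losing the cib connection, consult the most recently
 * received CIB.  If it shows no resource active on this remote-node, a
 * suicide buys nothing and can be skipped; with no CIB knowledge at all
 * we err towards self-fencing. */
static bool
suicide_needed_on_cib_loss(xmlNode *current_cib, const char *node_name)
{
    bool needed = true;

    if (current_cib == NULL) {
        return true;            /* no knowledge -> stay on the safe side */
    }

    xmlNode *cib_copy = copy_xml(current_cib);

    /* Deliberately simplified: look for any successful start of a resource
     * on this node, without checking for a later stop. */
    char *xpath = crm_strdup_printf(
        "//node_state[@uname='%s']"
        "//lrm_rsc_op[@operation='start'][@rc-code='0']",
        node_name);

    if (get_xpath_object(xpath, cib_copy, LOG_DEBUG) == NULL) {
        needed = false;         /* nothing was running here */
    }

    free(xpath);
    free_xml(cib_copy);
    return needed;
}
```

Note that the xpath above ignores subsequent stop operations; a real check has to walk the lrm history of the node more carefully.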

The review comment below refers to this hunk of the commit:

```c
 * on a remote-node without active resources
 */
if (current_cib) {
    cib_copy = copy_xml(current_cib);
```
Member:
Why make a copy if you're going to free the original at the end anyway?
