We ran into an issue today where a node had been partially deployed during or after some hardware diagnostics, and the daily `backup-node_configs.sh` script backed up incorrect, broken Puppet client configs and SSL certificates. Subsequent reboots of this server then caused the node-config restore to put the broken Puppet configs and SSL certificates back in place.
A couple of recommendations to resolve this and make it easier to recover a node from this issue:
Update the `backup-node_configs.sh` script to only back up the node if xCAT reports that the node's `status` is `booted`. If a node is in a sit-and-spin type postscript, its status should instead be `postbooting`, and if a postscript exits with a bad error code, its status should be `failed`. So checking for a status of `booted` should allow us to only take backups when the node is in the expected state.
Would it be possible to keep one old version of each node's various node_configs backup files on the xCAT server? I don't have a proposed solution to implement this, but being able to at least see what changed with a node-configs backup file in the last few days would be helpful, as well as being able to restore old versions of the node-configs backup files.
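Both recommendations could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the existing script: `node_status`, `should_backup_node`, and `rotate_backup` are hypothetical helper names, the `.prev` suffix is an arbitrary choice, and it assumes that `lsdef -t node -o <node> -i status` prints a `status=<value>` line for the node.

```shell
#!/bin/bash

# Hypothetical helper: print the xCAT status of a node (e.g. "booted").
# Assumes lsdef output contains a line such as "    status=booted".
node_status() {
    local node="$1"
    lsdef -t node -o "$node" -i status 2>/dev/null \
        | sed -n 's/^[[:space:]]*status=//p'
}

# Return 0 only when xCAT reports the node as booted.
# An empty or unknown status (including a failed lsdef call) counts as "do not back up".
should_backup_node() {
    local node="$1"
    [ "$(node_status "$node")" = "booted" ]
}

# Hypothetical helper: keep one previous copy of a backup file
# before the new backup overwrites it.
rotate_backup() {
    local file="$1"
    if [ -f "$file" ]; then
        cp -p -- "$file" "$file.prev"
    fi
}
```

In a per-node loop, the script would call `should_backup_node "$node"` first and skip the node (ideally with a logged message) when it returns non-zero, then call `rotate_backup` on each existing backup file just before writing the new one. Copying rather than moving means the current backup stays in place if the new backup attempt fails partway.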
I'm somewhat surprised we hadn't run into this issue before. A couple of other examples of when this might be an issue include:
rolling reboots of a cluster
scheduled, automated reboots of nodes within a cluster (e.g. weekly, monthly, etc.), which might take place close to the time the daily `backup-node_configs.sh` runs are scheduled