It seems to me that there is no retry mechanism in the EC2 OCF script.
AWS EC2 API calls can be throttled if more than 10000 API requests per second are made.
In this case the script does not report any status and treats the resource as being in a bad state, which ends with the STONITH device being stopped.
Performing a "resource cleanup" operation brings the STONITH device back to an operational state after such failures.
/var/log/messages
2021-09-16T16:02:04.751248+00:00 external/ec2(res_AWS_STONITH)[31700]: info: status check for is
<-- Missing instance status report after "is" keyword
2021-09-16T16:02:04.760725+00:00 external/ec2(res_AWS_STONITH)[31694]: WARN: Already fenced (Instance status = ). Aborting fence attempt.
2021-09-16T16:02:13.742017+00:00 external/ec2(res_AWS_STONITH)[32004]: ERROR: Operation status failed: 1
Some kind of fault tolerance, such as retrying the API call, would be nice to have.
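To illustrate the kind of fault tolerance meant here, below is a minimal sketch of a retry wrapper with exponential backoff. It is not taken from the existing plugin: `retry_nonempty` is a hypothetical helper name, and the attempt count and delay are arbitrary. It treats an empty result the same as a failed call, since the log above shows the status coming back empty.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of the existing external/ec2 plugin.
# Usage: retry_nonempty MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 AND prints non-empty output, doubling the
# sleep between attempts. Prints the output and returns 0 on success;
# returns 1 once MAX_ATTEMPTS is exhausted.
retry_nonempty() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1 output
    while :; do
        # An empty reply (e.g. a throttled API call) counts as a failure.
        if output=$("$@" 2>/dev/null) && [ -n "$output" ]; then
            printf '%s\n' "$output"
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Example call (assumes the status is read via the AWS CLI; the exact query
# the plugin uses may differ):
# status=$(retry_nonempty 5 1 aws ec2 describe-instances \
#     --instance-ids "$instance" \
#     --query 'Reservations[*].Instances[*].State.Name' --output text)
```

With a wrapper like this, the status operation would only fail after several consecutive empty or failed replies rather than on the first throttled call. The total wait must stay below the operation timeout configured in the cluster.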
IIRC, none of the stonith plugins does that, i.e. runs in a loop until the status is correct, so this would set a precedent. A question: how often do you check the status? If it's too often and the device (in this case AWS) is flaky, then you may try increasing the interval.
#35 Addresses this.
The API bucket the agent uses is shared across the account's whole region and is fairly small, so simply extending the interval doesn't help much after a point.
Concerns: cluster-glue/lib/plugins/stonith/external/ec2