ec2 ocf resource retry #33

Open
nasjomach opened this issue Sep 30, 2021 · 2 comments

Comments

@nasjomach

nasjomach commented Sep 30, 2021

Concerns: cluster-glue/lib/plugins/stonith/external/ec2

It seems to me that there is no retry mechanism in the EC2 OCF script.
AWS EC2 API calls can be throttled if more than 10000 API requests per second are made.
In this case the script does not report any status and considers the resource to be in a bad state, which ends with the STONITH device being stopped.

Performing a "resource cleanup" operation brings the STONITH device back to an operational state after such failures.

/var/log/messages
2021-09-16T16:02:04.751248+00:00 external/ec2(res_AWS_STONITH)[31700]: info: status check for is
<-- Missing instance status report after "is" keyword

2021-09-16T16:02:04.760725+00:00 external/ec2(res_AWS_STONITH)[31694]: WARN: Already fenced (Instance status = ). Aborting fence attempt.
2021-09-16T16:02:13.742017+00:00 external/ec2(res_AWS_STONITH)[32004]: ERROR: Operation status failed: 1

Some kind of fault tolerance would be nice to have, I guess.
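
To make the request concrete, here is a minimal sketch of the kind of retry being asked for. It is written in Python purely for illustration and is not the plugin's code. `ThrottledError`, `retry_with_backoff`, and all of its parameters are hypothetical names, and treating an empty status string as "no answer" is an assumption based on the empty `Instance status = ` in the log above.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttled or empty API response."""


def retry_with_backoff(call, attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call `call()` until it returns a non-empty status or attempts run out.

    The delay doubles after every failed attempt, capped at `max_delay`.
    """
    delay = base_delay
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            status = call()
            if status:
                return status
            # Assumption: an empty status means "no answer", not a real state.
            last_error = ThrottledError("empty status")
        except ThrottledError as exc:
            last_error = exc
        if attempt < attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    # Every attempt failed: surface the last error instead of an empty status.
    raise last_error
```

The point of the sketch is that an empty answer is retried a bounded number of times and then reported as an error, rather than being passed on as if it were an instance state.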

@dmuhamedagic
Collaborator

dmuhamedagic commented Oct 6, 2021

IIRC, none of the stonith plugins does that, i.e. runs in a loop until the status is correct, so this would set a precedent. A question: how often do you check the status? If the checks are too frequent and the device (in this case AWS) is flaky, you could try increasing the interval.

@Thr3d

Thr3d commented Mar 28, 2022

#35 addresses this.
The API bucket the agent uses is shared across the account's whole region and is fairly small, so simply extending the interval doesn't help much beyond a certain point.
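
A toy simulation may help show why. This is only a sketch: `TokenBucket` and `simulate_checks` are hypothetical names, and the capacity, refill rate, and background load are made-up numbers, not AWS's real limits. It models a single bucket that other clients in the same account and region drain faster than it refills.

```python
class TokenBucket:
    """Toy token bucket: one token per API call, refilled at a fixed rate."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity

    def advance(self, seconds):
        # Refill, but never above the bucket's capacity.
        self.tokens = min(self.capacity,
                          self.tokens + seconds * self.refill_per_second)

    def try_consume(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def simulate_checks(interval, duration, other_calls_per_second, bucket):
    """Run one status check every `interval` seconds for `duration` seconds.

    Returns (throttled_checks, total_checks).
    """
    throttled = 0
    total = 0
    for second in range(duration):
        bucket.advance(1)
        # Assumption: other clients sharing the bucket get their calls in first.
        for _ in range(other_calls_per_second):
            bucket.try_consume()
        if second % interval == 0:
            total += 1
            if not bucket.try_consume():
                throttled += 1
    return throttled, total
```

Under these assumptions, lengthening the interval reduces how many checks are made, but once the shared bucket is drained each remaining check is still throttled, because the agent's own call rate was never what emptied the bucket.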
