Design a solution for interrupting plans #1140

Open
alenkacz opened this issue Dec 5, 2019 · 1 comment
Comments

@alenkacz
Contributor

alenkacz commented Dec 5, 2019

What would you like to be added:
The idea behind KEP-18 and the initial (intentionally simple) controller redesign was to let plans run to completion and to disallow changes to the instance state in the middle of a deployment. Currently, such a change can even cause a plan not to be triggered when it should be, so the update is never applied and the operator is left in an inconsistent state. We need to prevent that, either with a webhook or with more robust code in the controllers.

The following are some bits and pieces from the discussion under the webhook PR: #1133 (comment)

@alenkacz : one way of interrupting would be to introduce some kind of force parameter that would indicate the person knows what they are doing. We could do this in a way similar to KEP-20 or some other way.
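
To make the force idea concrete, here is a minimal sketch of the kind of check a webhook or controller could run. It is not KUDO code: `Instance`, `active_plan`, `admit_update`, and the `force` flag are all hypothetical names, and what should happen to the interrupted plan is left open on purpose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    """Hypothetical, simplified view of an operator instance."""
    parameters: dict
    active_plan: Optional[str] = None  # name of the plan currently running, if any


class UpdateRejected(Exception):
    """Raised when an update arrives while a plan is still running."""


def admit_update(instance, new_parameters, force=False):
    """Return the updated instance, or raise if a plan is in progress.

    `force` is the hypothetical escape hatch: the caller states that
    interrupting the running plan is intentional.
    """
    if new_parameters == instance.parameters:
        return instance  # nothing changes, so there is nothing to guard
    if instance.active_plan is not None and not force:
        raise UpdateRejected(
            f"plan {instance.active_plan!r} is in progress; "
            "retry later or pass force=True"
        )
    # What to do with the interrupted plan (abort, roll back, restart) is the
    # open design question; this sketch simply clears it.
    return Instance(parameters=dict(new_parameters), active_plan=None)
```

With this shape, an update during a running "deploy" plan is rejected by default, and only an explicit `force=True` lets it through.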

@mpereira :
In my experience, one of the main sources of confusion for operators of SDK services is the implicit and opaque (sometimes undocumented) dependencies and conflicts between plans.

One thing that we'll need to think about (probably not in this PR) is how to deal with:

- stuck plans (what is the behavior of starting a "deploy" plan when the previous "deploy" execution is stuck? for example, if a step can't complete because a container is crash-looping for some reason)
- conflicting plans (what is the behavior of starting a "backup" plan when the "deploy" plan is either in progress or stuck on some step that could be seen as a dependency for any of the "backup" plan steps?)
For example, this SDK test shows a conflict scenario between the "deploy" and "recovery" plans in an SDK service.
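
One possible way to make such conflicts explicit rather than implicit is to declare them up front and check them before a plan starts. The sketch below is only an illustration: the `CONFLICTS` table, the `STUCK` state, and the `blockers` helper are assumptions, not existing SDK or KUDO behavior.

```python
from enum import Enum


class PlanState(Enum):
    IN_PROGRESS = "IN_PROGRESS"
    STUCK = "STUCK"  # hypothetical: some step has stopped making progress
    COMPLETE = "COMPLETE"


# Hypothetical declaration: plan name -> plans it must not overlap with.
CONFLICTS = {
    "deploy": {"deploy"},
    "backup": {"deploy"},
    "recovery": {"deploy"},
}


def blockers(candidate, current):
    """Return the sorted names of unfinished plans that conflict with `candidate`.

    `current` maps plan names to their PlanState. A stuck plan still counts
    as a blocker, because its steps may be dependencies of the candidate.
    """
    declared = CONFLICTS.get(candidate, set())
    return sorted(
        name
        for name, state in current.items()
        if name in declared and state is not PlanState.COMPLETE
    )
```

An empty result means the candidate plan may start. A non-empty result gives the operator a concrete list of plans to resolve first, which is the information that tends to be missing when conflicts are implicit.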

The plan execution section of the SDK developer guide has some more (but not all) details about plan behavior.

Feedback from folks like @kaiwalyajoshi and @takirala might also be invaluable when thinking about improving plan behavior in the future and preventing past mistakes.

@mpereira :
Regarding interrupting plan executions, I think it does make sense. The SDK has
a concept of pausing plans. I think it'd also be useful to think about rolling
back plan executions.

From the SDK developer guide:

Normally, steps progress through statuses in the following order:

PENDING → PREPARED → STARTING → COMPLETE

The status of a phase or a plan is determined by examination of the step elements. A step may enter an ERROR state when its construction is malformed or whenever the service author determines it to be appropriate. The WAITING state occurs when the operator of the service indicates that an element should be paused. An operator might want to pause a deployment for a multitude of reasons, including unexpected failures during an update.
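
As a rough illustration of the quoted model, the sketch below encodes the normal step order and derives a plan status from its steps. The aggregation rules and the `IN_PROGRESS` aggregate state are simplifying assumptions made for this example; the SDK's actual rules may differ.

```python
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    PREPARED = "PREPARED"
    STARTING = "STARTING"
    COMPLETE = "COMPLETE"
    ERROR = "ERROR"
    WAITING = "WAITING"
    IN_PROGRESS = "IN_PROGRESS"  # assumed aggregate-only state


# Normal progression of a step, as quoted from the developer guide.
NORMAL_ORDER = [Status.PENDING, Status.PREPARED, Status.STARTING, Status.COMPLETE]


def advance(status):
    """Move a step one position along the normal order; other states stay put."""
    if status in NORMAL_ORDER[:-1]:
        return NORMAL_ORDER[NORMAL_ORDER.index(status) + 1]
    return status


def aggregate(step_statuses):
    """Derive a phase or plan status from its steps (assumed rules).

    Assumptions: errors dominate, then pauses; an empty list counts as COMPLETE.
    """
    if any(s is Status.ERROR for s in step_statuses):
        return Status.ERROR
    if any(s is Status.WAITING for s in step_statuses):
        return Status.WAITING
    if all(s is Status.COMPLETE for s in step_statuses):
        return Status.COMPLETE
    if all(s is Status.PENDING for s in step_statuses):
        return Status.PENDING
    return Status.IN_PROGRESS
```

In this model a paused (WAITING) step does not advance on its own, which is what makes pausing a natural building block for interrupting a plan.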

@alenkacz
Contributor Author

alenkacz commented Dec 5, 2019

I think the next step here is to start a KEP.
