Design a solution for interrupting plans #1140

alenkacz · 2019-12-05T14:32:30Z

What would you like to be added:
The idea behind KEP-18 and the initial (intentionally simple) controller redesign was to let plans run to completion and to disallow changing the instance state in the middle of a deployment. Currently, such a change could even lead to a plan not being triggered as it should and the update not being applied, leaving the operator in an inconsistent state. We need to prevent that, either with a webhook or with more robust code in the controllers.

Following are some bits and pieces from the discussion under the webhook PR: #1133 (comment)

@alenkacz : one way of interrupting would be to introduce some kind of force parameter that would indicate the person knows what they are doing. We could do this in a way similar to KEP-20 or some other way.
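To make the force idea concrete, here is a minimal Go sketch of the kind of check a webhook or controller could run. Everything in it is hypothetical: the `ExecutionStatus` values, the `UpdateRequest` type, its `Force` field, and `ValidateUpdate` are names invented for illustration and are not part of the existing API.

```go
package main

import "fmt"

// ExecutionStatus is a simplified, hypothetical plan status used only in this sketch.
type ExecutionStatus string

const (
	StatusInProgress ExecutionStatus = "IN_PROGRESS"
	StatusComplete   ExecutionStatus = "COMPLETE"
)

// UpdateRequest models a change that would trigger a plan on an instance.
// Force is the hypothetical opt-in flag from the discussion; it is an
// assumption of this sketch, not an existing field.
type UpdateRequest struct {
	Plan  string
	Force bool
}

// ValidateUpdate rejects a request while another plan is in progress,
// unless the caller explicitly sets Force.
func ValidateUpdate(activePlan string, status ExecutionStatus, req UpdateRequest) error {
	if status != StatusInProgress || req.Force {
		return nil
	}
	return fmt.Errorf("plan %q is in progress; refusing to start %q without force",
		activePlan, req.Plan)
}
```

With this shape, a "backup" request arriving while "deploy" is in progress would be rejected by default and accepted only when `Force` is set. What should happen to the interrupted plan afterwards is left open here, since that is the design question this issue is about.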

@mpereira :
In my experience, one of the main sources of confusion for operators of SDK services is the implicit and opaque (sometimes undocumented) dependencies and conflicts between plans.

One thing that we'll need to think about (probably not in this PR) is how to deal with:

- Stuck plans: what is the behavior of starting a "deploy" plan when the previous "deploy" execution is stuck? For example, if a step can't complete because a container is crash-looping for some reason.
- Conflicting plans: what is the behavior of starting a "backup" plan when the "deploy" plan is either in progress or stuck on some step that could be seen as a dependency for any of the "backup" plan steps?

For example, this SDK test shows a conflict scenario between the "deploy" and "recovery" plans in an SDK service.

The plan execution section of the SDK developer guide has some more (but not all) details about plan behavior.

Feedback from folks like @kaiwalyajoshi and @takirala might also be invaluable when thinking about improving plan behavior in the future and preventing past mistakes.

@mpereira :
Regarding interrupting plan executions, I think it does make sense. The SDK has a concept of pausing plans. I think it'd also be useful to think about rolling back plan executions.

From the SDK developer guide:

Normally, steps progress through statuses in the following order:

PENDING → PREPARED → STARTING → COMPLETE

The status of a phase or a plan is determined by examination of the step elements. A step may enter an ERROR state when its construction is malformed or whenever the service author determines it to be appropriate. The WAITING state occurs when the operator of the service indicates that an element should be paused. An operator might want to pause a deployment for a multitude of reasons, including unexpected failures during an update.
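The quoted status model can be sketched in Go as follows. The six status names and the normal ordering come from the quote above; the `IN_PROGRESS` value, the type and function names, and the precedence rules in `Aggregate` (ERROR wins, then WAITING) are assumptions made for illustration, not the SDK's actual implementation.

```go
package main

// StepStatus mirrors the status names quoted from the SDK developer guide.
type StepStatus string

const (
	StepPending  StepStatus = "PENDING"
	StepPrepared StepStatus = "PREPARED"
	StepStarting StepStatus = "STARTING"
	StepComplete StepStatus = "COMPLETE"
	StepError    StepStatus = "ERROR"
	StepWaiting  StepStatus = "WAITING"
	// StepInProgress is an assumption of this sketch for "some steps done, some not".
	StepInProgress StepStatus = "IN_PROGRESS"
)

// normalOrder is the usual progression: PENDING, PREPARED, STARTING, COMPLETE.
var normalOrder = []StepStatus{StepPending, StepPrepared, StepStarting, StepComplete}

// Next returns the status that follows s in the normal progression.
// The boolean is false for COMPLETE, ERROR, WAITING, and unknown values.
func Next(s StepStatus) (StepStatus, bool) {
	for i, cur := range normalOrder[:len(normalOrder)-1] {
		if cur == s {
			return normalOrder[i+1], true
		}
	}
	return s, false
}

// Aggregate derives a phase or plan status from its steps.
// Assumed precedence: any ERROR, then any WAITING, then all COMPLETE,
// then all PENDING, otherwise IN_PROGRESS. An empty list counts as COMPLETE.
func Aggregate(steps []StepStatus) StepStatus {
	allComplete, allPending, sawWaiting := true, true, false
	for _, s := range steps {
		if s == StepError {
			return StepError
		}
		if s == StepWaiting {
			sawWaiting = true
		}
		if s != StepComplete {
			allComplete = false
		}
		if s != StepPending {
			allPending = false
		}
	}
	switch {
	case sawWaiting:
		return StepWaiting
	case allComplete:
		return StepComplete
	case allPending:
		return StepPending
	default:
		return StepInProgress
	}
}
```

Under these assumed rules, a paused step surfaces as WAITING at the plan level even when other steps have completed, which is the kind of signal an interrupt or rollback mechanism would need to act on.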

alenkacz · 2019-12-05T14:32:44Z

I think the next step here is to start a KEP

alenkacz added priority/high kind/enhancement labels Dec 5, 2019
