What would you like to be added:
The idea behind KEP-18 and the initial (intentionally simple) controller redesign was to let plans run to completion and to prevent anyone from changing the instance state in the middle of a deployment. Currently, such a change can even result in a plan not being triggered when it should be and the update not being applied, leaving the operator in an inconsistent state. We need to prevent that, either with a webhook or with more robust code in the controllers.
The following are some bits and pieces from the discussion under the webhook PR: #1133 (comment)
@alenkacz : one way of interrupting would be to introduce some kind of force parameter that would indicate the person knows what they are doing. We could do this in a way similar to KEP-20 or some other way.
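To make the force-parameter idea concrete, here is a minimal sketch of the check a webhook or controller might perform. Everything in it is hypothetical: `PlanStatus`, `ValidateUpdate`, `ErrPlanInProgress`, and the `force` flag are illustrative names, not the existing KUDO API, and the real design might well look different.

```go
package main

import (
	"errors"
	"fmt"
)

// PlanStatus is a hypothetical, simplified view of a plan's execution state.
type PlanStatus string

const (
	PlanNeverRun   PlanStatus = "NEVER_RUN"
	PlanInProgress PlanStatus = "IN_PROGRESS"
	PlanComplete   PlanStatus = "COMPLETE"
)

// ErrPlanInProgress is returned when an update is rejected because a plan
// is still running and the caller did not ask to force the change.
var ErrPlanInProgress = errors.New("instance update rejected: a plan is in progress")

// ValidateUpdate decides whether a change to the instance may be admitted.
// Assumption: "force" is an explicit opt-in by a person who knows that the
// running plan will be interrupted.
func ValidateUpdate(status PlanStatus, force bool) error {
	if status == PlanInProgress && !force {
		return ErrPlanInProgress
	}
	return nil
}

func main() {
	fmt.Println(ValidateUpdate(PlanInProgress, false)) // rejected
	fmt.Println(ValidateUpdate(PlanInProgress, true))  // admitted because forced
	fmt.Println(ValidateUpdate(PlanComplete, false))   // admitted, nothing is running
}
```

The point of the sketch is only that the default is to reject, and that interrupting a plan requires a deliberate signal from the user.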
@mpereira :
In my experience, one of the main sources of confusion for operators of SDK services is the implicit and opaque (sometimes undocumented) dependencies and conflicts between plans.
One thing that we'll need to think about (probably not in this PR) is how to deal with:
stuck plans (what is the behavior of starting a "deploy" plan when the previous "deploy" execution is stuck? for example, if a step can't complete because a container is crash-looping for some reason)
conflicting plans (what is the behavior of starting a "backup" plan when the "deploy" plan is either in progress or stuck on some step that could be seen as a dependency for any of the "backup" plan steps?)
For example, this SDK test shows a conflict scenario between the "deploy" and "recovery" plans in an SDK service.
The plan execution section of the SDK developer guide has some more (but not all) details about plan behavior.
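One way to make such conflicts less implicit would be to declare them explicitly and check them before a plan starts. The sketch below is only an illustration of that idea: the `conflicts` table, its entries, and `CanStart` are assumptions made up for this example, and they do not describe how the SDK or KUDO actually resolves plan conflicts.

```go
package main

import "fmt"

// conflicts is a hypothetical, explicitly declared table: for each plan it
// lists the plans that must not be active when it starts. The entries are
// illustrative only.
var conflicts = map[string][]string{
	"deploy":   {"deploy"},
	"backup":   {"deploy"},
	"recovery": {"deploy"},
}

// CanStart reports whether plan next may start given the currently active
// (in progress or stuck) plans. When it may not, the second return value
// names the plan that blocks it.
func CanStart(next string, active []string) (bool, string) {
	for _, blocked := range conflicts[next] {
		for _, a := range active {
			if a == blocked {
				return false, a
			}
		}
	}
	return true, ""
}

func main() {
	ok, blocker := CanStart("backup", []string{"deploy"})
	fmt.Println(ok, blocker)

	ok, blocker = CanStart("backup", nil)
	fmt.Println(ok, blocker)
}
```

With a declared table like this, a rejected plan can at least report which other plan is blocking it, which speaks to the "opaque" part of the problem. It does not answer what to do about a stuck plan.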
Feedback from folks like @kaiwalyajoshi and @takirala might also be invaluable when thinking about improving plan behavior in the future and preventing past mistakes.
@mpereira :
Regarding interrupting plan executions, I think it does make sense. The SDK has a concept of pausing plans. I think it'd also be useful to think about rolling back plan executions.
From the SDK developer guide:
Normally, steps progress through statuses in the following order:
PENDING → PREPARED → STARTING → COMPLETE
The status of a phase or a plan is determined by examination of the step elements. A step may enter an ERROR state when its construction is malformed or whenever the service author determines it to be appropriate. The WAITING state occurs when the operator of the service indicates that an element should be paused. An operator might want to pause a deployment for a multitude of reasons, including unexpected failures during an update.
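The quoted status model can be sketched as a small state machine. The step statuses and their normal order come from the quote above; the aggregation rules in `Aggregate` and the `InProgress` value are assumptions made for illustration, since the quote only says that a plan's status is derived from its steps, not how.

```go
package main

import "fmt"

// Status is a step (or derived plan) status.
type Status string

const (
	Pending  Status = "PENDING"
	Prepared Status = "PREPARED"
	Starting Status = "STARTING"
	Complete Status = "COMPLETE"
	Error    Status = "ERROR"
	Waiting  Status = "WAITING" // the operator paused the element

	// InProgress is a hypothetical aggregate status used only by this sketch.
	InProgress Status = "IN_PROGRESS"
)

// progression is the normal order of step statuses from the quote.
var progression = []Status{Pending, Prepared, Starting, Complete}

// Next returns the status that follows s in the normal progression.
// COMPLETE, ERROR and WAITING do not advance and are returned unchanged.
func Next(s Status) Status {
	for i := 0; i+1 < len(progression); i++ {
		if progression[i] == s {
			return progression[i+1]
		}
	}
	return s
}

// Aggregate derives a plan status from its step statuses.
// Assumed rules: any ERROR wins, then any WAITING, then all COMPLETE,
// then all PENDING; anything else counts as in progress.
// A plan with no steps is treated as COMPLETE.
func Aggregate(steps []Status) Status {
	waiting, complete, pending := false, 0, 0
	for _, s := range steps {
		switch s {
		case Error:
			return Error
		case Waiting:
			waiting = true
		case Complete:
			complete++
		case Pending:
			pending++
		}
	}
	switch {
	case waiting:
		return Waiting
	case complete == len(steps):
		return Complete
	case pending == len(steps):
		return Pending
	default:
		return InProgress
	}
}

func main() {
	fmt.Println(Next(Pending))
	fmt.Println(Aggregate([]Status{Complete, Starting, Pending}))
	fmt.Println(Aggregate([]Status{Complete, Waiting}))
}
```

A rollback, as suggested above, would need transitions that this forward-only progression does not have, which is part of why it deserves separate thought.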