Issue 153: Add Rollback support for Pravega Cluster #255

Merged — 18 commits, Sep 20, 2019
Changes from 2 commits
63 changes: 63 additions & 0 deletions doc/rollback-cluster.md
@@ -0,0 +1,63 @@
# Pravega cluster rollback

This document describes how automated rollback of a Pravega cluster is implemented by the operator, preserving the cluster's state and data whenever possible.

## Failing an Upgrade

An upgrade can fail for any of the following reasons:

1. Incorrect configuration (wrong quota, permissions, limit ranges)
2. Network issues (e.g. image pull errors)
3. Kubernetes cluster issues
4. Application issues (runtime misconfiguration or code bugs)

An upgrade failure can manifest as a pod staying in the `Pending` state indefinitely, or continuously restarting or crashing (`CrashLoopBackOff`).
A component deployment failure needs to be tracked and mapped to an upgrade failure for the Pravega cluster.
The operator tries to fail fast by explicitly checking for common causes of deployment failure, such as image pull errors or the `CrashLoopBackOff` state, and failing the upgrade if any pod runs into such a state during the upgrade, as in the sketch below.
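
The exact failure check is not shown in this documentation change. The sketch below illustrates the idea, assuming a helper (`isPodFaulty`, an illustrative name, not part of this diff) that inspects a pod's container statuses for the common fatal waiting reasons:

```go
import corev1 "k8s.io/api/core/v1"

// isPodFaulty is an illustrative helper: it reports whether one of the pod's
// containers is stuck in a waiting state that the upgrade logic treats as
// fatal, and returns the offending reason.
func isPodFaulty(pod *corev1.Pod) (bool, string) {
	for _, status := range pod.Status.ContainerStatuses {
		if status.State.Waiting != nil {
			switch status.State.Waiting.Reason {
			case "ErrImagePull", "ImagePullBackOff", "CrashLoopBackOff":
				return true, status.State.Waiting.Reason
			}
		}
	}
	return false, ""
}
```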

The following Pravega cluster status condition indicates an upgrade failure:

```
ClusterConditionType: Error
Status: True
Reason: UpgradeFailed
Message: <Details of exception/cause of failure>
```
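
The condition is recorded through the status helpers added to `status.go` in this PR. The fragment below is an illustrative call site (assuming a cluster object `p` and a deployment error `err` in scope), not code taken from the diff:

```go
// Record the upgrade failure on the cluster status; the reason
// "UpgradeFailed" is what the rollback logic keys on later.
p.Status.SetErrorConditionTrue("UpgradeFailed",
	fmt.Sprintf("failed to upgrade segment store: %v", err))
```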

## Rollback Trigger

A rollback is triggered by the upgrade failure condition, i.e. the cluster moving to
`ClusterConditionType: Error` with
`Reason: UpgradeFailed`.
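
In code, this trigger corresponds to the `HasUpgradeFailed()` helper added to `ClusterStatus` in this PR. A rough sketch of the reconcile-time check (the full version is `rollbackFailedUpgrade` in `pravegacluster_controller.go` further down in this diff):

```go
// Sketch only: assumes a cluster object p in scope inside the reconciler.
if p.Status.HasUpgradeFailed() {
	previousVersion, err := p.Status.GetLastVersion()
	if err != nil {
		return fmt.Errorf("no previous cluster version to roll back to: %v", err)
	}
	// roll every component back to previousVersion, one pod at a time
}
```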

## Rollback Implementation

When a rollback starts, the cluster moves into the `RollbackInProgress` cluster condition.
Once the rollback completes, this condition is set to `false`.
The components are rolled back in the following order (see the sketch after this list):

1. BookKeeper
2. Pravega Segment Store
3. Pravega Controller
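
A hypothetical outline of that ordering; the per-component helpers named here are illustrative and not part of this diff, and `r`, `p` and `previousVersion` are assumed to be in scope:

```go
// Hypothetical rollback loop: rollbackBookkeeper, rollbackSegmentStore and
// rollbackController are illustrative names, not functions from this PR.
for _, rollback := range []func(*pravegav1alpha1.PravegaCluster, string) error{
	r.rollbackBookkeeper,
	r.rollbackSegmentStore,
	r.rollbackController,
} {
	if err := rollback(p, previousVersion); err != nil {
		return err
	}
}
```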

A new field, `versionHistory`, has been added to the Pravega `ClusterStatus` to maintain the history of previous cluster versions:
```
VersionHistory []string `json:"versionHistory,omitempty"`
```
Currently, the operator only supports automated rollback to the previous cluster version.
Rollback to older versions may be supported later.
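
The history is maintained through the `AddToVersionHistory` and `GetLastVersion` helpers added to `ClusterStatus` in this diff; the call sites below are illustrative:

```go
// After an upgrade completes successfully, record the new version so a
// later failed upgrade can be rolled back to it.
p.Status.AddToVersionHistory(p.Status.CurrentVersion)

// During rollback, the target is simply the last recorded version.
previousVersion, err := p.Status.GetLastVersion()
if err != nil {
	return fmt.Errorf("no previous version to roll back to: %v", err)
}
```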

Rollback involves moving all components in the cluster back to the previous cluster version. As with an upgrade, the operator rolls back one component at a time and one pod at a time to maintain high availability.

If the rollback completes successfully, the cluster state goes back to `PodsReady`, which means the cluster is in a stable state again.
If the rollback fails, the cluster moves to the `RollbackError` state and the user is prompted for manual intervention.

## Pending tasks


## Prerequisites
60 changes: 59 additions & 1 deletion pkg/apis/pravega/v1alpha1/status.go
@@ -11,6 +11,8 @@
package v1alpha1

import (
"fmt"
"log"
"time"

corev1 "k8s.io/api/core/v1"
@@ -21,6 +23,7 @@ type ClusterConditionType string
const (
ClusterConditionPodsReady ClusterConditionType = "PodsReady"
ClusterConditionUpgrading = "Upgrading"
ClusterConditionRollback = "RollbackInProgress"
ClusterConditionError = "Error"
)

@@ -36,6 +39,8 @@ type ClusterStatus struct {
// If the cluster is not upgrading, TargetVersion is empty.
TargetVersion string `json:"targetVersion,omitempty"`

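// VersionHistory records, in order, the versions this cluster has run,
// so that a failed upgrade can be rolled back to the last known good version.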
VersionHistory []string `json:"versionHistory,omitempty"`

// Replicas is the number of desired replicas in the cluster
Replicas int32 `json:"replicas"`

@@ -78,7 +83,8 @@ type ClusterCondition struct {
LastTransitionTime string `json:"lastTransitionTime,omitempty"`
}

func (ps *ClusterStatus) InitConditions() {
func (ps *ClusterStatus) Init() {
// Initialise conditions
conditionTypes := []ClusterConditionType{
ClusterConditionPodsReady,
ClusterConditionUpgrading,
@@ -90,6 +96,12 @@ func (ps *ClusterStatus) InitConditions() {
ps.setClusterCondition(*c)
}
}

// Set current cluster version in version history,
// so if the first upgrade fails we can rollback to this version
if ps.VersionHistory == nil && ps.CurrentVersion != "" {
ps.VersionHistory = []string{ps.CurrentVersion}
}
}

func (ps *ClusterStatus) SetPodsReadyConditionTrue() {
@@ -112,6 +124,16 @@ func (ps *ClusterStatus) SetUpgradingConditionFalse() {
ps.setClusterCondition(*c)
}

func (ps *ClusterStatus) SetUpgradedReplicasForComponent(componentName string, updatedReplicas int32, totalReplicas int32) {
_, upgradeCondition := ps.GetClusterCondition(ClusterConditionUpgrading)
if upgradeCondition != nil && upgradeCondition.Status == corev1.ConditionTrue {
reason := fmt.Sprintf("Upgrading component: %s", componentName)
message := fmt.Sprintf("Upgraded Replicas: %v, Total Replicas: %v", updatedReplicas, totalReplicas)
c := newClusterCondition(ClusterConditionUpgrading, corev1.ConditionTrue, reason, message)
ps.setClusterCondition(*c)
}
}

func (ps *ClusterStatus) SetErrorConditionTrue(reason, message string) {
c := newClusterCondition(ClusterConditionError, corev1.ConditionTrue, reason, message)
ps.setClusterCondition(*c)
@@ -122,6 +144,15 @@ func (ps *ClusterStatus) SetErrorConditionFalse() {
ps.setClusterCondition(*c)
}

func (ps *ClusterStatus) SetRollbackConditionTrue() {
c := newClusterCondition(ClusterConditionRollback, corev1.ConditionTrue, "", "")
ps.setClusterCondition(*c)
}

func (ps *ClusterStatus) SetRollbackConditionFalse() {
c := newClusterCondition(ClusterConditionRollback, corev1.ConditionFalse, "", "")
ps.setClusterCondition(*c)
}

func newClusterCondition(condType ClusterConditionType, status corev1.ConditionStatus, reason, message string) *ClusterCondition {
return &ClusterCondition{
Type: condType,
@@ -165,3 +196,30 @@ func (ps *ClusterStatus) setClusterCondition(newCondition ClusterCondition) {

ps.Conditions[position] = *existingCondition
}

func (ps *ClusterStatus) AddToVersionHistory(version string) {
lastIndex := len(ps.VersionHistory) - 1
// Guard against an empty history so the index below can never be -1.
if version != "" && (lastIndex < 0 || ps.VersionHistory[lastIndex] != version) {
ps.VersionHistory = append(ps.VersionHistory, version)
log.Printf("Updating version history adding version %v", version)
}
}

func (ps *ClusterStatus) GetLastVersion() (previousVersion string, err error) {
if ps.VersionHistory == nil {
return "", fmt.Errorf("ERROR: No previous cluster version found")
}
n := len(ps.VersionHistory)
return ps.VersionHistory[n-1], nil
}

func (ps *ClusterStatus) HasUpgradeFailed() bool {
_, errorCondition := ps.GetClusterCondition(ClusterConditionError)
if errorCondition == nil {
return false
}
if errorCondition.Status == corev1.ConditionTrue && errorCondition.Reason == "UpgradeFailed" {
return true
}
return false
}
2 changes: 1 addition & 1 deletion pkg/controller/pravega/pravega_segmentstore.go
@@ -44,7 +44,7 @@ func MakeSegmentStoreStatefulSet(pravegaCluster *api.PravegaCluster) *appsv1.Sta
Replicas: &pravegaCluster.Spec.Pravega.SegmentStoreReplicas,
PodManagementPolicy: appsv1.OrderedReadyPodManagement,
UpdateStrategy: appsv1.StatefulSetUpdateStrategy{
Type: appsv1.RollingUpdateStatefulSetStrategyType,
Type: appsv1.OnDeleteStatefulSetStrategyType,
},
Template: MakeSegmentStorePodTemplate(pravegaCluster),
Selector: &metav1.LabelSelector{
27 changes: 26 additions & 1 deletion pkg/controller/pravegacluster/pravegacluster_controller.go
@@ -138,11 +138,18 @@ func (r *ReconcilePravegaCluster) run(p *pravegav1alpha1.PravegaCluster) (err er
return fmt.Errorf("failed to sync cluster size: %v", err)
}

// Upgrade
err = r.syncClusterVersion(p)
if err != nil {
return fmt.Errorf("failed to sync cluster version: %v", err)
}

// Rollback
err = r.rollbackFailedUpgrade(p)
if err != nil {
return fmt.Errorf("Rollback attempt failed: %v", err)
}

err = r.reconcileClusterStatus(p)
if err != nil {
return fmt.Errorf("failed to reconcile cluster status: %v", err)
Expand All @@ -151,6 +158,7 @@ func (r *ReconcilePravegaCluster) run(p *pravegav1alpha1.PravegaCluster) (err er
}

func (r *ReconcilePravegaCluster) deployCluster(p *pravegav1alpha1.PravegaCluster) (err error) {

err = r.deployBookie(p)
if err != nil {
log.Printf("failed to deploy bookie: %v", err)
@@ -168,10 +176,12 @@ func (r *ReconcilePravegaCluster) deployCluster(p *pravegav1alpha1.PravegaCluste
log.Printf("failed to deploy segment store: %v", err)
return err
}

return nil
}

func (r *ReconcilePravegaCluster) deployController(p *pravegav1alpha1.PravegaCluster) (err error) {

pdb := pravega.MakeControllerPodDisruptionBudget(p)
controllerutil.SetControllerReference(p, pdb, r.scheme)
err = r.client.Create(context.TODO(), pdb)
@@ -251,6 +261,7 @@ func (r *ReconcilePravegaCluster) deploySegmentStore(p *pravegav1alpha1.PravegaC
}

func (r *ReconcilePravegaCluster) deployBookie(p *pravegav1alpha1.PravegaCluster) (err error) {

headlessService := pravega.MakeBookieHeadlessService(p)
controllerutil.SetControllerReference(p, headlessService, r.scheme)
err = r.client.Create(context.TODO(), headlessService)
@@ -439,7 +450,7 @@ func (r *ReconcilePravegaCluster) syncStatefulSetPvc(sts *appsv1.StatefulSet) er

func (r *ReconcilePravegaCluster) reconcileClusterStatus(p *pravegav1alpha1.PravegaCluster) error {

p.Status.InitConditions()
p.Status.Init()

expectedSize := util.GetClusterExpectedSize(p)
listOps := &client.ListOptions{
Expand Down Expand Up @@ -483,3 +494,17 @@ func (r *ReconcilePravegaCluster) reconcileClusterStatus(p *pravegav1alpha1.Prav
}
return nil
}

func (r *ReconcilePravegaCluster) rollbackFailedUpgrade(p *pravegav1alpha1.PravegaCluster) error {
if p.Status.HasUpgradeFailed() {
// start rollback to previous version
previousVersion, err := p.Status.GetLastVersion()
if err != nil {
return fmt.Errorf("Error retrieving previous cluster version %v", err)
}
log.Printf("Rolling back to last cluster version %v", previousVersion)
//Rollback cluster to previous version
return r.rollbackClusterVersion(p, previousVersion)
}
return nil
}