feat(CosmosFullNode): Self healing from crashlooping pod #206

Closed
wants to merge 4 commits into from
6 changes: 6 additions & 0 deletions api/v1/cosmosfullnode_types.go
@@ -84,6 +84,12 @@ type FullNodeSpec struct {
	// Used for debugging.
	// +optional
	InstanceOverrides map[string]InstanceOverridesSpec `json:"instanceOverrides"`

	// Strategies for automatic recovery of faults and errors.
	// SelfHealing is managed by a separate controller, SelfHealingController, in an effort to reduce
	// complexity of the CosmosFullNodeController.
	// +optional
	SelfHealing *SelfHealingSpec `json:"selfHealing"`
}

type FullNodeType string
38 changes: 38 additions & 0 deletions api/v1/self_healing_types.go
@@ -0,0 +1,38 @@
package v1

import "k8s.io/apimachinery/pkg/util/intstr"

// SelfHealingSpec is part of a CosmosFullNode but is managed by a separate controller, SelfHealingController.
// This is an effort to reduce complexity in the CosmosFullNodeController.
type SelfHealingSpec struct {
	// Determines when to destroy and recreate a replica (aka pod/pvc combo) that is crashlooping.
	// Occasionally, data may become corrupt and the chain exits and cannot restart.
	// This strategy only watches the pods' "node" containers running the `start` command.
	//
	// This pairs well with volumeClaimTemplate.autoDataSource and a ScheduledVolumeSnapshot resource.
	// With this pairing, a new PVC is created with a recent VolumeSnapshot.
	// Otherwise, ensure your snapshot, genesis, etc. creation are idempotent.
	// (e.g. chain.snapshotURL and chain.genesisURL have stable urls)
	//
	// +optional
	CrashLoopRecovery *CrashLoopRecovery `json:"crashLoopRecovery"`
}

type CrashLoopRecovery struct {
	// How many healthy pods are required to trigger destroying a crashlooping pod and pvc.
	// Set an integer or a percentage string such as 50%.
	// Example: If you set to 80% and there are 10 total pods, at least 8 must be healthy to trigger the recovery.
	// Fractional values are rounded down, but the minimum is 1.
	// It's not recommended to use this feature with only 1 replica.
	//
	// This setting attempts to minimize false positives in order to detect data corruption vs.
	// endless other reasons for unhealthy pods.
	// If the majority of pods are unhealthy, then there's probably something else wrong, and recreating
	// the pod and pvc will have no effect.
	HealthyThreshold intstr.IntOrString `json:"healthyThreshold"`

	// How many restarts to wait before destroying and recreating the unhealthy replica.
	// Defaults to 5.
	// +optional
	RestartThreshold int32 `json:"restartThreshold"`
}
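The rounding rule documented on HealthyThreshold (percentages round down, with a floor of 1) can be illustrated with a small standard-library sketch. This is not the controller's implementation: `requiredHealthy` is a hypothetical helper, and it takes the threshold as a plain string instead of `intstr.IntOrString` so that the example has no external dependencies.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// requiredHealthy is a hypothetical helper (not part of this PR). It resolves a
// threshold given either as an integer ("3") or a percentage ("80%") into the
// number of healthy pods required, following the documented rule:
// fractional values round down, and the result is never less than 1.
func requiredHealthy(threshold string, totalPods int) (int, error) {
	var n int
	if strings.HasSuffix(threshold, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(threshold, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %w", threshold, err)
		}
		if pct < 0 || pct > 100 {
			return 0, fmt.Errorf("percentage %q out of range 0-100", threshold)
		}
		// Integer division truncates, which rounds the fraction down.
		n = pct * totalPods / 100
	} else {
		v, err := strconv.Atoi(threshold)
		if err != nil {
			return 0, fmt.Errorf("invalid threshold %q: %w", threshold, err)
		}
		n = v
	}
	// Documented floor: at least one healthy pod is always required.
	if n < 1 {
		n = 1
	}
	return n, nil
}

func main() {
	for _, t := range []string{"80%", "50%", "3", "5%"} {
		n, err := requiredHealthy(t, 10)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("threshold %s with 10 pods requires %d healthy\n", t, n)
	}
}
```

With a real `intstr.IntOrString` value, the apimachinery helper `intstr.GetScaledValueFromIntOrPercent` with `roundUp` set to false should give the same round-down behavior; the floor of 1 would still need to be applied by the caller.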
41 changes: 41 additions & 0 deletions api/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

41 changes: 41 additions & 0 deletions config/crd/bases/cosmos.strange.love_cosmosfullnodes.yaml
@@ -1466,6 +1466,47 @@ spec:
                format: int32
                minimum: 0
                type: integer
              selfHealing:
                description: Strategies for automatic recovery of faults and errors.
                  SelfHealing is managed by a separate controller, SelfHealingController,
                  in an effort to reduce complexity of the CosmosFullNodeController.
                properties:
                  crashLoopRecovery:
                    description: "Determines when to destroy and recreate a replica
                      (aka pod/pvc combo) that is crashlooping. Occasionally, data
                      may become corrupt and the chain exits and cannot restart. This
                      strategy only watches the pods' \"node\" containers running
                      the `start` command. \n This pairs well with volumeClaimTemplate.autoDataSource
                      and a ScheduledVolumeSnapshot resource. With this pairing, a
                      new PVC is created with a recent VolumeSnapshot. Otherwise,
                      ensure your snapshot, genesis, etc. creation are idempotent.
                      (e.g. chain.snapshotURL and chain.genesisURL have stable urls)"
                    properties:
                      healthyThreshold:
                        anyOf:
                        - type: integer
                        - type: string
                        description: "How many healthy pods are required to trigger
                          destroying a crashlooping pod and pvc. Set an integer or
                          a percentage string such as 50%. Example: If you set to
                          80% and there are 10 total pods, at least 8 must be healthy
                          to trigger the recovery. Fractional values are rounded down,
                          but the minimum is 1. It's not recommended to use this feature
                          with only 1 replica. \n This setting attempts to minimize
                          false positives in order to detect data corruption vs. endless
                          other reasons for unhealthy pods. If the majority of pods
                          are unhealthy, then there's probably something else wrong,
                          and recreating the pod and pvc will have no effect."
                        x-kubernetes-int-or-string: true
                      restartThreshold:
                        description: How many restarts to wait before destroying and
                          recreating the unhealthy replica. Defaults to 5.
                        format: int32
                        type: integer
                    required:
                    - healthyThreshold
                    type: object
                type: object
              service:
                description: Configure Operator created services. A single rpc service
                  is created for load balancing api, grpc, rpc, etc. requests. This
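Taken together, the two fields describe a gate (enough healthy pods) followed by a per-pod check (enough restarts). The sketch below is one way to read that description, not the SelfHealingController's actual logic. The `podStatus` type and `podsToRecreate` function are hypothetical, and three details are assumptions: a pod counts as healthy when it is ready, an unset (zero) restart threshold falls back to the documented default of 5, and a pod qualifies once its restart count reaches the threshold.

```go
package main

import "fmt"

// defaultRestartThreshold mirrors the documented default for restartThreshold.
const defaultRestartThreshold int32 = 5

// podStatus is a hypothetical, simplified view of a pod's "node" container.
type podStatus struct {
	Name     string
	Ready    bool  // assumption: ready is treated as healthy
	Restarts int32 // restart count of the "node" container
}

// podsToRecreate is a hypothetical sketch of the recovery decision.
// It returns the names of pods whose pod and pvc would be recreated.
func podsToRecreate(pods []podStatus, requiredHealthy int, restartThreshold int32) []string {
	// Assumption: zero or negative means "unset", so use the default.
	if restartThreshold <= 0 {
		restartThreshold = defaultRestartThreshold
	}

	healthy := 0
	for _, p := range pods {
		if p.Ready {
			healthy++
		}
	}
	// Too few healthy pods suggests a wider problem than data corruption,
	// so nothing is recreated.
	if healthy < requiredHealthy {
		return nil
	}

	var names []string
	for _, p := range pods {
		// Assumption: the threshold is inclusive.
		if !p.Ready && p.Restarts >= restartThreshold {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := []podStatus{
		{Name: "pod-0", Ready: true},
		{Name: "pod-1", Ready: true},
		{Name: "pod-2", Ready: true},
		{Name: "pod-3", Ready: true},
		{Name: "pod-4", Ready: false, Restarts: 6},
	}
	fmt.Println(podsToRecreate(pods, 4, 0))
}
```

Keeping the healthy-pod gate ahead of the per-pod check matches the stated intent of healthyThreshold: when most pods are unhealthy, recreating one pod and its pvc is unlikely to help.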